GraphStorm 0.3: Scalable, multi-task studying on graphs with user-friendly APIs

GraphStorm is a low-code enterprise graph machine studying (GML) framework to construct, practice, and deploy graph ML options on advanced enterprise-scale graphs in days as a substitute of months. With GraphStorm, you’ll be able to construct options that straight take into consideration the construction of relationships or interactions between billions of entities, that are inherently embedded in most real-world information, together with fraud detection eventualities, suggestions, neighborhood detection, and search/retrieval issues.

Right now, we’re launching GraphStorm 0.3, including native assist for multi-task studying on graphs. Particularly, GraphStorm 0.3 permits you to outline a number of coaching targets on completely different nodes and edges inside a single coaching loop. As well as, GraphStorm 0.3 provides new APIs to customise GraphStorm pipelines: you now solely want 12 strains of code to implement a customized node classification coaching loop. That will help you get began with the brand new API, we’ve revealed two Jupyter pocket book examples: one for node classification, and one for a hyperlink prediction process. We additionally launched a complete research of co-training language fashions (LM) and graph neural networks (GNN) for big graphs with wealthy textual content options utilizing the Microsoft Tutorial Graph (MAG) dataset from our KDD 2024 paper. The research showcases the efficiency and scalability of GraphStorm on textual content wealthy graphs and one of the best practices of configuring GML coaching loops for higher efficiency and effectivity.

Native assist for multi-task studying on graphs

Many enterprise functions have graph information related to a number of duties on completely different nodes and edges. For instance, retail organizations need to conduct fraud detection on each sellers and consumers. Scientific publishers need to discover extra associated works to quote of their papers and wish to pick the best topic for his or her publication to be discoverable. To higher mannequin such functions, prospects have requested us to assist multi-task studying on graphs.

GraphStorm 0.3 helps multi-task studying on graphs with six most typical duties: node classification, node regression, edge classification, edge regression, hyperlink prediction, and node function reconstruction. You possibly can specify the coaching targets by means of a YAML configuration file. For instance, a scientific writer can use the next YAML configuration to concurrently outline a paper topic classification process on paper nodes and a hyperlink prediction process on paper-citing-paper edges for the scientific writer use case:

model: 1.0
    gsf:
        primary: # primary settings of the spine GNN mannequin
            ...
        ...
        multi_task_learning:
            - node_classification:         # outline a node classification process for paper topic prediction.
                target_ntype: "paper"      # the paper nodes are the coaching targets.
                label_field: "label_class" # the node function "label_class" accommodates the coaching labels.
				mask_fields:
                    - "train_mask_class"   # practice masks is known as as train_mask_class.
                    - "val_mask_class"     # validation masks is known as as val_mask_class.
                    - "test_mask_class"    # check masks is known as as test_mask_class.
                num_classes: 10            # There are whole 10 completely different lessons (topic) to foretell.
                task_weight: 1.0           # The duty weight is 1.0.
                
            - link_prediction:                # outline a hyperlink prediction paper quotation suggestion.
                num_negative_edges: 4         # Pattern 4 destructive edges for every optimistic edge throughout coaching
                num_negative_edges_eval: 100  # Pattern 100 destructive edges for every optimistic edge throughout analysis
                train_negative_sampler: joint # Share the destructive edges between optimistic edges (to speedup coaching)
                train_etype:
                    - "paper,citing,paper"    # The goal edge sort for hyperlink prediction coaching is "paper, citing, paper"
                mask_fields:
                    - "train_mask_lp"         # practice masks is known as as train_mask_lp.
                    - "val_mask_lp"           # validation masks is known as as val_mask_lp.
                    - "test_mask_lp"          # check masks is known as as test_mask_lp.
                task_weight: 0.5              # The duty weight is 0.5.

For extra particulars about the way to run graph multi-task studying with GraphStorm, seek advice from Multi-task Studying in GraphStorm in our documentation.

New APIs to customise GraphStorm pipelines and parts

Since GraphStorm’s launch in early 2023, prospects have primarily used its command line interface (CLI), which abstracts away the complexity of the graph ML pipeline so that you can rapidly construct, practice, and deploy fashions utilizing frequent recipes. Nonetheless, prospects are telling us that they need an interface that enables them to customise the coaching and inference pipeline of GraphStorm to their particular necessities extra simply. Based mostly on buyer suggestions for the experimental APIs we launched in GraphStorm 0.2, GraphStorm 0.3 introduces refactored graph ML pipeline APIs. With the brand new APIs, you solely want 12 strains of code to outline a customized node classification coaching pipeline, as illustrated by the next instance:

import graphstorm as gs
gs.initialize()

acm_data = gs.dataloading.GSgnnData(part_config='./acm_gs_1p/acm.json')

train_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_train_set(ntypes=['paper']), fanout=[20, 20], batch_size=64)
val_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_val_set(ntypes=['paper']), fanout=[100, 100], batch_size=256, train_task=False)
test_dataloader = gs.dataloading.GSgnnNodeDataLoader(dataset=acm_data, target_idx=acm_data.get_node_test_set(ntypes=['paper']), fanout=[100, 100], batch_size=256, train_task=False)

mannequin = RgcnNCModel(g=acm_data.g, num_hid_layers=2, hid_size=128, num_classes=14)
evaluator = gs.eval.GSgnnClassificationEvaluator(eval_frequency=100)

coach = gs.coach.GSgnnNodePredictionTrainer(mannequin)
coach.setup_evaluator(evaluator)

coach.match(train_dataloader, val_dataloader, test_dataloader, num_epochs=5)

That will help you get began with the brand new APIs, we even have launched new Jupyter pocket book examples in our Documentation and Tutorials web page.

Complete research of LM+GNN for big graphs with wealthy textual content options

Many enterprise functions have graphs with textual content options. In retail search functions, for instance, buying log information gives insights on how text-rich product descriptions, search queries, and buyer habits are associated. Foundational massive language fashions (LLMs) alone aren’t appropriate to mannequin such information as a result of the underlying information distributions and relationships don’t correspond to what LLMs study from their pre-training information corpuses. GML, then again, is nice for modeling associated information (graphs) however till now, GML practitioners needed to manually mix their GML fashions with LLMs to mannequin textual content options and get one of the best efficiency for his or her use circumstances. Particularly when the underlying graph dataset was massive, this handbook work was difficult and time-consuming.

In GraphStorm 0.2, GraphStorm launched built-in methods to coach language fashions (LMs) and GNN fashions collectively effectively at scale on large text-rich graphs. Since then, prospects have been asking us for steering on how GraphStorm’s LM+GNN methods needs to be employed to optimize efficiency. To deal with this, with GraphStorm 0.3, we launched a LM+GNN benchmark utilizing the big graph dataset, Microsoft Tutorial Graph (MAG), on two normal graph ML duties: node classification and hyperlink prediction. The graph dataset is a heterogeneous graph, accommodates tons of of thousands and thousands of nodes and billions of edges, and nearly all of nodes are attributed with wealthy textual content options. The detailed statistics of the datasets are proven within the following desk.

Dataset	Num. of nodes	Num. of edges	Num. of node/edge sorts	Num. of nodes in NC coaching set	Num. of edges in LP coaching set	Num. of nodes with text-features
MAG	484,511,504	7,520,311,838	4/4	28,679,392	1,313,781,772	240,955,156

We benchmark two major LM-GNN strategies in GraphStorm: pre-trained BERT+GNN, a baseline technique that’s extensively adopted, and fine-tuned BERT+GNN, launched by GraphStorm builders in 2022. With the pre-trained BERT+GNN technique, we first use a pre-trained BERT mannequin to compute embeddings for node textual content options after which practice a GNN mannequin for prediction. With the fine-tuned BERT+GNN technique, we initially fine-tune the BERT fashions on the graph information and use the ensuing fine-tuned BERT mannequin to compute embeddings which can be then used to coach a GNN fashions for prediction. GraphStorm gives other ways to fine-tune the BERT fashions, relying on the duty sorts. For node classification, we fine-tune the BERT mannequin on the coaching set with the node classification duties; for hyperlink prediction, we fine-tune the BERT mannequin with the hyperlink prediction duties. Within the experiment, we use 8 r5.24xlarge situations for information processing and use 4 g5.48xlarge situations for mannequin coaching and inference. The fine-tuned BERT+GNN method has as much as 40% higher efficiency (hyperlink prediction on MAG) in comparison with pre-trained BERT+GNN.

The next desk exhibits the mannequin efficiency of the 2 strategies and the general computation time of the entire pipeline ranging from information processing and graph development. NC means node classification and LP means hyperlink prediction. LM Time Value means the time spent on computing BERT embeddings and the time spent on fine-tuning the BERT fashions for pre-trained BERT+GNN and fine-tuned BERT+GNN, respectively.

Dataset	Job	Knowledge processing time	Goal	Pre-trained BERT + GNN			Effective-tuned BERT + GNN
Dataset	Job	Knowledge processing time	Goal	LM Time Value	One epoch time	Metric	LM Time Value	One epoch time	Metric
MAG	NC	553 min	paper topic	206 min	135 min	Acc:0.572	1423 min	137 min	Acc:0.633
MAG	LP	553 min	cite	198 min	2195 min	Mrr: 0.487	4508 min	2172 min	Mrr: 0.684

We additionally benchmark GraphStorm on massive artificial graphs to showcase its scalability. We generate three artificial graphs with 1 billion, 10 billion, and 100 billion edges. The corresponding coaching set sizes are 8 million, 80 million, and 800 million, respectively. The next desk exhibits the computation time of graph preprocessing, graph partition, and mannequin coaching. General, GraphStorm allows graph development and mannequin coaching on 100 billion scale graphs inside hours!

Graph Measurement	Knowledge pre-process		Graph Partition		Mannequin Coaching
Graph Measurement	# situations	Time	# situations	Time	# situations	Time
1B	4	19 min	4	8 min	4	1.5 min
10B	8	31 min	8	41 min	8	8 min
100B	16	61 min	16	416 min	16	50 min

Extra benchmark particulars and outcomes can be found in our KDD 2024 paper.

Conclusion

GraphStorm 0.3 is revealed below the Apache-2.0 license that can assist you sort out your large-scale graph ML challenges, and now affords native assist for multi-task studying and new APIs to customise pipelines and different parts of GraphStorm. Confer with the GraphStorm GitHub repository and documentation to get began.

In regards to the Creator

Xiang Tune is a senior utilized scientist at AWS AI Analysis and Training (AIRE), the place he develops deep studying frameworks together with GraphStorm, DGL and DGL-KE. He led the event of Amazon Neptune ML, a brand new functionality of Neptune that makes use of graph neural networks for graphs saved in graph database. He’s now main the event of GraphStorm, an open-source graph machine studying framework for enterprise use circumstances. He obtained his Ph.D. in pc programs and structure on the Fudan College, Shanghai, in 2014.

Jian Zhang is a senior utilized scientist who has been utilizing machine studying methods to assist prospects resolve varied issues, resembling fraud detection, ornament picture era, and extra. He has efficiently developed graph-based machine studying, significantly graph neural community, options for patrons in China, USA, and Singapore. As an enlightener of AWS’s graph capabilities, Zhang has given many public displays in regards to the GNN, the Deep Graph Library (DGL), Amazon Neptune, and different AWS providers.

Florian Saupe is a Principal Technical Product Supervisor at AWS AI/ML analysis supporting science groups just like the graph machine studying group, and ML Programs groups engaged on massive scale distributed coaching, inference, and fault resilience. Earlier than becoming a member of AWS, Florian lead technical product administration for automated driving at Bosch, was a technique marketing consultant at McKinsey & Firm, and labored as a management programs/robotics scientist – a area through which he holds a phd.