Synergy with diffusion LLM (dLLM)
Recently, Inception Labs’ Mercury Coder and Google’s Gemini Diffusion have demonstrated advantages of dLLMs over autoregressive (AR) LLMs.
Open-sourced dLLMs such as LLaDA and Dream are attracting attention.
Various speed-up techniques, such as dKV-Cache and Fast-dLLM, are being developed.
GraphGPT’s pre-training objective, SMTP, is directly adopted from MaskGIT, which is deeply connected to the dLLM training objective.
We show that SMTP pre-training outperforms NTP (employed by AR LLMs) on most graph datasets and tasks.
Given the dominance of diffusion in images, audio, video, and graphs, it is a promising candidate for unifying the majority of modalities.
GraphGPT’s results imply that dLLMs can directly model serialized graph data as in GraphGPT. This could help dLLMs incorporate the graph modality easily, and would also benefit the graph research community, since such a model naturally serves as a graph foundation model.
Hiring
Campus recruiting is ongoing in fields such as Agent, LLM, MLLM, AIGC, AI4Sci, and more.
We are also seeking Ali-Star (阿里星) candidates in these fields.
Workplace: Hangzhou, China
Feel free to contact james.zqf@alibaba-inc.com for more information.
Update:
04/07/2026
Check CHANGELOG.md for details. Added flex_attention with sequence packing for efficient multi-sample training, achieving a significant speedup.
03/18/2026
Check CHANGELOG.md for details.
Model decomposition: Monolithic modeling_graphgpt.py split into modeling_common.py, modeling_helpers.py, modeling_pretrain.py, modeling_finetune.py, and configuration_graphgpt.py. Backward-compatible imports preserved.
Data source generalization: Registry-driven factory pattern (DatasetSpec + read_graph_dataset()) replaces monolithic data_sources.py. Adding new datasets requires only a spec definition. 80% line reduction.
Unified training pipeline: Strategy-based TrainingPipeline with TrainingMode ABC eliminates ~240 lines of duplicated code between pre-training and fine-tuning scripts. Entry scripts reduced to ~18 lines each.
Externalized model configuration with structured YAML (configs/model/base.yaml) and dataclass configs.
12/23/2025
v0.6.1 released. Check CHANGELOG.md for details.
Config code refactoring for the edge-level task ogbl-ppa.
11/20/2025
v0.6.0 released. Check CHANGELOG.md for details.
Generation functionality added, analogous to diffusion LLMs.
Code refactoring for graph-level tasks. Edge-/node-level task refactoring remains to be done; contributions are welcome.
05/08/2025
v0.5.0 released. Check CHANGELOG.md for details.
Four checkpoints for PCQM4M-v2 are available on ModelScope, including pre-trained and fine-tuned models.
Achieved SOTA or close-to-SOTA results on 3 large-scale OGB datasets:
PCQM4M-v2 (no 3D): 0.0802 (previous SOTA 0.0821)
PCQM4M-v2 (use 3D): 0.0709 (current SOTA 0.0683)
ogbl-ppa: 76.55 (previous SOTA 73.74)
ogbl-citation2: 93.05 (previous SOTA 90.72)
Paper accepted by ICML 2025; updated to v2 on arXiv.
10/13/2024
v0.4.0 released. Check CHANGELOG.md for details.
Achieved SOTA on 3 large-scale OGB datasets:
PCQM4M-v2 (no 3D): 0.0802 (previous SOTA 0.0821)
ogbl-ppa: 68.76 (previous SOTA 65.24)
ogbl-citation2: 91.15 (previous SOTA 90.72)
08/18/2024
v0.3.1 released. Check CHANGELOG.md for details.
07/09/2024
v0.3.0 released.
03/19/2024
v0.2.0 released.
Implemented permute_nodes for graph-level map-style datasets to increase the variation of Eulerian paths, yielding better and more robust results.
Added StackedGSTTokenizer so that semantic tokens (i.e., node/edge attributes) can be stacked together with structural tokens, greatly reducing sequence length.
Refactored code.
01/23/2024
v0.1.1 released; fixed bugs in the common-io package.
01/03/2024
Initial code release.
Future Directions
Scaling Law: What’s the Scaling Limit of GraphGPT Models?
GPT models trained on text data can scale to hundreds of billions of parameters while continually improving their capabilities.
Text data provides trillions of tokens with high complexity, embedding rich social and natural knowledge.
In contrast, graph data without node/edge attributes contains only structural information, which is far more limited than text. Much of the hidden information (e.g., degrees, substructure counts, etc.) in graphs can be calculated exactly using tools like NetworkX. Consequently, structural information alone may not support scaling models to billions of parameters.
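As a concrete illustration, such structural quantities are exactly computable. The sketch below counts degrees and triangles on a toy graph with only the standard library (libraries like NetworkX provide these directly, e.g. G.degree() and nx.triangles; the 5-node graph here is an arbitrary example):

```python
from itertools import combinations

# A hypothetical 5-node graph given as an edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

# Build an adjacency map.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Degrees are read off directly -- no learning required.
degrees = {n: len(nbrs) for n, nbrs in adj.items()}

# Triangles: brute-force over node triples (fine for toy graphs;
# NetworkX's nx.triangles scales this to large graphs).
triangles = sum(
    1
    for a, b, c in combinations(sorted(adj), 3)
    if b in adj[a] and c in adj[a] and c in adj[b]
)

print(degrees)    # {0: 2, 1: 2, 2: 3, 3: 2, 4: 1}
print(triangles)  # 1
```

Because a deterministic tool recovers these values exactly, a model trained only on attribute-free structure has less to learn from than one trained on text.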
Preliminary experiments on large-scale graph datasets show that GraphGPT scales to 200M+ parameters with performance gains but plateaus beyond this. While insufficient experimentation could be a factor, inherent limitations in graph data complexity may contribute.
Large graph datasets (e.g., one massive graph or numerous small graphs) with node/edge attributes might provide sufficient information for training large GraphGPT models. However, diverse datasets may be necessary to train a universal model.
A key challenge is designing a universal tokenizer for heterogeneous node/edge attributes across datasets.
High-Quality Graph Data: What Defines High-Quality Graph Data for Training General-Purpose GraphGPT?
Example: Training a model for molecule understanding/generation tasks
Adding ZINC (4.6M) and CEPDB (2.3M) datasets during pretraining yielded no improvement on the PCQM4M-v2 HOMO-LUMO gap prediction task. Potential reasons:
Semantics: Chemical rules are straightforward (e.g., carbon forms 4 bonds, nitrogen forms 3). Satisfying bond counts enables valid molecule generation.
These simple structural/semantic rules allow medium-sized models to learn effectively from modest datasets. Pretraining small/medium/base/large models on 3.7M molecules resulted in similar loss values, suggesting diminishing returns from scaling.
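Such bond-count rules are mechanically checkable. Below is a toy sketch (the valence table is abbreviated and the helper is hypothetical; real toolkits such as RDKit handle valence properly):

```python
# Abbreviated valence table: each atom's bond orders must sum to this value.
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def is_valid(atoms, bonds):
    """atoms: list of element symbols; bonds: (i, j, order) triples."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return all(used[k] == VALENCE[a] for k, a in enumerate(atoms))

# Formaldehyde (H2C=O): one C=O double bond plus two C-H single bonds.
print(is_valid(["C", "O", "H", "H"], [(0, 1, 2), (0, 2, 1), (0, 3, 1)]))  # True
```

That a few lines suffice to encode the core constraint suggests why medium-sized models saturate on this domain.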
Training a universal graph structure understanding model
Should training data include real-world graphs (social/citation networks) or synthetic graphs (Erdős–Rényi)?
Pretraining on synthetic graphs improves structural understanding but shows instability. Performance likely depends on alignment between pretraining and fine-tuning graph distributions (e.g., node/edge counts).
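On the synthetic side, Erdős–Rényi graphs are easy to sample. A minimal dependency-free sketch (NetworkX's nx.erdos_renyi_graph is the standard tool; the corpus sizes below are arbitrary):

```python
import random

def erdos_renyi(n, p, seed=None):
    """Sample a G(n, p) graph: each of the n*(n-1)/2 possible
    undirected edges is included independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

# A hypothetical synthetic pre-training corpus: many small random
# graphs with varied sizes and densities.
corpus = [erdos_renyi(n, p=0.3, seed=s) for s, n in enumerate(range(5, 50, 5))]
```

Varying n and p during generation is one way to reduce the distribution mismatch with fine-tuning graphs mentioned above.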
Key Question: How can GraphGPT achieve universal understanding of arbitrary graph structures?
This ties back to scaling laws: Identifying rich, diverse graph data is critical for scaling GraphGPT to handle varied tasks.
Few-Shot Learning: Can GraphGPT Achieve Few-Shot Capability?
Designing training data for few-shot learning
Preliminary tests on PCQM4M-v2 show no few-shot ability, but this could stem from:
Model size: The base model (~100M parameters) may be too small.
Data volume: 3.7M molecules may offer insufficient tokens for robust learning.
Data format: Current pretraining formats may not encourage few-shot generalization.
Overview:
We introduce GraphGPT, a novel self-supervised generative pre-trained model for graph learning based on the Graph Eulerian Transformer (GET).
First, we propose GET, which combines a standard transformer encoder or decoder architecture with an innovative graph-to-sequence transformation method.
This method converts graphs or sampled subgraphs into sequences of tokens representing nodes, edges, and attributes in a reversible manner using Eulerian paths.
We pre-train GET using either of two self-supervised tasks: next-token prediction (NTP)
and scheduled masked-token prediction (SMTP).
The pre-trained model is then fine-tuned for downstream tasks such as graph-, edge-, and node-level prediction.
Despite its simplicity, GraphGPT achieves performance comparable to or surpassing state-of-the-art methods on multiple large-scale Open Graph Benchmark (OGB) datasets.
It demonstrates exceptional results on the molecular property prediction dataset PCQM4Mv2 and the protein-protein interaction dataset ogbl-ppa.
Notably, generative pretraining enables scaling GraphGPT to 2 billion parameters while maintaining performance gains — a breakthrough that overcomes the scalability
limitations of traditional Graph Neural Networks (GNNs) and prior graph transformers (GTs).
To advance research in graph foundation models and facilitate scientific discovery in chemistry, materials science, and related fields,
we will release the source code and pre-trained checkpoints.
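To make the two objectives concrete, here is a minimal sketch of how training targets could be formed under each (the cosine schedule follows MaskGIT; the token handling and mask id are illustrative assumptions, not GraphGPT's exact recipe):

```python
import math
import random

MASK = -1  # hypothetical id of the [MASK] token

def ntp_pairs(tokens):
    """Next-token prediction: each prefix predicts the following token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def smtp_sample(tokens, rng):
    """Scheduled masked-token prediction: draw a mask ratio from a
    cosine schedule (as in MaskGIT), mask that many positions, and
    predict the masked tokens from the remaining context."""
    ratio = math.cos(math.pi / 2 * rng.random())   # in (0, 1]
    k = max(1, math.ceil(ratio * len(tokens)))
    masked = set(rng.sample(range(len(tokens)), k))
    corrupted = [MASK if i in masked else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked}
    return corrupted, targets

seq = [7, 3, 9, 1, 5]
print(ntp_pairs(seq)[0])  # ([7], 3)
corrupted, targets = smtp_sample(seq, random.Random(0))
```

NTP always predicts left-to-right, while SMTP sees bidirectional context with a varying corruption level, which is the link to dLLM training noted elsewhere in this README.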
Graph to Sequences
After converting Eulerized graphs to sequences, there are several ways to attach node and edge attributes to
the sequences. We name these methods short, long, and prolonged.
Given a graph, we first Eulerize it, then turn it into an equivalent sequence, and finally re-index the nodes
cyclically.
Assume the graph has one node attribute and one edge attribute; the short, long, and prolonged methods
are then shown above.
In the figures above, n1, n2, and e1 represent the tokens of node and edge attributes, and [p] represents the
padding token.
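The reversible graph-to-sequence step can be illustrated with a minimal Eulerian-path sketch (Hierholzer's algorithm; this toy version assumes the path already exists and ignores the attribute tokens discussed above):

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm: return a node sequence that traverses
    every edge exactly once (assumes such a path exists)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # Start at an odd-degree node if one exists, else anywhere.
    start = next((n for n in adj if len(adj[n]) % 2 == 1), next(iter(adj)))
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if adj[v]:
            u = adj[v].pop()
            adj[u].remove(v)   # consume the edge in both directions
            stack.append(u)
        else:
            path.append(stack.pop())
    return path[::-1]

# A triangle with a tail: nodes 2 and 3 have odd degree, so a path exists.
print(eulerian_path([(0, 1), (1, 2), (2, 0), (2, 3)]))  # [2, 0, 1, 2, 3]
```

The resulting node sequence visits every edge once, so the original graph can be recovered from the sequence, which is what makes the transformation reversible.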
Cyclical node re-index
A straightforward way to re-index the sequence of nodes is to start from 0 and increment by 1. This way, tokens
with small indices are trained sufficiently while those with large indices are not. To overcome this, we propose
cyclical re-indexing, which starts from a random number in a given range, say [0, 255], and increments by 1;
after hitting the boundary (e.g., 255), the next node index wraps around to 0.
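A minimal sketch of cyclical re-indexing (the range [0, 255] comes from the text; the helper name is illustrative):

```python
import random

def cyclical_reindex(num_nodes, vocab_size=256, rng=random):
    """Assign node indices starting at a random offset and wrapping
    at vocab_size, so every index token gets seen during training."""
    start = rng.randrange(vocab_size)
    return [(start + i) % vocab_size for i in range(num_nodes)]

# With a start of 254 and 4 nodes, the indices wrap: 254, 255, 0, 1.
```

Over many training samples the random offset spreads gradient updates uniformly across all index tokens.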
Datasets
The datasets are downloaded using the python package ogb.
When you run scripts in ./examples, the dataset will be automatically downloaded.
However, the PCQM4M-v2 dataset is huge, and downloading and
preprocessing it this way might be problematic. We suggest running cd ./src/utils/ and then python dataset_utils.py
to download and preprocess the dataset separately.
Run
Pre-train: modify the parameters in ./examples/graph_lvl/pcqm4m_v2_pretrain.sh (e.g., dataset_name, model_name,
batch_size, workerCount, etc.), and then run ./examples/graph_lvl/pcqm4m_v2_pretrain.sh to pre-train
the model on the PCQM4M-v2 dataset.
To run a toy example, run ./examples/toy_examples/reddit_pretrain.sh directly.
Fine-tune: modify the parameters in ./examples/graph_lvl/pcqm4m_v2_supervised.sh (e.g., dataset_name, model_name,
batch_size, workerCount, pretrain_cpt, etc.), and then run ./examples/graph_lvl/pcqm4m_v2_supervised.sh
to fine-tune on downstream tasks.
To run a toy example, run ./examples/toy_examples/reddit_supervised.sh directly.
License
Released under the MIT license (see LICENSE):
Ali-GraphGPT-project is an AI project on training large-scale transformers with graph datasets,
developed by Alibaba and licensed under the MIT License.
GraphGPT: Generative Pre-trained Graph Eulerian Transformer (ICML 2025)
This repository is the official implementation of “GraphGPT: Generative Pre-trained Graph Eulerian Transformer” in PyTorch.
Results (updated @ 2025-05-15)
Graph-level tasks: PCQM4M-v2 and ogbg-molpcba
Edge-level tasks: ogbl-ppa and ogbl-citation2
Node-level tasks: ogbn-proteins and ogbn-products
Installation
Project Structure
Code Norm
Pre-commit
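The config content referenced below was not included in this page; a typical minimal .pre-commit-config.yaml for a Python project might look like the following (the hooks shown are common choices and an assumption, not necessarily this project's actual configuration):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
```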
.pre-commit-config.yaml: create the file with the following content for Python.
pre-commit install: install pre-commit into your git hooks. pre-commit install should always be the first thing you do.
pre-commit run --all-files: run all pre-commit hooks on the repository.
pre-commit autoupdate: update your hooks to the latest versions automatically.
git commit -n: disable pre-commit checks for a particular commit.
Citation
If you find this work useful, please kindly cite the following paper:
@inproceedings{zhao2025graphgpt,
  title={GraphGPT: Generative Pre-trained Graph Eulerian Transformer},
  author={Qifang Zhao and Weidong Ren and Tianyu Li and Hong Liu and Xingsheng He and Xiaoxiao Xu},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=4RdzeucFmW}
}
Contact
Qifang Zhao (james.zqf@alibaba-inc.com)
Sincerely appreciate your suggestions on our work!