Synergy with diffusion LLM (dLLM)
Recently, Inception Labs’ Mercury Coder and Google’s Gemini Diffusion have demonstrated advantages of dLLMs over autoregressive (AR) LLMs.
Open-sourced dLLMs such as LLaDA and Dream are attracting attention.
Various speed-up techniques, such as dKV-Cache and Fast-dLLM, are being developed.
GraphGPT’s pre-training objective, SMTP, is directly adopted from MaskGIT, which is deeply connected to the dLLM training objective.
We show that SMTP pre-training outperforms NTP (employed by AR LLMs) on most graph datasets and tasks.
Given the dominance of diffusion in images, audio, video, and graphs, it is a promising candidate for unifying the majority of modalities.
GraphGPT’s results imply that dLLMs can directly model serialized graph data as in GraphGPT. This could help dLLMs incorporate the graph modality easily, and would also benefit the graph research community, since such a model naturally serves as a graph foundation model.
Hiring
Campus recruiting is ongoing in fields such as Agent, LLM, MLLM, AIGC, AI4Sci, and more.
We are also seeking Ali-Star (阿里星) candidates in these fields.
Workplace: Hangzhou, China
Feel free to contact james.zqf@alibaba-inc.com for more information.
Update:
04/07/2026
Check CHANGELOG.md for details. Added flex_attention with sequence packing for efficient multi-sample training, achieving a significant speedup.
03/18/2026
Check CHANGELOG.md for details.
Model decomposition: Monolithic modeling_graphgpt.py split into modeling_common.py, modeling_helpers.py, modeling_pretrain.py, modeling_finetune.py, and configuration_graphgpt.py. Backward-compatible imports preserved.
Data source generalization: Registry-driven factory pattern (DatasetSpec + read_graph_dataset()) replaces monolithic data_sources.py. Adding new datasets requires only a spec definition. 80% line reduction.
Unified training pipeline: Strategy-based TrainingPipeline with TrainingMode ABC eliminates ~240 lines of duplicated code between pre-training and fine-tuning scripts. Entry scripts reduced to ~18 lines each.
Externalized model configuration with structured YAML (configs/model/base.yaml) and dataclass configs.
12/23/2025
v0.6.1 released. Check CHANGELOG.md for details.
Config code refactoring for the edge-level task ogbl-ppa.
11/20/2025
v0.6.0 released. Check CHANGELOG.md for details.
Generation functionality added, analogous to diffusion LLMs.
Code refactoring for graph-level tasks. Edge-/node-level task refactoring remains to be done; contributions are welcome.
05/08/2025
v0.5.0 released. Check CHANGELOG.md for details.
Four checkpoints for PCQM4M-v2 are available on ModelScope, including pre-trained and fine-tuned models.
Achieved SOTA or close-to-SOTA results on 3 large-scale OGB datasets:
PCQM4M-v2 (no 3D): 0.0802 (previous SOTA 0.0821)
PCQM4M-v2 (use 3D): 0.0709 (current SOTA 0.0683)
ogbl-ppa: 76.55 (previous SOTA 73.74)
ogbl-citation2: 93.05 (previous SOTA 90.72)
Paper accepted by ICML 2025; updated to v2 on arXiv.
10/13/2024
v0.4.0 released. Check CHANGELOG.md for details.
Achieved SOTA on 3 large-scale OGB datasets:
PCQM4M-v2 (no 3D): 0.0802 (previous SOTA 0.0821)
ogbl-ppa: 68.76 (previous SOTA 65.24)
ogbl-citation2: 91.15 (previous SOTA 90.72)
08/18/2024
v0.3.1 released. Check CHANGELOG.md for details.
07/09/2024
v0.3.0 released.
03/19/2024
v0.2.0 released.
Implemented permute_nodes for graph-level map-style datasets to increase the variation of Eulerian paths, yielding better and more robust results.
Added StackedGSTTokenizer so that semantic tokens (i.e., node/edge attributes) can be stacked together with structural tokens, greatly reducing sequence length.
Refactored code.
01/23/2024
v0.1.1 released; fixed bugs in the common-io package.
01/03/2024
Initial code release.
Future Directions
Scaling Law: What’s the Scaling Limit of GraphGPT Models?
GPT models trained on text data can scale to hundreds of billions of parameters while continually improving their capabilities.
Text data provides trillions of tokens with high complexity, embedding rich social and natural knowledge.
In contrast, graph data without node/edge attributes contains only structural information, which is far more limited than text. Much of the hidden information (e.g., degrees, substructure counts, etc.) in graphs can be calculated exactly using tools like NetworkX. Consequently, structural information alone may not support scaling models to billions of parameters.
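As a concrete illustration, such structural quantities are exactly computable. The sketch below counts degrees and triangles on a toy graph with only the standard library (libraries like NetworkX provide these directly, e.g. G.degree() and nx.triangles; the 5-node graph here is an arbitrary example):

```python
from itertools import combinations

# A hypothetical 5-node graph given as an edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

# Build an adjacency map.
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Degrees are read off directly -- no learning required.
degrees = {n: len(nbrs) for n, nbrs in adj.items()}

# Triangles: brute-force over node triples (fine for toy graphs;
# NetworkX's nx.triangles scales this to large graphs).
triangles = sum(
    1
    for a, b, c in combinations(sorted(adj), 3)
    if b in adj[a] and c in adj[a] and c in adj[b]
)

print(degrees)    # {0: 2, 1: 2, 2: 3, 3: 2, 4: 1}
print(triangles)  # 1
```

Because a deterministic tool recovers these values exactly, a model trained only on attribute-free structure has less to learn from than one trained on text.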
Preliminary experiments on large-scale graph datasets show that GraphGPT scales to 200M+ parameters with performance gains but plateaus beyond this. While insufficient experimentation could be a factor, inherent limitations in graph data complexity may contribute.
Large graph datasets (e.g., one massive graph or numerous small graphs) with node/edge attributes might provide sufficient information for training large GraphGPT models. However, diverse datasets may be necessary to train a universal model.
A key challenge is designing a universal tokenizer for heterogeneous node/edge attributes across datasets.
High-Quality Graph Data: What Defines High-Quality Graph Data for Training General-Purpose GraphGPT?
Example: Training a model for molecule understanding/generation tasks
Adding ZINC (4.6M) and CEPDB (2.3M) datasets during pretraining yielded no improvement on the PCQM4M-v2 HOMO-LUMO gap prediction task. Potential reasons:
Semantics: Chemical rules are straightforward (e.g., carbon forms 4 bonds, nitrogen forms 3). Satisfying bond counts enables valid molecule generation.
These simple structural/semantic rules allow medium-sized models to learn effectively from modest datasets. Pretraining small/medium/base/large models on 3.7M molecules resulted in similar loss values, suggesting diminishing returns from scaling.
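Such bond-count rules are mechanically checkable. Below is a toy sketch (the valence table is abbreviated and the helper is hypothetical; real toolkits such as RDKit handle valence properly):

```python
# Abbreviated valence table: each atom's bond orders must sum to this value.
VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}

def is_valid(atoms, bonds):
    """atoms: list of element symbols; bonds: (i, j, order) triples."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return all(used[k] == VALENCE[a] for k, a in enumerate(atoms))

# Formaldehyde (H2C=O): one C=O double bond plus two C-H single bonds.
print(is_valid(["C", "O", "H", "H"], [(0, 1, 2), (0, 2, 1), (0, 3, 1)]))  # True
```

That a few lines suffice to encode the core constraint suggests why medium-sized models saturate on this domain.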
Training a universal graph structure understanding model
Should training data include real-world graphs (social/citation networks) or synthetic graphs (Erdős–Rényi)?
Pretraining on synthetic graphs improves structural understanding but shows instability. Performance likely depends on alignment between pretraining and fine-tuning graph distributions (e.g., node/edge counts).
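On the synthetic side, Erdős–Rényi graphs are easy to sample. A minimal dependency-free sketch (NetworkX's nx.erdos_renyi_graph is the standard tool; the corpus sizes below are arbitrary):

```python
import random

def erdos_renyi(n, p, seed=None):
    """Sample a G(n, p) graph: each of the n*(n-1)/2 possible
    undirected edges is included independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

# A hypothetical synthetic pre-training corpus: many small random
# graphs with varied sizes and densities.
corpus = [erdos_renyi(n, p=0.3, seed=s) for s, n in enumerate(range(5, 50, 5))]
```

Varying n and p during generation is one way to reduce the distribution mismatch with fine-tuning graphs mentioned above.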
Key Question: How can GraphGPT achieve universal understanding of arbitrary graph structures?
This ties back to scaling laws: Identifying rich, diverse graph data is critical for scaling GraphGPT to handle varied tasks.
Few-Shot Learning: Can GraphGPT Achieve Few-Shot Capability?
Designing training data for few-shot learning
Preliminary tests on PCQM4M-v2 show no few-shot ability, but this could stem from:
Model size: The base model (~100M parameters) may be too small.
Data volume: 3.7M molecules may offer insufficient tokens for robust learning.
Data format: Current pretraining formats may not encourage few-shot generalization.
Overview:
We introduce GraphGPT, a novel self-supervised generative pre-trained model for graph learning based on the Graph Eulerian Transformer (GET).
First, we propose GET, which combines a standard transformer encoder or decoder architecture with an innovative graph-to-sequence transformation method.
This method converts graphs or sampled subgraphs into sequences of tokens representing nodes, edges, and attributes in a reversible manner using Eulerian paths.
We pre-train GET using either of two self-supervised tasks: next-token prediction (NTP)
and scheduled masked-token prediction (SMTP).
The pre-trained model is then fine-tuned for downstream tasks such as graph-, edge-, and node-level prediction.
Despite its simplicity, GraphGPT achieves performance comparable to or surpassing state-of-the-art methods on multiple large-scale Open Graph Benchmark (OGB) datasets.
It demonstrates exceptional results on the molecular property prediction dataset PCQM4Mv2 and the protein-protein interaction dataset ogbl-ppa.
Notably, generative pretraining enables scaling GraphGPT to 2 billion parameters while maintaining performance gains — a breakthrough that overcomes the scalability
limitations of traditional Graph Neural Networks (GNNs) and prior graph transformers (GTs).
To advance research in graph foundation models and facilitate scientific discovery in chemistry, materials science, and related fields,
we will release the source code and pre-trained checkpoints.
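To make the two objectives concrete, here is a minimal sketch of how training targets could be formed under each (the cosine schedule follows MaskGIT; the token handling and mask id are illustrative assumptions, not GraphGPT's exact recipe):

```python
import math
import random

MASK = -1  # hypothetical id of the [MASK] token

def ntp_pairs(tokens):
    """Next-token prediction: each prefix predicts the following token."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def smtp_sample(tokens, rng):
    """Scheduled masked-token prediction: draw a mask ratio from a
    cosine schedule (as in MaskGIT), mask that many positions, and
    predict the masked tokens from the remaining context."""
    ratio = math.cos(math.pi / 2 * rng.random())   # in (0, 1]
    k = max(1, math.ceil(ratio * len(tokens)))
    masked = set(rng.sample(range(len(tokens)), k))
    corrupted = [MASK if i in masked else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked}
    return corrupted, targets

seq = [7, 3, 9, 1, 5]
print(ntp_pairs(seq)[0])  # ([7], 3)
corrupted, targets = smtp_sample(seq, random.Random(0))
```

NTP always predicts left-to-right, while SMTP sees bidirectional context with a varying corruption level, which is the link to dLLM training noted elsewhere in this README.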
Graph to Sequences
After converting Eulerized graphs to sequences, there are several ways to attach node and edge attributes to
the sequences. We name these methods short, long, and prolonged.
Given a graph, we first Eulerize it, then turn it into an equivalent sequence, and finally re-index the nodes
cyclically.
Assume the graph has one node attribute and one edge attribute; the short, long, and prolonged methods
are then shown above.
In the figures above, n1, n2, and e1 represent the tokens of node and edge attributes, and [p] represents the
padding token.
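The reversible graph-to-sequence step can be illustrated with a minimal Eulerian-path sketch (Hierholzer's algorithm; this toy version assumes the path already exists and ignores the attribute tokens discussed above):

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm: return a node sequence that traverses
    every edge exactly once (assumes such a path exists)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # Start at an odd-degree node if one exists, else anywhere.
    start = next((n for n in adj if len(adj[n]) % 2 == 1), next(iter(adj)))
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if adj[v]:
            u = adj[v].pop()
            adj[u].remove(v)   # consume the edge in both directions
            stack.append(u)
        else:
            path.append(stack.pop())
    return path[::-1]

# A triangle with a tail: nodes 2 and 3 have odd degree, so a path exists.
print(eulerian_path([(0, 1), (1, 2), (2, 0), (2, 3)]))  # [2, 0, 1, 2, 3]
```

The resulting node sequence visits every edge once, so the original graph can be recovered from the sequence, which is what makes the transformation reversible.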
Cyclical node re-index
A straightforward way to re-index the sequence of nodes is to start from 0 and increment by 1. This way, tokens
with small indices are trained sufficiently while those with large indices are not. To overcome this, we propose
cyclical re-indexing, which starts from a random number in a given range, say [0, 255], and increments by 1;
after hitting the boundary (e.g., 255), the next node index wraps around to 0.
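A minimal sketch of cyclical re-indexing (the range [0, 255] comes from the text; the helper name is illustrative):

```python
import random

def cyclical_reindex(num_nodes, vocab_size=256, rng=random):
    """Assign node indices starting at a random offset and wrapping
    at vocab_size, so every index token gets seen during training."""
    start = rng.randrange(vocab_size)
    return [(start + i) % vocab_size for i in range(num_nodes)]

# With a start of 254 and 4 nodes, the indices wrap: 254, 255, 0, 1.
```

Over many training samples the random offset spreads gradient updates uniformly across all index tokens.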
Datasets
The datasets are downloaded using the python package ogb.
When you run scripts in ./examples, the dataset will be automatically downloaded.
However, the PCQM4M-v2 dataset is huge, and downloading and
preprocessing it this way might be problematic. We suggest running cd ./src/utils/ and then python dataset_utils.py
to download and preprocess the dataset separately.
Run
Pre-train: modify the parameters in ./examples/graph_lvl/pcqm4m_v2_pretrain.sh (e.g., dataset_name, model_name,
batch_size, workerCount, etc.), and then run ./examples/graph_lvl/pcqm4m_v2_pretrain.sh to pre-train
the model on the PCQM4M-v2 dataset.
To run a toy example, run ./examples/toy_examples/reddit_pretrain.sh directly.
Fine-tune: modify the parameters in ./examples/graph_lvl/pcqm4m_v2_supervised.sh (e.g., dataset_name, model_name,
batch_size, workerCount, pretrain_cpt, etc.), and then run ./examples/graph_lvl/pcqm4m_v2_supervised.sh
to fine-tune on downstream tasks.
To run a toy example, run ./examples/toy_examples/reddit_supervised.sh directly.
License
Released under the MIT license (see LICENSE):
Ali-GraphGPT-project is an AI project on training large-scale transformers with graph datasets,
developed by Alibaba and licensed under the MIT License.
GraphGPT: Generative Pre-trained Graph Eulerian Transformer (ICML 2025)
This repository is the official implementation of “GraphGPT: Generative Pre-trained Graph Eulerian Transformer” in PyTorch.
Results (updated @ 2025-05-15)
Graph-level tasks: PCQM4M-v2 and ogbg-molpcba
Edge-level tasks: ogbl-ppa and ogbl-citation2
Node-level tasks: ogbn-proteins and ogbn-products
Installation
Project Structure
Code Norm
Pre-commit
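The config content referenced below was not included in this page; a typical minimal .pre-commit-config.yaml for a Python project might look like the following (the hooks shown are common choices and an assumption, not necessarily this project's actual configuration):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
  - repo: https://github.com/psf/black
    rev: 24.4.2
    hooks:
      - id: black
```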
.pre-commit-config.yaml: create the file with the following content for Python.
pre-commit install: install pre-commit into your git hooks. pre-commit install should always be the first thing you do.
pre-commit run --all-files: run all pre-commit hooks on the repository.
pre-commit autoupdate: update your hooks to the latest versions automatically.
git commit -n: disable pre-commit checks for a particular commit.
Citation
If you find this work useful, please kindly cite the following paper:
@inproceedings{zhao2025graphgpt,
  title={GraphGPT: Generative Pre-trained Graph Eulerian Transformer},
  author={Qifang Zhao and Weidong Ren and Tianyu Li and Hong Liu and Xingsheng He and Xiaoxiao Xu},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=4RdzeucFmW}
}
Contact
Qifang Zhao (james.zqf@alibaba-inc.com)
Sincerely appreciate your suggestions on our work!