VAR: a new visual generation method elevates GPT-style models beyond diffusion🚀 & Scaling laws observed📈

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

NeurIPS 2024 Best Paper

🕹️ Try and Play with VAR!

We provide a demo website for you to play with VAR models and generate images interactively, as well as a demo website for VAR Text-to-Image. Enjoy the fun of visual autoregressive modeling!
We also provide demo_sample.ipynb for you to see more technical details about VAR.
What’s New?
🔥 Introducing VAR: a new paradigm in autoregressive visual generation✨:
Visual Autoregressive Modeling (VAR) redefines autoregressive learning on images as coarse-to-fine “next-scale prediction” or “next-resolution prediction”, diverging from the standard raster-scan “next-token prediction”.
🔥 For the first time, GPT-style autoregressive models surpass diffusion models🚀:
🔥 Discovering power-law Scaling Laws in VAR transformers📈:
🔥 Zero-shot generalizability🛠️:
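To make “next-scale prediction” concrete: instead of one flat raster scan of tokens, the model predicts whole token maps at a sequence of growing resolutions. The schedule below is illustrative only (the patch numbers are assumptions, not necessarily the repository's exact configuration):

```python
# Coarse-to-fine token-map resolutions, from 1x1 up to 16x16 (e.g. for 256x256
# images with 16x16 latent patches). Each step predicts a full (p x p) token map.
patch_nums = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)
tokens_per_scale = [p * p for p in patch_nums]
total_tokens = sum(tokens_per_scale)  # vs. 16*16 = 256 tokens for raster-scan AR
```

The final 16x16 map alone matches the raster-scan sequence length; the coarser scales add relatively few extra tokens while giving the model a global-to-local generation order.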
For a deep dive into our analyses, discussions, and evaluations, check out our paper.
VAR zoo
We provide VAR models for you to play with, which can be downloaded from the following links:
You can load these models to generate images via the code in demo_sample.ipynb. Note: you need to download vae_ch160v4096z32.pth first.

Installation

Install torch>=2.0.0, then install the other pip packages via pip3 install -r requirements.txt.

Prepare the ImageNet dataset; we assume ImageNet is located at /path/to/imagenet.

NOTE: The arg --data_path=/path/to/imagenet should be passed to the training script.

(Optional) Install and compile flash-attn and xformers for faster attention computation. Our code will automatically use them if installed. See models/basic_var.py#L15-L30.
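The automatic fallback can be approximated with a guarded import. This is a generic sketch of the pattern, not the repository's actual code (flash_attn_func is the public entry point of the flash-attn package; the flag name is assumed):

```python
# Optional fast-attention path: use flash-attn if it is installed, else fall
# back to the default attention implementation.
try:
    from flash_attn import flash_attn_func  # noqa: F401  (requires flash-attn)
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False
```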
Training Scripts
To train VAR-{d16, d20, d24, d30, d36-s} on ImageNet 256x256 or 512x512, you can run the following command:
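As a rough sketch only: training is launched with the standard PyTorch distributed launcher, and --data_path is documented above. The other flags and their defaults live in the repository's train.py, so treat everything here as a placeholder rather than the exact command:

```shell
# Illustrative single-node, 8-GPU launch; --depth would select the model size
# (e.g. 16 for VAR-d16). Check train.py for the real flag names and defaults.
torchrun --nproc_per_node=8 train.py --depth=16 --data_path=/path/to/imagenet
```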
A folder named local_output will be created to save the checkpoints and logs.
You can monitor the training process by checking the logs in local_output/log.txt and local_output/stdout.txt, or using tensorboard --logdir=local_output/.
If your experiment is interrupted, just rerun the command, and the training will automatically resume from the last checkpoint in local_output/ckpt*.pth (see utils/misc.py#L344-L357).
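The resume behavior can be pictured as picking the newest matching checkpoint file. A minimal stand-alone sketch (latest_ckpt is a hypothetical helper, not the function in utils/misc.py):

```python
import glob
import os

def latest_ckpt(folder="local_output"):
    """Return the most recently modified ckpt*.pth in folder, or None."""
    ckpts = glob.glob(os.path.join(folder, "ckpt*.pth"))
    return max(ckpts, key=os.path.getmtime) if ckpts else None
```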
Sampling & Zero-shot Inference
For FID evaluation, use var.autoregressive_infer_cfg(..., cfg=1.5, top_p=0.96, top_k=900, more_smooth=False) to sample 50,000 images (50 per class) and save them as PNG (not JPEG) files in a folder. Pack them into a .npz file via create_npz_from_sample_folder(sample_folder) in utils/misc.py#L344.
Then use OpenAI’s FID evaluation toolkit with the 256x256 or 512x512 reference ground-truth .npz file to evaluate FID, IS, precision, and recall.
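create_npz_from_sample_folder reads the saved PNGs back and stacks them into a single array. A stand-alone sketch of just the packing step over in-memory images (the function name is hypothetical; the arr_0 key follows the common FID-toolkit convention and is an assumption here):

```python
import numpy as np

def pack_samples_to_npz(samples, out_path):
    """Stack HxWx3 uint8 images into one array and save it as a .npz archive."""
    arr = np.stack(samples).astype(np.uint8)  # shape: (N, H, W, 3)
    np.savez(out_path, arr_0=arr)             # 'arr_0' is the key FID tools read
    return arr.shape
```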
Note that a relatively small cfg=1.5 is used to trade off image quality against diversity. You can raise it to cfg=5.0, or sample with autoregressive_infer_cfg(..., more_smooth=True), for better visual quality.
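The top_k=900 and top_p=0.96 arguments prune the token distribution before sampling. A generic numpy sketch of this filtering (a standard top-k / nucleus filter, not the repository's implementation):

```python
import numpy as np

def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    """Set logits outside the top-k / nucleus (top-p) set to -inf."""
    logits = np.asarray(logits, dtype=np.float64).copy()
    if top_k > 0:
        kth = np.sort(logits)[-top_k]          # smallest logit still in the top-k
        logits[logits < kth] = -np.inf
    if top_p < 1.0:
        order = np.argsort(logits)[::-1]       # indices, highest logit first
        probs = np.exp(logits[order] - logits[order][0])
        probs /= probs.sum()
        cum = np.cumsum(probs)
        remove = np.empty_like(cum, dtype=bool)
        remove[0] = False                      # always keep the top token
        remove[1:] = cum[:-1] > top_p          # drop tokens past the nucleus
        logits[order[remove]] = -np.inf
    return logits
```

Sampling then draws from the softmax of the filtered logits, so only high-probability tokens survive.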
We’ll provide the sampling script later.
Third-party Usage and Research
In this section, we cross-link third-party repositories and research that use VAR and report results. You can let us know by raising an issue. (Note: please report accuracy numbers and provide trained models in your new repository, to help others gauge correctness and model behavior.)
| Time | Research | Link |
| --- | --- | --- |
| [5/12/2025] | [ICML 2025] Continuous Visual Autoregressive Generation via Score Maximization | |
License

This project is licensed under the MIT License; see the LICENSE file for details.
Citation
If our work assists your research, feel free to give us a star ⭐ or cite us using:
@Article{VAR,
title={Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction},
author={Keyu Tian and Yi Jiang and Zehuan Yuan and Bingyue Peng and Liwei Wang},
year={2024},
eprint={2404.02905},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{Infinity,
title={Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis},
author={Jian Han and Jinlai Liu and Yi Jiang and Bin Yan and Yuqi Zhang and Zehuan Yuan and Bingyue Peng and Xiaobing Liu},
year={2024},
eprint={2412.04431},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.04431},
}