VAR: a new visual generation method elevates GPT-style models beyond diffusion🚀 & Scaling laws observed📈

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

NeurIPS 2024 Best Paper

🕹️ Try and Play with VAR!

We provide a demo website for you to play with VAR models and generate images interactively, as well as a demo website for VAR Text-to-Image. Enjoy the fun of visual autoregressive modeling!
We also provide demo_sample.ipynb for you to see more technical details about VAR.
What’s New?
🔥 Introducing VAR: a new paradigm in autoregressive visual generation✨:
Visual Autoregressive Modeling (VAR) redefines autoregressive learning on images as coarse-to-fine “next-scale prediction” or “next-resolution prediction”, diverging from the standard raster-scan “next-token prediction”.
🔥 For the first time, GPT-style autoregressive models surpass diffusion models🚀:
🔥 Discovering power-law Scaling Laws in VAR transformers📈:
🔥 Zero-shot generalizability🛠️:
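To make “next-scale prediction” concrete: instead of one flat raster scan of tokens, the model predicts whole token maps at a sequence of growing resolutions. The schedule below is illustrative only (the patch numbers are assumptions, not necessarily the repository's exact configuration):

```python
# Coarse-to-fine token-map resolutions, from 1x1 up to 16x16 (e.g. for 256x256
# images with 16x16 latent patches). Each step predicts a full (p x p) token map.
patch_nums = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)
tokens_per_scale = [p * p for p in patch_nums]
total_tokens = sum(tokens_per_scale)  # vs. 16*16 = 256 tokens for raster-scan AR
```

The final 16x16 map alone matches the raster-scan sequence length; the coarser scales add relatively few extra tokens while giving the model a global-to-local generation order.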
For a deep dive into our analyses, discussions, and evaluations, check out our paper.
VAR zoo
We provide VAR models for you to play with, which can be downloaded from the following links:
You can load these models to generate images via the code in demo_sample.ipynb. Note: you need to download vae_ch160v4096z32.pth first.

Installation

Install torch>=2.0.0, then install the other pip packages via pip3 install -r requirements.txt.

Prepare the ImageNet dataset; we assume ImageNet is located at /path/to/imagenet.

NOTE: The arg --data_path=/path/to/imagenet should be passed to the training script.

(Optional) Install and compile flash-attn and xformers for faster attention computation. Our code will automatically use them if installed. See models/basic_var.py#L15-L30.
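The automatic fallback can be approximated with a guarded import. This is a generic sketch of the pattern, not the repository's actual code (flash_attn_func is the public entry point of the flash-attn package; the flag name is assumed):

```python
# Optional fast-attention path: use flash-attn if it is installed, else fall
# back to the default attention implementation.
try:
    from flash_attn import flash_attn_func  # noqa: F401  (requires flash-attn)
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False
```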
Training Scripts
To train VAR-{d16, d20, d24, d30, d36-s} on ImageNet 256x256 or 512x512, you can run the following command:
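As a rough sketch only: training is launched with the standard PyTorch distributed launcher, and --data_path is documented above. The other flags and their defaults live in the repository's train.py, so treat everything here as a placeholder rather than the exact command:

```shell
# Illustrative single-node, 8-GPU launch; --depth would select the model size
# (e.g. 16 for VAR-d16). Check train.py for the real flag names and defaults.
torchrun --nproc_per_node=8 train.py --depth=16 --data_path=/path/to/imagenet
```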
A folder named local_output will be created to save the checkpoints and logs.
You can monitor the training process by checking the logs in local_output/log.txt and local_output/stdout.txt, or using tensorboard --logdir=local_output/.
If your experiment is interrupted, just rerun the command, and the training will automatically resume from the last checkpoint in local_output/ckpt*.pth (see utils/misc.py#L344-L357).
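The resume behavior can be pictured as picking the newest matching checkpoint file. A minimal stand-alone sketch (latest_ckpt is a hypothetical helper, not the function in utils/misc.py):

```python
import glob
import os

def latest_ckpt(folder="local_output"):
    """Return the most recently modified ckpt*.pth in folder, or None."""
    ckpts = glob.glob(os.path.join(folder, "ckpt*.pth"))
    return max(ckpts, key=os.path.getmtime) if ckpts else None
```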
Sampling & Zero-shot Inference
For FID evaluation, use var.autoregressive_infer_cfg(..., cfg=1.5, top_p=0.96, top_k=900, more_smooth=False) to sample 50,000 images (50 per class) and save them as PNG (not JPEG) files in a folder. Pack them into a .npz file via create_npz_from_sample_folder(sample_folder) in utils/misc.py#L344.
Then use OpenAI’s FID evaluation toolkit with the 256x256 or 512x512 reference ground-truth .npz file to evaluate FID, IS, precision, and recall.
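create_npz_from_sample_folder reads the saved PNGs back and stacks them into a single array. A stand-alone sketch of just the packing step over in-memory images (the function name is hypothetical; the arr_0 key follows the common FID-toolkit convention and is an assumption here):

```python
import numpy as np

def pack_samples_to_npz(samples, out_path):
    """Stack HxWx3 uint8 images into one array and save it as a .npz archive."""
    arr = np.stack(samples).astype(np.uint8)  # shape: (N, H, W, 3)
    np.savez(out_path, arr_0=arr)             # 'arr_0' is the key FID tools read
    return arr.shape
```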
Note that a relatively small cfg=1.5 is used to trade off image quality against diversity. You can raise it to cfg=5.0, or sample with autoregressive_infer_cfg(..., more_smooth=True), for better visual quality.
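The top_k=900 and top_p=0.96 arguments prune the token distribution before sampling. A generic numpy sketch of this filtering (a standard top-k / nucleus filter, not the repository's implementation):

```python
import numpy as np

def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    """Set logits outside the top-k / nucleus (top-p) set to -inf."""
    logits = np.asarray(logits, dtype=np.float64).copy()
    if top_k > 0:
        kth = np.sort(logits)[-top_k]          # smallest logit still in the top-k
        logits[logits < kth] = -np.inf
    if top_p < 1.0:
        order = np.argsort(logits)[::-1]       # indices, highest logit first
        probs = np.exp(logits[order] - logits[order][0])
        probs /= probs.sum()
        cum = np.cumsum(probs)
        remove = np.empty_like(cum, dtype=bool)
        remove[0] = False                      # always keep the top token
        remove[1:] = cum[:-1] > top_p          # drop tokens past the nucleus
        logits[order[remove]] = -np.inf
    return logits
```

Sampling then draws from the softmax of the filtered logits, so only high-probability tokens survive.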
We’ll provide the sampling script later.
Third-party Usage and Research
In this section, we cross-link third-party repositories and research that use VAR and report results. You can let us know by raising an issue. (Note: please report accuracy numbers and provide trained models in your new repository, to help others gauge correctness and model behavior.)
| Time | Research | Link |
| --- | --- | --- |
| [5/12/2025] | [ICML 2025] Continuous Visual Autoregressive Generation via Score Maximization | |
License

This project is licensed under the MIT License; see the LICENSE file for details.
Citation
If our work assists your research, feel free to give us a star ⭐ or cite us using:
@Article{VAR,
title={Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction},
author={Keyu Tian and Yi Jiang and Zehuan Yuan and Bingyue Peng and Liwei Wang},
year={2024},
eprint={2404.02905},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{Infinity,
title={Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis},
author={Jian Han and Jinlai Liu and Yi Jiang and Bin Yan and Yuqi Zhang and Zehuan Yuan and Bingyue Peng and Xiaobing Liu},
year={2024},
eprint={2412.04431},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.04431},
}