CVPR 2021 (Oral)

Taming Transformers for High-Resolution Image Synthesis
Patrick Esser*,
Robin Rombach*,
Björn Ommer
* equal contribution
tl;dr We combine the efficiency of convolutional approaches with the expressivity of transformers by introducing a convolutional VQGAN, which learns a codebook of context-rich visual parts, whose composition is modeled with an autoregressive transformer.
arXiv | BibTeX | Project Page
News
2022
Requirements
A suitable conda environment named taming can be created and activated with:
conda env create -f environment.yaml
conda activate taming
Overview of pretrained models
The following table provides an overview of all models that are currently available.
FID scores were evaluated using torch-fidelity.
For reference, we also include a link to the recently released autoencoder of the DALL-E model.
See the corresponding colab notebook for a comparison and discussion of reconstruction capabilities.
Running pretrained models
The commands below will start a streamlit demo which supports sampling at different resolutions and image completions. To run a non-interactive version of the sampling process, replace
streamlit run scripts/sample_conditional.py --
with
python scripts/make_samples.py --outdir <path_to_write_samples_to>
and keep the remaining command line arguments.
To sample from unconditional or class-conditional models, run
python scripts/sample_fast.py -r <path/to/config_and_checkpoint>
We describe below how to use this script to sample from the ImageNet, FFHQ, and CelebA-HQ models, respectively.
S-FLCKR

You can also run this model in a Colab notebook, which includes all necessary steps to start sampling.
Download the 2020-11-09T13-31-51_sflckr folder and place it into logs. Then, run
streamlit run scripts/sample_conditional.py -- -r logs/2020-11-09T13-31-51_sflckr/
ImageNet

Download the 2021-04-03T19-39-50_cin_transformer
folder and place it into logs. Sampling from the class-conditional ImageNet
model does not require any data preparation. To produce 50 samples for each of
the 1000 classes of ImageNet, with k=600 for top-k sampling, p=0.92 for nucleus
sampling and temperature t=1.0, run
python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/ -n 50 -k 600 -t 1.0 -p 0.92 --batch_size 25
To restrict the model to certain classes, provide them via the --classes argument, separated by commas. For example, to sample 50 ostriches, border collies and whiskey jugs, run
python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/ -n 50 -k 600 -t 1.0 -p 0.92 --batch_size 25 --classes 9,232,901
We recommend experimenting with the autoregressive decoding parameters (top-k, top-p and temperature) for best results.
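To build intuition for how the three decoding parameters interact, here is a minimal pure-Python sketch of temperature scaling followed by top-k and nucleus (top-p) filtering. The function names filter_logits and sample_token are hypothetical and the order of the filters is an assumption; scripts/sample_fast.py may implement these steps differently.

```python
import math
import random


def filter_logits(logits, top_k=None, top_p=1.0, temperature=1.0):
    """Return per-token probabilities after temperature, top-k and top-p filtering.

    Illustrative only: temperature must be positive, and the filter order
    (top-k first, then top-p) is an assumption of this sketch.
    """
    scaled = [value / temperature for value in logits]

    # Softmax with max-subtraction for numerical stability.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        order = order[:top_k]

    # Top-p: keep the shortest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalise over the surviving tokens; all others get probability 0.
    kept_mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / kept_mass
    return filtered


def sample_token(logits, rng, **kwargs):
    """Draw one token index from the filtered distribution."""
    probs = filter_logits(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Lower values of top_k or top_p concentrate sampling on the most likely codebook entries, while a higher temperature flattens the distribution before filtering.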
FFHQ/CelebA-HQ
Download the 2021-04-23T18-19-01_ffhq_transformer and
2021-04-23T18-11-19_celebahq_transformer
folders and place them into logs.
Again, sampling from these unconditional models does not require any data preparation.
To produce 50000 samples, with k=250 for top-k sampling,
p=1.0 for nucleus sampling and temperature t=1.0, run
python scripts/sample_fast.py -r logs/2021-04-23T18-19-01_ffhq_transformer/
for FFHQ and
python scripts/sample_fast.py -r logs/2021-04-23T18-11-19_celebahq_transformer/
to sample from the CelebA-HQ model.
For both models it can be advantageous to vary the top-k/top-p parameters for sampling.
FacesHQ

Download 2020-11-13T21-41-45_faceshq_transformer and place it into logs. Follow the data preparation steps for CelebA-HQ and FFHQ. Run
streamlit run scripts/sample_conditional.py -- -r logs/2020-11-13T21-41-45_faceshq_transformer/
D-RIN

Download 2020-11-20T12-54-32_drin_transformer and place it into logs. To run the demo on a couple of example depth maps included in the repository, run
streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T12-54-32_drin_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.imagenet.DRINExamples}}}"
To run the demo on the complete validation set, first follow the data preparation steps for
ImageNet and then run
streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T12-54-32_drin_transformer/
COCO
Download 2021-01-20T16-04-20_coco_transformer and place it into logs. To run the demo on a couple of example segmentation maps included in the repository, run
streamlit run scripts/sample_conditional.py -- -r logs/2021-01-20T16-04-20_coco_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.coco.Examples}}}"
ADE20k
Download 2020-11-20T21-45-44_ade20k_transformer and place it into logs. To run the demo on a couple of example segmentation maps included in the repository, run
streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T21-45-44_ade20k_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.ade20k.Examples}}}"
Scene Image Synthesis
Scene image generation based on bounding box conditionals, as done in our CVPR2021 AI4CC workshop paper High-Resolution Complex Scene Synthesis with Transformers (see talk on workshop page). The COCO and Open Images datasets are supported.
Training
Download first-stage models COCO-8k-VQGAN for COCO or COCO/Open-Images-8k-VQGAN for Open Images.
Change ckpt_path in configs/coco_scene_images_transformer.yaml and configs/open_images_scene_images_transformer.yaml to point to the downloaded first-stage models.
Download the full COCO/OI datasets and adapt data_path in the same files, unless working with the 100 files provided for training and validation suits your needs already.
Code can be run with
python main.py --base configs/coco_scene_images_transformer.yaml -t True --gpus 0,
or
python main.py --base configs/open_images_scene_images_transformer.yaml -t True --gpus 0,
Sampling
Train a model as described above or download a pre-trained model:
- Open Images: 1 billion parameter model, trained for 100 epochs. On 256x256 pixels: FID 41.48±0.21, SceneFID 14.60±0.15, Inception Score 18.47±0.27. The model was trained with 2D crops of images and is thus well prepared for generating high-resolution images, e.g. 512x512.
- Open Images: distilled version of the above model with 125 million parameters, which allows for sampling on smaller GPUs (4 GB is enough for sampling 256x256 px images). The model was trained for 60 epochs with 10% soft loss and 90% hard loss. On 256x256 pixels: FID 43.07±0.40, SceneFID 15.93±0.19, Inception Score 17.23±0.11.
- COCO 30 epochs
- COCO 60 epochs (find model statistics for both COCO versions in assets/coco_scene_images_training.svg)
When downloading a pre-trained model, remember to change ckpt_path in configs/*project.yaml to point to your downloaded first-stage model (see Training above).
Scene image generation can be run with
python scripts/make_scene_samples.py --outdir=/some/outdir -r /path/to/pretrained/model --resolution=512,512
Training on custom data
Training on your own dataset can be beneficial to get better tokens and hence better images for your domain.
These are the steps to follow to make this work:
- install the repo with conda env create -f environment.yaml, conda activate taming and pip install -e .
- put your .jpg files in a folder your_folder
- create two text files, xx_train.txt and xx_test.txt, that point to the files in your training and test set respectively (for example find $(pwd)/your_folder -name "*.jpg" > train.txt)
- adapt configs/custom_vqgan.yaml to point to these two files
- run python main.py --base configs/custom_vqgan.yaml -t True --gpus 0,1 to train on two GPUs. Use --gpus 0, (with a trailing comma) to train on a single GPU.
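The file-list step above can also be scripted. The following is a minimal sketch, assuming that the list files contain one absolute image path per line (as the find example suggests) and that a random 10% hold-out is acceptable; the helper name write_file_lists, the test_fraction default and the seed are hypothetical choices, not part of the repository.

```python
import random
from pathlib import Path


def write_file_lists(image_dir, train_list, test_list, test_fraction=0.1, seed=0):
    """Write absolute .jpg paths, one per line, into a train and a test list.

    Returns the number of training and test entries written.
    """
    # Collect absolute paths so the lists work from any working directory.
    paths = sorted(str(p.resolve()) for p in Path(image_dir).rglob("*.jpg"))

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(paths)

    n_test = int(len(paths) * test_fraction)
    test, train = paths[:n_test], paths[n_test:]

    Path(train_list).write_text("".join(p + "\n" for p in train))
    Path(test_list).write_text("".join(p + "\n" for p in test))
    return len(train), len(test)


# Example call (paths are placeholders):
# write_file_lists("your_folder", "xx_train.txt", "xx_test.txt")
```

The two resulting files are the ones that configs/custom_vqgan.yaml should point to.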
Data Preparation
ImageNet
The code will try to download (through Academic Torrents) and prepare ImageNet the first time it is used. However, since ImageNet is quite large, this requires a lot of disk space and time. If you already have ImageNet on your disk, you can speed things up by putting the data into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ (which defaults to ~/.cache/autoencoders/data/ILSVRC2012_{split}/data/), where {split} is one of train/validation. It should have the following structure:
${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│ ├── n01440764_10026.JPEG
│ ├── n01440764_10027.JPEG
│ ├── ...
├── n01443537
│ ├── n01443537_10007.JPEG
│ ├── n01443537_10014.JPEG
│ ├── ...
├── ...
If you haven’t extracted the data, you can also place ILSVRC2012_img_train.tar/ILSVRC2012_img_val.tar (or symlinks to them) into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/ and ${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/ respectively. They will then be extracted into the above structure without downloading the data again. Note that this will only happen if neither a folder ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ nor a file ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready exists. Remove them if you want to force running the dataset preparation again.
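The condition described above can be checked before starting a run. This is a minimal sketch that mirrors the description in this section rather than the repository's actual code; the function names and the cache_dir parameter are hypothetical, and the fallback to ~/.cache follows the default stated above.

```python
from pathlib import Path


def imagenet_root(split, cache_dir=None):
    """Return the ILSVRC2012_{split} folder for 'train' or 'validation'."""
    if split not in ("train", "validation"):
        raise ValueError("split must be 'train' or 'validation'")
    # Assumption: fall back to ~/.cache when no cache directory is given.
    base = Path(cache_dir) if cache_dir else Path.home() / ".cache"
    return base / "autoencoders" / "data" / f"ILSVRC2012_{split}"


def preparation_would_run(split, cache_dir=None):
    """True if neither the data/ folder nor the .ready marker exists."""
    root = imagenet_root(split, cache_dir)
    return not (root / "data").is_dir() and not (root / ".ready").is_file()
```

If preparation_would_run returns False but the data is incomplete, removing the data/ folder and the .ready file forces the preparation to run again, as noted above.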
You will then need to prepare the depth data using MiDaS. Create a symlink data/imagenet_depth pointing to a folder with two subfolders train and val, each mirroring the structure of the corresponding ImageNet folder described above and containing a png file for each of ImageNet’s JPEG files. The png encodes float32 depth values obtained from MiDaS as RGBA images. We provide the script scripts/extract_depth.py to generate this data.
Please note that this script uses MiDaS via PyTorch Hub. When we prepared the data, the hub provided the MiDaS v2.0 version, but now it provides a v2.1 version. We haven’t tested our models with depth maps obtained via v2.1. If you want to make sure that things work as expected, adjust the script so that it explicitly uses v2.0!
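Storing float32 depth as RGBA works because one 32-bit float occupies exactly four bytes, one per 8-bit channel, so the value survives the round trip without loss. The sketch below illustrates the idea for a single pixel with the standard struct module; the little-endian byte order and the function names are assumptions, and scripts/extract_depth.py may lay out the bytes differently.

```python
import struct


def float_to_rgba(value):
    """Pack one float32 depth value into four 8-bit channel values."""
    # Assumption: little-endian byte order; the real script may differ.
    return tuple(struct.pack("<f", value))


def rgba_to_float(rgba):
    """Recover the float32 value from its four channel bytes."""
    return struct.unpack("<f", bytes(rgba))[0]


# Example: 1.0 is 0x3F800000 as float32, i.e. bytes 00 00 80 3F in little-endian.
# float_to_rgba(1.0) -> (0, 0, 128, 63)
```

Because the channels carry raw bytes rather than colours, such png files must not be resized or colour-converted before decoding.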
CelebA-HQ
Create a symlink data/celebahq pointing to a folder containing the .npy files of CelebA-HQ (instructions to obtain them can be found in the PGGAN repository).
FFHQ
Create a symlink data/ffhq pointing to the images1024x1024 folder obtained from the FFHQ repository.
S-FLCKR
Unfortunately, we are not allowed to distribute the images we collected for the S-FLCKR dataset and can therefore only give a description of how it was produced. There are many resources on collecting images from the web to get started.
We collected sufficiently large images from flickr (see data/flickr_tags.txt for a full list of tags used to find images) and various subreddits (see data/subreddits.txt for all subreddits that were used).
Overall, we collected 107625 images, and split them randomly into 96861 training images and 10764 validation images. We then obtained segmentation masks for each image using DeepLab v2 trained on COCO-Stuff. We used a PyTorch reimplementation and include an example script for this process in scripts/extract_segmentation.py.
COCO
Create a symlink data/coco containing the images from the 2017 split in train2017 and val2017, and their annotations in annotations. Files can be obtained from the COCO webpage. In addition, we use the Stuff+thing PNG-style annotations on COCO 2017 trainval annotations from COCO-Stuff, which should be placed under data/cocostuffthings.
ADE20k
Create a symlink data/ade20k_root containing the contents of ADEChallengeData2016.zip from the MIT Scene Parsing Benchmark.
Training models
FacesHQ
Train a VQGAN with
python main.py --base configs/faceshq_vqgan.yaml -t True --gpus 0,
Then, adjust the checkpoint path of the config key model.params.first_stage_config.params.ckpt_path in configs/faceshq_transformer.yaml (or download 2020-11-09T13-33-36_faceshq_vqgan and place it into logs, which corresponds to the preconfigured checkpoint path), then run
python main.py --base configs/faceshq_transformer.yaml -t True --gpus 0,
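The dotted config key model.params.first_stage_config.params.ckpt_path names a nested entry in the YAML file. As a rough illustration of what adjusting it means, the sketch below sets such a key on a plain Python dict standing in for the loaded config. The helper set_by_dotted_key and the example checkpoint path are hypothetical; in practice you would simply edit the YAML file.

```python
def set_by_dotted_key(config, dotted_key, value):
    """Set a nested dict entry addressed by a dotted key such as 'a.b.c'."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Walk down one level, creating intermediate dicts when missing.
        node = node.setdefault(name, {})
    node[leaf] = value
    return config


# Example with a placeholder checkpoint path:
# cfg = {"model": {"params": {"first_stage_config": {"params": {}}}}}
# set_by_dotted_key(
#     cfg,
#     "model.params.first_stage_config.params.ckpt_path",
#     "logs/my_vqgan_run/checkpoints/last.ckpt",  # hypothetical path
# )
```

The same addressing applies to model.params.cond_stage_config.params.ckpt_path used for the conditioning stage in the D-RIN setup.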
D-RIN
Train a VQGAN on ImageNet with
python main.py --base configs/imagenet_vqgan.yaml -t True --gpus 0,
or download a pretrained one from 2020-09-23T17-56-33_imagenet_vqgan and place it under logs. If you trained your own, adjust the path in the config key model.params.first_stage_config.params.ckpt_path of configs/drin_transformer.yaml.
Train a VQGAN on Depth Maps of ImageNet with
python main.py --base configs/imagenetdepth_vqgan.yaml -t True --gpus 0,
or download a pretrained one from 2020-11-03T15-34-24_imagenetdepth_vqgan and place it under logs. If you trained your own, adjust the path in the config key model.params.cond_stage_config.params.ckpt_path of configs/drin_transformer.yaml.
To train the transformer, run
python main.py --base configs/drin_transformer.yaml -t True --gpus 0,
More Resources
Comparing Different First Stage Models
The reconstruction and compression capabilities of different first-stage models can be analyzed in this colab notebook.
In particular, the notebook compares two VQGANs with a downsampling factor of f=16 each and codebook sizes of 1024 and 16384 entries, a VQGAN with f=8 and 8192 codebook entries, and the discrete autoencoder of OpenAI’s DALL-E (which has f=8 and 8192 codebook entries).
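To make the trade-off between these settings concrete, the following sketch computes how many tokens represent an image and how many bits they carry under a plain fixed-length code. The 256x256 input size is an assumption chosen for illustration, the function name latent_stats is hypothetical, and the notebook itself may report different measures.

```python
import math


def latent_stats(image_size, f, codebook_entries):
    """Return (grid side, token count, total bits) for a square image.

    A downsampling factor f maps an image_size x image_size image to a
    grid of (image_size / f)^2 codebook indices, each of which needs
    log2(codebook_entries) bits under a fixed-length code.
    """
    side = image_size // f
    tokens = side * side
    bits = tokens * math.log2(codebook_entries)
    return side, tokens, bits


# Assuming 256x256 inputs:
# latent_stats(256, 16, 1024)  -> 16x16 grid,  256 tokens, 10 bits each
# latent_stats(256, 16, 16384) -> 16x16 grid,  256 tokens, 14 bits each
# latent_stats(256, 8, 8192)   -> 32x32 grid, 1024 tokens, 13 bits each
```

Under these assumptions, halving f quadruples the sequence length the transformer has to model, whereas enlarging the codebook only adds a few bits per token.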

Other
Text-to-Image Optimization via CLIP
VQGAN has been successfully used as an image generator guided by the CLIP model, both for pure image generation
from scratch and image-to-image translation. We recommend the following notebooks/videos/resources:

Text prompt: ‘A bird drawn by a child’
Shout-outs
Thanks to everyone who makes their code and models available. In particular,
BibTeX
@misc{esser2020taming,
title={Taming Transformers for High-Resolution Image Synthesis},
author={Patrick Esser and Robin Rombach and Björn Ommer},
year={2020},
eprint={2012.09841},
archivePrefix={arXiv},
primaryClass={cs.CV}
}