[🔥CVPR'25] Tora: Trajectory-oriented Diffusion Transformer for Video Generation
Zhenghao Zhang*, Junchao Liao*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang
* equal contribution
This is the official repository for the paper “Tora: Trajectory-oriented Diffusion Transformer for Video Generation”.
💡 Abstract
Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions concurrently for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos following trajectories. Our design aligns seamlessly with DiT’s scalability, allowing precise control of video content’s dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora’s excellence in achieving high motion fidelity, while also meticulously simulating the movement of physical world.
📣 Updates
2025/07/08 🔥🔥 Our latest work, Tora2, has been accepted by ACM MM25. Tora2 builds on Tora with design improvements, enabling enhanced appearance and motion customization for multiple entities.
2025/05/24 We open-sourced a LoRA-finetuned model of Wan. It turns things in the image into fluffy toys. Check this out: https://github.com/alibaba/wan-toy-transform
2025/01/06 🔥🔥We released Tora Image-to-Video, including inference code and model weights.
2024/12/13 SageAttention2 and model compilation are supported in the diffusers version. Tested on the A10, these approaches speed up every inference step by approximately 52%, except for the first step.
2024/12/09 🔥🔥Diffusers version of Tora and the corresponding model weights are released. Inference VRAM requirements are reduced to around 5 GiB. Please refer to this for details.
2024/11/25 🔥Text-to-Video training code released.
2024/10/31 Model weights uploaded to HuggingFace. We also provided an English demo on ModelScope.
2024/10/23 🔥🔥Our ModelScope Demo is launched. Welcome to try it out! We have also uploaded the model weights to ModelScope.
2024/10/21 Thanks to @kijai for supporting Tora in ComfyUI! Link
2024/10/15 🔥🔥We released our inference code and model weights. Please note that this is a CogVideoX version of Tora, built on the CogVideoX-5B model. This version of Tora is meant for academic research purposes only. Due to our commercial plans, we will not be open-sourcing the complete version of Tora at this time.
2024/08/27 We released our v2 paper including appendix.
2024/07/31 We submitted our paper on arXiv and released our project page.
🐍 Installation
Please make sure your Python version is between 3.10 and 3.12 (inclusive).
# Clone this repository.
git clone https://github.com/alibaba/Tora.git
cd Tora
# Install PyTorch (we use PyTorch 2.4.0) and torchvision following the official instructions: https://pytorch.org/get-started/previous-versions/. For example:
conda create -n tora python==3.10
conda activate tora
conda install pytorch==2.4.0 torchvision==0.19.0 pytorch-cuda=12.1 -c pytorch -c nvidia
# Install requirements
cd modules/SwissArmyTransformer
pip install -e .
cd ../../sat
pip install -r requirements.txt
cd ..
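The Python version requirement stated at the top of this section can be checked before running anything heavier. The following is a minimal sketch, not part of the repository; the function name `python_version_supported` is hypothetical.

```python
import sys

# Supported range from the installation note: 3.10 through 3.12, inclusive.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def python_version_supported(version=None):
    """Return True if the (major, minor) pair lies within the supported range.

    `version` defaults to the running interpreter; pass a tuple such as
    (3, 11) to check another version.
    """
    major, minor = tuple(version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


# Report on the current interpreter without aborting.
print("Python version supported:", python_version_supported())
```

Run it inside the activated `tora` environment to confirm that conda picked up the interpreter you expect.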
📦 Model Weights
Folder Structure
Tora
└── sat
└── ckpts
├── t5-v1_1-xxl
│ ├── model-00001-of-00002.safetensors
│ └── ...
├── vae
│ └── 3d-vae.pt
├── tora
│ ├── i2v
│ │ └── mp_rank_00_model_states.pt
│ └── t2v
│ └── mp_rank_00_model_states.pt
└── CogVideoX-5b-sat # for training stage 1
└── mp_rank_00_model_states.pt
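To confirm that downloaded weights match the layout above, a small check like the one below may help. It is a sketch only: the helper name `missing_checkpoints` and its `include_training` flag are hypothetical, and the paths are copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the folder structure above.
EXPECTED_DIRS = ["t5-v1_1-xxl"]
EXPECTED_FILES = [
    "vae/3d-vae.pt",
    "tora/i2v/mp_rank_00_model_states.pt",
    "tora/t2v/mp_rank_00_model_states.pt",
]
# Only needed for training stage 1, per the comment in the tree.
TRAINING_FILES = ["CogVideoX-5b-sat/mp_rank_00_model_states.pt"]


def missing_checkpoints(ckpt_root, include_training=False):
    """Return the expected entries that are absent under ckpt_root."""
    root = Path(ckpt_root)
    files = EXPECTED_FILES + (TRAINING_FILES if include_training else [])
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in files if not (root / f).is_file()]
    return missing
```

Calling `missing_checkpoints("sat/ckpts")` from the `Tora` directory returns an empty list when every inference checkpoint is in place.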
Download Links
Note: Downloading the tora weights requires following the CogVideoX License. You can choose one of the following options: HuggingFace, ModelScope, or native links.
After downloading the model weights, you can put them in the Tora/sat/ckpts folder.
HuggingFace
# This can be faster
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Alibaba-Research-Intelligence-Computing/Tora --local-dir ckpts
or
# use git
git lfs install
git clone https://huggingface.co/Alibaba-Research-Intelligence-Computing/Tora
ModelScope
SDK
from modelscope import snapshot_download
model_dir = snapshot_download('xiaoche/Tora')
Native
Tora t2v model weights: Link. Downloading these weights requires following the CogVideoX License.
🔄 Inference
Text to Video
Inference requires around 30 GiB of GPU memory (tested on an NVIDIA A100).
cd sat
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True torchrun --standalone --nproc_per_node=$N_GPU sample_video.py --base configs/tora/model/cogvideox_5b_tora.yaml configs/tora/inference_sparse.yaml --load ckpts/tora/t2v --output-dir samples --point_path trajs/coaster.txt --input-file assets/text/t2v/examples.txt
You can change the --input-file and --point_path to your own prompts and trajectory points files. Please note that the trajectory is drawn on a 256x256 canvas.
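If your trajectory points were picked on an image of a different size, they need to be expressed in the 256x256 canvas space first. The sketch below shows only that coordinate conversion. It assumes each point is an `(x, y)` pair in pixels; the function name `to_canvas` is hypothetical, and the on-disk format of the points file is not described here, so writing the file is left out.

```python
CANVAS_SIZE = 256  # trajectories are drawn on a 256x256 canvas


def to_canvas(points, width, height, canvas=CANVAS_SIZE):
    """Map (x, y) pixel coordinates on a width x height image to canvas coordinates."""
    return [(x * canvas / width, y * canvas / height) for x, y in points]


# Example: the centre and bottom-right corner of a 720x480 image.
print(to_canvas([(360, 240), (720, 480)], width=720, height=480))
```

Check the result against an existing file such as `trajs/coaster.txt` before relying on it, since the expected serialization is an open assumption here.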
Replace $N_GPU with the number of GPUs you want to use.
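When launching inference from a script, it can be convenient to assemble the same command as an argument list. The following is a sketch under that assumption; `build_inference_command` is a hypothetical helper, and every flag and path is copied from the command above.

```python
import shlex


def build_inference_command(n_gpu, point_path, input_file, output_dir="samples"):
    """Assemble the text-to-video torchrun invocation as an argument list."""
    return [
        "torchrun", "--standalone", f"--nproc_per_node={n_gpu}",
        "sample_video.py",
        "--base", "configs/tora/model/cogvideox_5b_tora.yaml",
        "configs/tora/inference_sparse.yaml",
        "--load", "ckpts/tora/t2v",
        "--output-dir", output_dir,
        "--point_path", point_path,
        "--input-file", input_file,
    ]


cmd = build_inference_command(1, "trajs/coaster.txt", "assets/text/t2v/examples.txt")
print(shlex.join(cmd))  # shell-quoted form, for inspection
```

To actually run it, pass `cmd` to `subprocess.run` with `cwd` set to the `sat` directory and `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` added to the environment, mirroring the shell command.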
Image to Video
The first frame images should be placed in the --img_dir. The names of these images should be specified in the corresponding text prompt in --input-file, separated by @@.
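The `@@` convention can be illustrated with a small parser. This is a sketch, not the repository's loader: the helper name `split_prompt_line`, the sample line, and the order of the fields (prompt first, image name second) are all assumptions, so compare against the repository's example prompt files before relying on it.

```python
SEPARATOR = "@@"


def split_prompt_line(line):
    """Split one line of the input file into its '@@'-separated fields."""
    return [field.strip() for field in line.strip().split(SEPARATOR)]


# Hypothetical input line; the field order is an assumption.
fields = split_prompt_line("A sailboat drifts across a calm bay.@@sailboat.png\n")
print(fields)
```

A line without `@@` comes back as a single-element list, which makes it easy to detect prompts that are missing their image name.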
Recommendations for Text Prompts
For text prompts, we highly recommend using GPT-4 to enhance the details. Simple prompts may negatively impact both visual quality and motion control effectiveness.
You can refer to the following resources for guidance:
📚 Citation
@inproceedings{zhang2025tora,
title={Tora: Trajectory-oriented diffusion transformer for video generation},
author={Zhang, Zhenghao and Liao, Junchao and Li, Menghao and Dai, Zuozhuo and Qiu, Bingxue and Zhu, Siyu and Qin, Long and Wang, Weizhi},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={2063--2073},
year={2025}
}
🎞️ Showcases
https://github.com/user-attachments/assets/949d5e99-18c9-49d6-b669-9003ccd44bf1
https://github.com/user-attachments/assets/7e7dbe87-a8ba-4710-afd0-9ef528ec329b
https://github.com/user-attachments/assets/4026c23d-229d-45d7-b5be-6f3eb9e4fd50
All videos are available in this Link
✅ TODO List
🧨 Diffusers version
Please refer to the diffusers version for details.
🖥️ Gradio Demo
Usage:
🧠 Training
Data Preparation
Following this guide https://github.com/THUDM/CogVideo/blob/main/sat/README.md#preparing-the-dataset, structure the datasets as follows:
Training data examples are in sat/training_examples.
Text to Video
Training requires around 60 GiB of GPU memory (tested on an NVIDIA A100).
Replace $N_GPU with the number of GPUs you want to use.
🎯 Troubleshooting
1. ValueError: Non-consecutive added token…
Upgrade the transformers package to 4.44.2. See this issue.
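To see whether the installed transformers package is older than the version named above, a check along these lines may help. It is a minimal sketch: the helper names are hypothetical, and the version parsing is deliberately simple (only the first three numeric components are compared).

```python
from importlib import metadata
from itertools import takewhile

REQUIRED = (4, 44, 2)  # version named in the troubleshooting note


def parse_version(text):
    """Turn '4.44.2' into (4, 44, 2), keeping only leading digits of each part."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def needs_upgrade(version_text):
    """Return True if version_text is older than the required version."""
    return parse_version(version_text) < REQUIRED


def installed_transformers_version():
    """Return the installed transformers version string, or None if absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None
```

If `installed_transformers_version()` returns a string for which `needs_upgrade` is true, upgrading to 4.44.2 as described above is the suggested fix.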
🤝 Acknowledgements
We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:
Special thanks to the contributors of these libraries for their hard work and dedication!
📄 Our previous work