We are excited to announce the open-source release of our latest work: Tora: Trajectory-oriented Diffusion Transformer for Video Generation. It is the first trajectory-oriented DiT framework that concurrently integrates textual, visual, and trajectory conditions for video generation.
Features Planned
💥 Transparent video generation. (Take an RGBA image as input and output an animated RGBA video)
✅ reproduce Transparent VAE encoder and decoder according to LayerDiffuse.
✅ finetune 3D-Unet to support the basic RGBA-image-to-RGBA-video capability.
💥 Enhanced prompt-following: generating long, detailed captions using LLaVA.
💥 Replace the U-Net with DiffusionTransformer (DiT) as the base model.
💥 Variable resolutions and aspect ratios.
💥 Support Huggingface Demo / Google Colab.
✅ support svd video2video Google Colab demo. See colab.ipynb.
Download the pretrained model to output/latent, then run the following command, replacing {download_model} with the name of the model you downloaded:
To control the motion area, you can use labelme to generate a binary mask. First, use labelme to draw a polygon on the reference image.
Then we run the following command to transform the labelme json file to a mask.
labelme_json_to_dataset qingming2.json
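For readers curious what this conversion involves, here is a minimal sketch (not the labelme implementation) that rasterizes polygon annotations into a binary mask using only the Python standard library. It assumes the annotation JSON carries imageHeight, imageWidth, and a shapes list whose entries hold points; the function names and the tiny inline annotation are hypothetical. For real images, labelme_json_to_dataset remains the documented route.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def labelme_to_mask(annotation):
    """Return a height x width grid: 255 inside any polygon, 0 elsewhere."""
    height = annotation["imageHeight"]
    width = annotation["imageWidth"]
    polygons = [
        shape["points"]
        for shape in annotation["shapes"]
        if shape.get("shape_type", "polygon") == "polygon"
    ]
    mask = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            # Sample each pixel at its centre.
            if any(point_in_polygon(col + 0.5, row + 0.5, p) for p in polygons):
                mask[row][col] = 255
    return mask


# Hand-written annotation standing in for a real labelme file (assumption).
example = {
    "imageHeight": 4,
    "imageWidth": 6,
    "shapes": [
        {
            "label": "motion",
            "shape_type": "polygon",
            "points": [[1, 1], [4, 1], [4, 3], [1, 3]],
        }
    ],
}
mask = labelme_to_mask(example)
for line in mask:
    print(line)
```

The per-pixel loop is slow on full-size images; it is meant only to show that white pixels mark the region allowed to move and black pixels the region held still.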
Then run the following command for inference:
python train.py --config output/latent/{download_model}/config.yaml --eval validation_data.prompt_image=example/qingming2.jpg validation_data.prompt='Peoples are walking on the street.' validation_data.mask=example/qingming2_label.jpg
You can adjust the motion strength by using the mask motion model:
python train.py --config output/latent/{download_model}/config.yaml --eval validation_data.prompt_image=example/qingming2.jpg validation_data.prompt='Peoples are walking on the street.' validation_data.mask=example/qingming2_label.jpg validation_data.strength=5
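To compare several motion strengths, the command above can be assembled programmatically. The sketch below is a hypothetical helper, not part of this repository: it only builds the same key=value overrides shown in the commands above. The model name my_model and the strength values 1, 5, and 10 are placeholders chosen for illustration; the valid range of validation_data.strength is not stated here.

```python
import shlex


def build_eval_command(model_name, overrides):
    """Assemble the `train.py --eval` invocation as an argument list."""
    args = [
        "python",
        "train.py",
        "--config",
        f"output/latent/{model_name}/config.yaml",
        "--eval",
    ]
    # Each override is passed as a single key=value token after --eval.
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


base = {
    "validation_data.prompt_image": "example/qingming2.jpg",
    "validation_data.prompt": "Peoples are walking on the street.",
    "validation_data.mask": "example/qingming2_label.jpg",
}

# Placeholder strengths (assumption); print one shell-quoted command per value.
for strength in (1, 5, 10):
    cmd = build_eval_command("my_model", {**base, "validation_data.strength": strength})
    print(shlex.join(cmd))
```

Each argument list can also be passed to subprocess.run(cmd, check=True) to execute the runs one after another instead of printing them.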
Video super resolution
The model outputs low-resolution videos. You can use a video super-resolution model to upscale them. For example, you can use Real-CUGAN, a super-resolution model for cartoon-style video:
git clone https://github.com/bilibili/ailab.git
cd ailab/Real-CUGAN
python inference_video.py
Training
Using Captions
You can use caption files when training with video. Simply place the videos into a folder and create a json with captions like this:
[
{"caption": "Cute monster character flat design animation video", "video": "000001_000050/1066697179.mp4"},
{"caption": "Landscape of the cherry blossom", "video": "000001_000050/1066688836.mp4"}
]
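A caption file in this shape can be generated with a short script. The following is a minimal sketch under stated assumptions: the "video" field is taken to be a path relative to the video folder (as the example entries suggest), captions are supplied as a dictionary, and the helper names build_caption_entries and write_caption_json are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def build_caption_entries(video_dir, captions):
    """Pair each .mp4 under video_dir with its caption.

    `captions` maps a path relative to video_dir to the caption text.
    Videos without a caption are skipped.
    """
    root = Path(video_dir)
    entries = []
    for path in sorted(root.rglob("*.mp4")):
        rel = path.relative_to(root).as_posix()
        if rel in captions:
            entries.append({"caption": captions[rel], "video": rel})
    return entries


def write_caption_json(entries, json_path):
    """Write the entries as a JSON list, matching the format shown above."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)


# Demo on a temporary folder with an empty placeholder file instead of a real video.
with tempfile.TemporaryDirectory() as tmp:
    clip = Path(tmp) / "000001_000050" / "1066697179.mp4"
    clip.parent.mkdir(parents=True)
    clip.touch()
    entries = build_caption_entries(
        tmp,
        {"000001_000050/1066697179.mp4": "Cute monster character flat design animation video"},
    )
    out = Path(tmp) / "captions.json"
    write_caption_json(entries, out)
    loaded = json.loads(out.read_text(encoding="utf-8"))

print(loaded)
```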
Then in your config, make sure to set dataset_types to video_json and set the video_dir and video json path like this:
Bibtex
Please cite this paper if you find the code useful for your research:
@misc{dai2023animateanything,
title={AnimateAnything: Fine-Grained Open Domain Image Animation with Motion Guidance},
author={Zuozhuo Dai and Zhenghao Zhang and Yao Yao and Bingxue Qiu and Siyu Zhu and Long Qin and Weizhi Wang},
year={2023},
eprint={2311.12886},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Zuozhuo Dai, Zhenghao Zhang, Menghao Li, Junchao Liao, Siyu Zhu, Long Qin, Weizhi Wang
Friendship Link 🔥
Showcases
https://github.com/alibaba/animate-anything/assets/1107525/e2659674-c813-402a-8a85-e620f0a6a454
Framework
News 🔥
2024.2.5: Added multi-GPU training via Accelerate with DeepSpeed. With DeepSpeed zero_stage 2 and offload_optimizer_device set to cpu, you can now fully finetune animate-anything on 4x16G V100 GPUs and SVD on 4x24G A10 GPUs.
2023.12.27: Added finetuning based on the SVD (Stable Video Diffusion) model. Released the SVD-based animate_anything_svd_v1.0.
2023.12.18: Updated the model to animate_anything_512_v1.02.
Getting Started
This repository is based on Text-To-Video-Finetuning.
Create Conda Environment (Optional)
It is recommended to install Anaconda.
Windows Installation: https://docs.anaconda.com/anaconda/install/windows/
Linux Installation: https://docs.anaconda.com/anaconda/install/linux/
Python Requirements
Running inference
Training
Using Captions
You can use caption files when training with video. Simply place the videos into a folder and create a json with captions like this:
Then in your config, make sure to set dataset_types to video_json and set the video_dir and video json path like this:
Process Automatically
You can automatically caption the videos using the Video-BLIP2-Preprocessor Script and set the dataset_types and json_path like this:
Configuration
The configuration uses a YAML config borrowed from the Tune-A-Video repository.
All configuration details are in example/train_mask_motion.yaml. Each parameter is documented with a description of what it does.
Finetuning animate-anything
You can finetune animate-anything with text, motion mask, and motion strength guidance on your own dataset. The following config requires around 30G of GPU RAM. To lower that, reduce train_batch_size, train_data.width, train_data.height, and n_sample_frames in the config:
We also support LoRA finetuning:
Finetune Stable Video Diffusion:
The Stable Video Diffusion (SVD) img2vid model can generate high-resolution videos, but it has no text or motion mask control. You can finetune SVD with motion mask guidance using the following commands and the pretrained SVD model. This config requires around 80G of GPU RAM.
If you only want to finetune SVD on your own dataset without motion mask control, please use the following config:
Multiple GPUs training
We strongly recommend multi-GPU training with Accelerate, which greatly reduces the VRAM requirement. First configure Accelerate with DeepSpeed; an example config is located in example/deepspeed.yaml.
Then replace the ‘python train_xx.py …’ commands above with ‘accelerate launch train_xx.py …’, for example:
SVD video2video
We have released the finetuned vid2vid SVD model, which you can try through the Gradio UI.
Download the vid2vid_SVD model, extract it to output/svd/{download_model}, and then run the command:
We provide several examples in the svd_video2video_examples directory.
Shoutouts