EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion

This repo contains PyTorch model definitions, pre-trained weights, and inference code for our video generation model, EchoVideo.

News
[2025.02.27] We release the inference code and model weights of EchoVideo. Download
Introduction
EchoVideo generates a personalized video from a single photo and a text description. It excels at addressing the “semantic conflict” and “copy-paste” problems and demonstrates state-of-the-art performance.
Gallery
We strongly recommend visiting this link for more results.
1. Text-to-Video Generation
Face-ID Preserving
Full-Body Preserving
2. Comparisons
EchoVideo
ConsisID
IDAnimator
Usage
Requires Python 3.10–3.12 (inclusive of both). Both GPU and NPU are supported.
Clone the repository:
git clone https://github.com/bytedance/EchoVideo
cd EchoVideo
Installation
pip install -r requirements.txt
Download Pretrained Weights
Details on downloading the pretrained models are shown here.
Run Demo
Methods
Overall Architecture
Overall architecture of EchoVideo. By employing a meticulously designed IITF module and mitigating over-reliance on the input image, our model unifies the semantic information of the input facial image and the textual prompt. This integration enables the generation of consistent characters with multi-view facial coherence, ensuring that the synthesized outputs maintain both visual and semantic fidelity across diverse perspectives.
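The fusion idea described above can be illustrated with a minimal PyTorch sketch. This is not the released implementation: the module name, dimensions, and layer choices are all hypothetical, and it only shows the general pattern of aligning a face embedding with the text tokens so a single conditioning stream guides the generator.

```python
import torch
import torch.nn as nn

class IITFSketch(nn.Module):
    """Hypothetical sketch of an image-text fusion block (names and dims illustrative).

    Projects a facial embedding into the text-token space, lets the text
    tokens attend to it, and returns one fused conditioning sequence.
    """

    def __init__(self, face_dim: int = 512, text_dim: int = 768):
        super().__init__()
        # Facial feature alignment: map the face embedding into text space.
        self.face_proj = nn.Sequential(
            nn.Linear(face_dim, text_dim),
            nn.SiLU(),
            nn.Linear(text_dim, text_dim),
        )
        # Conditional feature alignment: text tokens attend to the face token.
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, face_emb: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # face_emb: (B, face_dim); text_tokens: (B, L, text_dim)
        face_tok = self.face_proj(face_emb).unsqueeze(1)        # (B, 1, text_dim)
        attended, _ = self.cross_attn(text_tokens, face_tok, face_tok)
        fused_text = self.norm(text_tokens + attended)          # identity-aware text
        # Prepend the face token so downstream layers see one unified stream.
        return torch.cat([face_tok, fused_text], dim=1)        # (B, L + 1, text_dim)
```

The key design point, per the description above, is that the downstream DiT never receives the raw face embedding and text tokens as separate, competing conditions.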
Key Features
Illustration of facial information injection methods. (a) IITF. Facial and textual information are fused to ensure consistent guidance throughout the generation process. We propose IITF to fuse text and facial information, establishing a semantic bridge between the two modalities and coordinating their influence on character features, thereby ensuring the consistency of generated characters. IITF consists of two core components: facial feature alignment and conditional feature alignment. (b) Dual branch. Facial and textual information are injected independently through cross-attention, providing separate guidance for the generation process.
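For contrast, the dual-branch scheme in (b) can be sketched as two independent cross-attention injections into the latent stream. Again, every name and dimension here is hypothetical; the sketch only shows why the two conditions can pull the generation in different directions when they are never fused.

```python
import torch
import torch.nn as nn

class DualBranchSketch(nn.Module):
    """Hypothetical sketch of the dual-branch baseline in (b).

    Face and text conditions are injected through separate cross-attention
    layers, so each guides the latents independently; this is the
    competing-guidance pattern that joint fusion is designed to avoid.
    """

    def __init__(self, latent_dim: int = 768, face_dim: int = 512,
                 text_dim: int = 768, heads: int = 8):
        super().__init__()
        self.face_kv = nn.Linear(face_dim, latent_dim)
        self.text_kv = nn.Linear(text_dim, latent_dim)
        self.attn_face = nn.MultiheadAttention(latent_dim, heads, batch_first=True)
        self.attn_text = nn.MultiheadAttention(latent_dim, heads, batch_first=True)

    def forward(self, latents, face_tokens, text_tokens):
        # latents: (B, N, D); each condition has its own injection branch.
        f = self.face_kv(face_tokens)
        t = self.text_kv(text_tokens)
        latents = latents + self.attn_face(latents, f, f)[0]   # face branch
        latents = latents + self.attn_text(latents, t, t)[0]   # text branch
        return latents
```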
Benchmark
| Model      | Identity Average↑ | Identity Variation↓ | Inception Distance↓ | Dynamic Degree↑ |
|------------|-------------------|---------------------|---------------------|-----------------|
| IDAnimator | 0.349             | 0.032               | 159.11              | 0.280           |
| ConsisID   | 0.414             | 0.094               | 200.40              | 0.871           |
| pika       | 0.329             | 0.091               | 268.35              | 0.954           |
| Ours       | 0.516             | 0.075               | 176.53              | 0.955           |
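Identity metrics of this kind are commonly computed from cosine similarity between face embeddings of the generated frames and the reference photo; the exact protocol used for the table may differ, so the sketch below is only an illustrative reading of the two identity columns (function name and inputs are hypothetical).

```python
import numpy as np

def identity_scores(ref_emb: np.ndarray, frame_embs: np.ndarray):
    """Illustrative identity metrics from face embeddings.

    Identity Average: mean cosine similarity between the reference
    embedding and each frame's embedding (higher = better preservation).
    Identity Variation: standard deviation of those similarities across
    frames (lower = more stable identity).
    """
    ref = ref_emb / np.linalg.norm(ref_emb)
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sims = frames @ ref  # cosine similarity per frame
    return float(sims.mean()), float(sims.std())
```

A video that reproduces the reference face perfectly in every frame would score 1.0 average and 0.0 variation under this reading.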
Acknowledgements
CogVideo: the DiT module we adapted and the VAE module we used.
BibTeX
If you find our work useful in your research, please consider citing our paper:
@article{wei2025echovideo,
title={EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion},
author={Wei, Jiangchuan and Yan, Shiyue and Lin, Wenfeng and Liu, Boyuan and Chen, Renjie and Guo, Mingyu},
journal={arXiv preprint arXiv:2501.13452},
year={2025}
}