
ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion

🫖 Introduction

In this work, we present ERTACache, a principled and efficient caching framework for accelerating diffusion model inference. By decomposing cache-induced degradation into feature shift and step amplification errors, we develop a dual-path correction strategy that combines offline-calibrated reuse scheduling, trajectory-aware timestep adjustment, and closed-form residual rectification. The following figure gives an overview of the ERTACache framework, which adopts a dual-dimensional correction strategy:

1. Offline policy calibration: we search for a globally effective cache schedule using residual error profiling.
2. Trajectory-aware timestep adjustment: we mitigate the integration drift caused by reused features.
3. Explicit error rectification: we analytically approximate and rectify the additive error introduced by cached outputs, enabling accurate reconstruction with negligible overhead.

[Figure: Overview of the ERTACache framework]
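To make the three components above concrete, here is a minimal sketch of a cached sampling loop. This is not the released implementation: `cache_schedule`, `delta_t`, and `rect_term` are hypothetical names for the offline-calibrated quantities described above, and a diffusers-style scheduler API is assumed.

```python
# Conceptual sketch of an ERTACache-style sampling loop (illustrative only).
# Assumes a diffusers-style scheduler exposing
# `.step(model_output, t, latents).prev_sample`.

def ertacache_sample(model, scheduler, latents, timesteps,
                     cache_schedule, delta_t, rect_term):
    """cache_schedule[i] : True if step i reuses the cached output
                           (the globally searched schedule from step 1).
    delta_t[i]           : calibrated timestep offset for cached steps (step 2).
    rect_term[i]         : closed-form estimate of the additive cache error (step 3)."""
    cached_output = None
    for i, t in enumerate(timesteps):
        if cache_schedule[i] and cached_output is not None:
            # (3) Explicit error rectification: correct the additive error
            # introduced by reusing the cached output.
            model_output = cached_output + rect_term[i]
            # (2) Trajectory-aware timestep adjustment: step the solver at a
            # shifted timestep to compensate for integration drift.
            latents = scheduler.step(model_output, t + delta_t[i], latents).prev_sample
        else:
            # Non-cached step: run the full forward pass and refresh the cache.
            cached_output = model(latents, t)
            latents = scheduler.step(cached_output, t, latents).prev_sample
    return latents
```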

As shown in the figure below, ERTACache preserves fine-grained visual details and frame-to-frame consistency, outperforming TeaCache and matching the performance of the non-cache reference. In video generation tasks using CogVideoX, Wan2.1-1.3B, and OpenSora 1.2, ERTACache achieves noticeably better temporal consistency, particularly between the first and last frames. When applied to the FLUX-dev 1.0 image model, it enhances visual richness and detail. These results highlight ERTACache as an effective solution that balances visual quality and computational efficiency for consistent video generation.

[Figure: Qualitative comparison with TeaCache and the non-cache reference]

Unlike prior heuristics-based methods, ERTACache provides a theoretically grounded yet lightweight solution that significantly reduces redundant computations while maintaining high-fidelity outputs. Empirical results across multiple benchmarks validate its effectiveness and generality, highlighting its potential as a practical solution for efficient generative sampling.

🎉 Supported Models

Text to Video

  • ERTACache4Wan2.1
  • ERTACache4CogVideoX-2B
  • ERTACache4OpenSora1.2

Text to Image

  • ERTACache4FLUX

📈 Inference Comparisons on a Single A800

| Model | Method | LPIPS ↓ | SSIM ↑ | PSNR ↑ | Latency (s) ↓ |
|---|---|---|---|---|---|
| OpenSora 1.2 | TeaCache | 0.2511 | 0.7477 | 19.10 | 19.84 |
| OpenSora 1.2 | ERTACache | 0.1659 | 0.8170 | 22.34 | 18.04 |
| CogVideoX-2B | TeaCache | 0.2057 | 0.7614 | 20.97 | 26.88 |
| CogVideoX-2B | ERTACache | 0.1012 | 0.8702 | 26.44 | 26.78 |
| Wan2.1-1.3B | TeaCache | 0.2913 | 0.5685 | 16.17 | 99.5 |
| Wan2.1-1.3B | ERTACache | 0.1095 | 0.8200 | 23.77 | 91.7 |
| FLUX-dev 1.0 | TeaCache | 0.4427 | 0.7445 | 16.47 | 14.21 |
| FLUX-dev 1.0 | ERTACache | 0.3029 | 0.8962 | 20.51 | 14.01 |

Installation

The environment setup depends on the specific model. For example, for FLUX, install the FLUX dependencies:

```bash
pip install --upgrade diffusers[torch] transformers protobuf tokenizers sentencepiece
```
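As a quick sanity check that the dependencies are installed, you can run plain (uncached) FLUX inference through the public diffusers API. This is a minimal sketch independent of ERTACache itself; the model ID, prompt, and output path are illustrative.

```python
import torch
from diffusers import FluxPipeline

# Plain FLUX.1-dev inference via diffusers, without caching, just to verify
# that the environment is set up correctly.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe("a cup of tea on a wooden table", num_inference_steps=28).images[0]
image.save("sanity_check.png")
```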

Usage

For all supported models, enter the model-specific folder (for example, `ERTACache4FLUX`), then run the following command; the outputs are saved in the `./sample` folder:

```bash
sh run.sh
```

💐 Acknowledgement

This repository is built upon VideoSys, Diffusers, Open-Sora, CogVideoX, FLUX, and Wan2.1. Thanks for their contributions!

🔒 License
