
ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion

🫖 Introduction

In this work, we present ERTACache, a principled and efficient caching framework for accelerating diffusion model inference. By decomposing cache-induced degradation into feature shift and step amplification errors, we develop a dual-path correction strategy that combines offline-calibrated reuse scheduling, trajectory-aware timestep adjustment, and closed-form residual rectification. The following figure gives an overview of the ERTACache framework, which adopts a dual-dimensional correction strategy:

1. Offline policy calibration: we search for a globally effective cache schedule using residual error profiling.
2. Trajectory-aware timestep adjustment: we mitigate the integration drift caused by reused features.
3. Explicit error rectification: we analytically approximate and rectify the additive error introduced by cached outputs, enabling accurate reconstruction with negligible overhead.

[Figure: Overview of the ERTACache framework]
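To make the three components above concrete, here is a minimal sketch of a cached sampling loop. This is not the released implementation: `cache_schedule`, `delta_t`, and `rect_term` are hypothetical names for the offline-calibrated quantities described above, and a diffusers-style scheduler API is assumed.

```python
# Conceptual sketch of an ERTACache-style sampling loop (illustrative only).
# Assumes a diffusers-style scheduler exposing
# `.step(model_output, t, latents).prev_sample`.

def ertacache_sample(model, scheduler, latents, timesteps,
                     cache_schedule, delta_t, rect_term):
    """cache_schedule[i] : True if step i reuses the cached output
                           (the globally searched schedule from step 1).
    delta_t[i]           : calibrated timestep offset for cached steps (step 2).
    rect_term[i]         : closed-form estimate of the additive cache error (step 3)."""
    cached_output = None
    for i, t in enumerate(timesteps):
        if cache_schedule[i] and cached_output is not None:
            # (3) Explicit error rectification: correct the additive error
            # introduced by reusing the cached output.
            model_output = cached_output + rect_term[i]
            # (2) Trajectory-aware timestep adjustment: step the solver at a
            # shifted timestep to compensate for integration drift.
            latents = scheduler.step(model_output, t + delta_t[i], latents).prev_sample
        else:
            # Non-cached step: run the full forward pass and refresh the cache.
            cached_output = model(latents, t)
            latents = scheduler.step(cached_output, t, latents).prev_sample
    return latents
```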

As shown in the figure below, ERTACache preserves fine-grained visual details and frame-to-frame consistency, outperforming TeaCache and matching the performance of the non-cache reference. In video generation tasks using CogVideoX, Wan2.1-1.3B, and OpenSora 1.2, ERTACache achieves noticeably better temporal consistency, particularly between the first and last frames. When applied to the FLUX-dev 1.0 image model, it enhances visual richness and detail. These results highlight ERTACache as an effective solution that balances visual quality and computational efficiency for consistent video generation.

[Figure: Qualitative comparison with TeaCache and the non-cache reference]

Unlike prior heuristics-based methods, ERTACache provides a theoretically grounded yet lightweight solution that significantly reduces redundant computations while maintaining high-fidelity outputs. Empirical results across multiple benchmarks validate its effectiveness and generality, highlighting its potential as a practical solution for efficient generative sampling.

🎉 Supported Models

Text to Video

  • ERTACache4Wan2.1
  • ERTACache4CogVideoX-2B
  • ERTACache4OpenSora1.2

Text to Image

  • ERTACache4FLUX

📈 Inference Comparisons on a Single A800

| Model | Method | LPIPS ↓ | SSIM ↑ | PSNR ↑ | Latency (s) ↓ |
|---|---|---|---|---|---|
| OpenSora 1.2 | TeaCache | 0.2511 | 0.7477 | 19.10 | 19.84 |
| OpenSora 1.2 | ERTACache | 0.1659 | 0.8170 | 22.34 | 18.04 |
| CogVideoX-2B | TeaCache | 0.2057 | 0.7614 | 20.97 | 26.88 |
| CogVideoX-2B | ERTACache | 0.1012 | 0.8702 | 26.44 | 26.78 |
| Wan2.1-1.3B | TeaCache | 0.2913 | 0.5685 | 16.17 | 99.5 |
| Wan2.1-1.3B | ERTACache | 0.1095 | 0.8200 | 23.77 | 91.7 |
| FLUX-dev 1.0 | TeaCache | 0.4427 | 0.7445 | 16.47 | 14.21 |
| FLUX-dev 1.0 | ERTACache | 0.3029 | 0.8962 | 20.51 | 14.01 |

Installation

The environment setup depends on the specific model. For example, for FLUX, install the FLUX dependencies:

```bash
pip install --upgrade diffusers[torch] transformers protobuf tokenizers sentencepiece
```
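As a quick sanity check that the dependencies are installed, you can run plain (uncached) FLUX inference through the public diffusers API. This is a minimal sketch independent of ERTACache itself; the model ID, prompt, and output path are illustrative.

```python
import torch
from diffusers import FluxPipeline

# Plain FLUX.1-dev inference via diffusers, without caching, just to verify
# that the environment is set up correctly.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = pipe("a cup of tea on a wooden table", num_inference_steps=28).images[0]
image.save("sanity_check.png")
```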

Usage

For all supported models, enter the model-specific folder (for example, `ERTACache4FLUX`), then run the following command; the outputs are saved in the `./sample` folder:

```bash
sh run.sh
```

💐 Acknowledgement

This repository is built upon VideoSys, Diffusers, Open-Sora, CogVideoX, FLUX, and Wan2.1. Thanks for their contributions!

🔒 License
