HaploVL - A Single-Transformer Baseline for Multi-Modal Understanding
HaploVL is a multimodal understanding foundation model that delivers comprehensive cross-modal understanding capabilities for text, images, and video inputs through a single transformer architecture.
Project Updates
🔥 News: 2025/08/08: We release HaploOmni in the `model` branch.
🔥 News: 2025/05/01: HaploVL is accepted by ICML 2025.
Highlights
This repository contains the PyTorch implementation, model weights, and training code for Haplo.
🌟 Unified Architecture: Single transformer model supporting early fusion of multi-modal inputs and auto-regressive response generation
🌟 Efficient Training: Optimized training recipe leveraging pre-trained knowledge with reduced resource consumption
🌟 Scalable Design: Flexible framework supporting both Ascend NPU and GPU environments
🌟 Extended Capabilities: Native support for multiple image understanding and video processing
Citation
If you find this work useful, please consider citing:
@article{HaploVL,
title={HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding},
author={Yang, Rui and Song, Lin and Xiao, Yicheng and Huang, Runhui and Ge, Yixiao and Shan, Ying and Zhao, Hengshuang},
journal={arXiv preprint arXiv:2503.14694},
year={2025}
}
@article{xiao2025haploomni,
title={HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation},
author={Xiao, Yicheng and Song, Lin and Yang, Rui and Cheng, Cheng and Xu, Zunnan and Zhang, Zhaoyang and Ge, Yixiao and Li, Xiu and Shan, Ying},
journal={arXiv preprint arXiv:2506.02975},
year={2025}
}
Getting Started
Installation
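A minimal setup sketch for a PyTorch repository of this kind; the environment name, Python version, and repository URL are assumptions — substitute the actual values from the project page:

```shell
# Create and activate an isolated environment (names are placeholders).
conda create -n haplo python=3.10 -y
conda activate haplo

# Clone this repository (substitute the actual URL) and install it in editable mode.
git clone <REPO_URL> haplo
cd haplo
pip install -e .
```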
Quick Start
Basic usage example:
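A hedged inference sketch, assuming a Hugging Face-style interface (`AutoProcessor` / `AutoModelForCausalLM` with `trust_remote_code`); the checkpoint path, image file, and exact class names are placeholders — consult the released code for the real API:

```python
def build_messages(image_path: str, question: str) -> list[dict]:
    # Compose a single-turn multimodal conversation in the common
    # chat-template format: the image first, then the text query.
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def main() -> None:
    # Heavy imports are kept inside main() so the helper above stays
    # importable without the model dependencies installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor  # assumed API

    model_id = "<path/to/haplo-checkpoint>"  # placeholder: substitute the released weights
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
    )

    messages = build_messages("example.jpg", "Describe this image.")
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=text, images=["example.jpg"], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    print(processor.decode(output_ids[0], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```

The same message format extends naturally to multiple images or video frames by appending additional `{"type": "image", ...}` entries before the text query.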
Gradio Demo
Launch an interactive demo:
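A typical Gradio launch command; the script path, flag names, and port are assumptions — check the repository for the actual demo entry point:

```shell
# Launch the interactive demo (script name and flags are placeholders),
# then open http://localhost:7860 in a browser.
python app.py --model-path <path/to/haplo-checkpoint> --port 7860
```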
Multi-Modal Capabilities
Acknowledgement