HaploVL - A Single-Transformer Baseline for Multi-Modal Understanding
HaploVL is a multimodal understanding foundation model that delivers comprehensive cross-modal understanding capabilities for text, images, and video inputs through a single transformer architecture.
Project Updates
🔥 News: 2025/08/08: We release HaploOmni in the `model` branch.
🔥 News: 2025/05/01: HaploVL is accepted by ICML 2025.
Highlights
This repository contains the PyTorch implementation, model weights, and training code for Haplo.
🌟 Unified Architecture: Single transformer model supporting early fusion of multi-modal inputs and auto-regressive response generation
🌟 Efficient Training: Optimized training recipe leveraging pre-trained knowledge with reduced resource consumption
🌟 Scalable Design: Flexible framework supporting both Ascend NPU and GPU environments
🌟 Extended Capabilities: Native support for multiple image understanding and video processing
Citation
If you find this work useful, please consider citing:
@article{HaploVL,
title={HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding},
author={Yang, Rui and Song, Lin and Xiao, Yicheng and Huang, Runhui and Ge, Yixiao and Shan, Ying and Zhao, Hengshuang},
journal={arXiv preprint arXiv:2503.14694},
year={2025}
}
@article{xiao2025haploomni,
title={HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation},
author={Xiao, Yicheng and Song, Lin and Yang, Rui and Cheng, Cheng and Xu, Zunnan and Zhang, Zhaoyang and Ge, Yixiao and Li, Xiu and Shan, Ying},
journal={arXiv preprint arXiv:2506.02975},
year={2025}
}
Getting Started
Installation
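A minimal setup sketch for a PyTorch repository of this kind; the environment name, Python version, and repository URL are assumptions — substitute the actual values from the project page:

```shell
# Create and activate an isolated environment (names are placeholders).
conda create -n haplo python=3.10 -y
conda activate haplo

# Clone this repository (substitute the actual URL) and install it in editable mode.
git clone <REPO_URL> haplo
cd haplo
pip install -e .
```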
Quick Start
Basic usage example:
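A hedged inference sketch, assuming a Hugging Face-style interface (`AutoProcessor` / `AutoModelForCausalLM` with `trust_remote_code`); the checkpoint path, image file, and exact class names are placeholders — consult the released code for the real API:

```python
def build_messages(image_path: str, question: str) -> list[dict]:
    # Compose a single-turn multimodal conversation in the common
    # chat-template format: the image first, then the text query.
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def main() -> None:
    # Heavy imports are kept inside main() so the helper above stays
    # importable without the model dependencies installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor  # assumed API

    model_id = "<path/to/haplo-checkpoint>"  # placeholder: substitute the released weights
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
    )

    messages = build_messages("example.jpg", "Describe this image.")
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=text, images=["example.jpg"], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    print(processor.decode(output_ids[0], skip_special_tokens=True))


if __name__ == "__main__":
    main()
```

The same message format extends naturally to multiple images or video frames by appending additional `{"type": "image", ...}` entries before the text query.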
Gradio Demo
Launch an interactive demo:
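A typical Gradio launch command; the script path, flag names, and port are assumptions — check the repository for the actual demo entry point:

```shell
# Launch the interactive demo (script name and flags are placeholders),
# then open http://localhost:7860 in a browser.
python app.py --model-path <path/to/haplo-checkpoint> --port 7860
```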
Multi-Modal Capabilities
Acknowledgement