MammothModa2: A Unified AR-Diffusion Framework for Visual Understanding and Generation
🌐 Homepage | 📑 Technical Report
Introduction
MammothModa2 is a unified autoregressive-diffusion (AR-Diffusion) framework that seamlessly integrates multimodal understanding and generation within a single model. The latest version, Mammoth2.5, is powered by Qwen3-VL and a fine-grained Mixture-of-Experts (MoE) Diffusion Transformer (DiT) with 128 routed experts, bringing multimodal understanding, generation, and editing together in a single architecture. One unified model supports text-to-image, text-to-video, image editing, and video editing with state-of-the-art performance and compute-efficient inference.
Key Features:
Effective Fine-Grained MoE Architecture: Mammoth2.5 employs a fine-grained Mixture-of-Experts (MoE) design with 128 routed experts and Top-8 routing to scale the DiT backbone to 25B total parameters while activating only 3B per forward pass (12%); a minimal routing sketch follows this list. This yields over 12x faster inference than Wan2.2 A14B on a single device.
Unified Visual Generation & Editing: A single AR-Diffusion framework built on Qwen3-VL for multimodal understanding and an MoE DiT backbone for generation. One unified model supports text-to-image, text-to-video, image editing, and video editing, eliminating the need for separate task-specific models.
SOTA Performance with Compute-Efficient Inference: Video generation reaches top-tier open-source quality on VBench 2.0 (61.64, on par with HunyuanVideo 1.5 and LongCat-Video) with only 110s latency, 12x faster than Wan2.2 A14B and 18x faster than LongCat-Video. Video editing achieves SOTA across all three public benchmarks: #1 on OpenVE-Bench (3.86, surpassing the proprietary Kling O1), #1 on FiVE-Bench (87.41 vs. 73.53 for the next best), and best overall on Reco-Bench, all with up to ~10x faster inference than comparable editing baselines.
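To make the fine-grained routing concrete, below is a minimal, illustrative Top-8-of-128 MoE layer in PyTorch. This is a sketch only, not MammothModa2's actual implementation: the class name, expert shapes, and hidden sizes are assumptions for illustration, and the parameter-count check at the end simply solves for per-expert size from the stated 25B-total / 3B-active figures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoE(nn.Module):
    """Illustrative Top-K MoE layer: 128 small routed experts, 8 active per token.
    A sketch under stated assumptions, not MammothModa2's code."""

    def __init__(self, dim: int, num_experts: int = 128, top_k: int = 8):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts, bias=False)  # router logits
        # Fine-grained design: many small experts instead of a few large ones.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, dim]
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)  # Top-8 experts per token
        weights = F.softmax(weights, dim=-1)                  # renormalize over the 8
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():                   # run each chosen expert once
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

# Tiny smoke test.
y = FineGrainedMoE(dim=64)(torch.randn(10, 64))
print(y.shape)  # torch.Size([10, 64])

# Back-of-the-envelope check of "25B total, 3B active": with shared parameters s
# and per-expert size e, s + 128*e = 25e9 and s + 8*e = 3e9 give:
e = (25e9 - 3e9) / (128 - 8)  # ~0.18B parameters per routed expert
s = 3e9 - 8 * e               # ~1.53B shared (attention, embeddings, router, ...)
print(f"per-expert ~{e/1e9:.2f}B, shared ~{s/1e9:.2f}B, active fraction ~{3/25:.0%}")
```

Only 8 of the 128 expert FFNs run for each token, which is why the activated parameter count (and hence per-step compute) stays near 12% of the total.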
🎉 News
2026-02-15: 🔥Released MammothModa2.5 inference code for Video Generation and Video Editing! Model weights coming soon. Check out our Project Page.
2025-12-31: 🔥Released MammothModa2 with MoE DiT architecture, now supporting Video Generation! Check out our new Project Page. Code is available at MammothModa25.
2025-12-10: 🔥MammothModa2-Dev, built on Qwen3VL-8B with support for Image Editing, is now available at HuggingFace.
2025-10-01: 🔥MammothModa2-Preview models are now available at HuggingFace. Note: To use the Preview version, please switch to the qwen25vl branch.
Showcases
Text-to-Video Generation
Mammoth2.5 supports high-quality text-to-video generation with over 12x faster inference than dense models of comparable capacity.
Cinematic Shots
Prompt
"A high-angle shot of a young Chinese woman with dark, slightly tousled long curly hair, wearing a gleaming pearl necklace and round gold earrings. Her messy hair is blown loose by the wind as she tilts her head slightly upward, gazing at the sky with a deeply sorrowful expression and tears in her eyes. Her lips are painted with red lipstick. The background is a pattern of ornate red motifs. The frame has a retro cinematic look with low-saturation tones and a slight soft focus that heightens the emotional atmosphere, its texture evoking classic 1990s film stock for a nostalgic, dramatic feel."
Animal Interaction
Prompt
"A medium shot of a chameleon carefully crawling along a tree branch, its feet gripping tightly to the bark. The camera captures the slow, deliberate movements, the slight shifting of colors, and the independent movement of its eyes."
Motion
Prompt
"A man wearing a black leather jacket and sunglasses rides a motorcycle down a winding mountain road, the road is carved into the mountainside, the scenery is breathtaking with steep cliffs and deep valleys, the sky is clear and blue, the camera follows the motorcycle from behind, capturing the speed and freedom of the ride, the motorcycle is sleek and black, the man's jacket flutters in the wind, the scene is exhilarating and cinematic.
"
Scenery
Prompt
"A man wearing a green raincoat and boots walks through a dense forest in the rain, the trees are tall and create a canopy overhead, the rain is visible as it falls through the trees, the ground is covered in fallen leaves, the scene is moody and atmospheric, captured with a handheld camera, the man is slightly hunched, protecting himself from the rain, the forest is dark and mysterious, the rain creates a peaceful ambiance."
High-Speed Motion
Prompt
"In a magical, floating island world, a young adventurer with a jetpack soars through the sky, dodging floating rocks and mystical creatures. The camera follows the adventurer from behind, offering a sweeping view of the vast, interconnected islands, each with its unique terrain and ecosystem. The animation features fluid, high-speed flying sequences, with the adventurer narrowly avoiding obstacles and discovering hidden treasure."
```python
import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor

from mammothmoda2.model import Mammothmoda2Model

# Mammothmoda2 model and processor loading.
model = Mammothmoda2Model.from_pretrained(
    "bytedance-research/MammothModa2-Preview",
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",
).to("cuda")
print(f"model.device={model.device}")
processor = AutoProcessor.from_pretrained("bytedance-research/MammothModa2-Preview")

# Mammothmoda2 inputs preprocessing.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "doc/example0.png",
            },
            {
                "type": "text",
                "text": "In this scene, based on the man's facial expression and body language, can we infer his emotional state?",
            },
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    padding_side="left",
    return_tensors="pt",
    return_token_type_ids=False,
).to("cuda")

# Mammothmoda2 model generation and decoding; the prompt tokens are trimmed
# from each output sequence before decoding.
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    generated_ids = model.generate(**inputs)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
```
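The call above uses the default generation settings. Assuming Mammothmoda2Model follows the standard transformers generate interface (as the snippet suggests), the usual decoding controls can be passed through; the kwargs below are generic transformers parameters, not MammothModa2-specific options:

```python
# Standard transformers generation kwargs (assumes the usual GenerationMixin API).
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,  # cap the length of the generated answer
    do_sample=False,     # greedy decoding for reproducible outputs
)
```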
📊 Benchmark Results
Text-to-Image
| Model | Model Size | GenEval | DPGBench |
| --- | --- | --- | --- |
| **Generation** | | | |
| SDXL | - | 0.55 | 74.65 |
| DALL-E 3 | - | 0.67 | 83.50 |
| FLUX.1-dev | - | 0.67 | 84.00 |
| SD3.5-Medium* | - | 0.65 | 83.86 |
| **Unified** | | | |
| Emu3 | 8B | 0.66 | 80.60 |
| Janus-Pro | 7B | 0.80 | 84.19 |
| MetaQuery-XL | 7B + 1.6B | 0.80 | 82.05 |
| UniWorld-V1 | 7B + 12B | 0.84 | 81.38 |
| Blip3-o-8B | 7B + 1.4B | 0.84 | 81.60 |
| OmniGen2 | 3B + 4B | 0.86 | 83.57 |
| Ovis-U1 | 2.4B + 1.2B | 0.89 | 83.72 |
| UniPic2 | 7B + 2B | 0.90 | 83.79 |
| BAGEL | 7B + 7B | 0.88 | 85.07 |
| Show-o2 | 7B | 0.76 | 86.14 |
| GPT-4o | - | 0.84 | 86.23 |
| MammothModa2 | 8B + (3B + 2B) | 0.87 | 87.2 |
Note: Model sizes in “A + B” format indicate separate understanding (A) and generation (B) parameters. Models without “+” share parameters for both tasks. MammothModa2 uses an 8B + (3B + 2B) architecture: 8B parameters for understanding, with the generation part consisting of 3B parameters in the AR (MLLM backbone) and 2B in the DiT component.
Text-to-Video (VBench 2.0)
| Model | Total |
| --- | --- |
| **Proprietary** | |
| Sora-480p | 58.38 |
| Kling1.6 | 59.00 |
| Vidu Q1 | 62.70 |
| Seedance 1.0 Pro | 59.81 |
| Veo3 | 66.72 |
| **Open Source** | |
| HunyuanVideo | 55.30 |
| Wan2.1 | 60.20 |
| LongCat-Video | 62.11 |
| Mammoth2.5 | 61.64 |
Image Editing
| Model | ImgEdit Avg. | GEdit-EN Overall |
| --- | --- | --- |
| **Proprietary** | | |
| Gemini 2.5 | 4.30 | 7.17 |
| GPT-4o | 4.30 | 7.48 |
| Seedream 4 | 4.46 | 7.72 |
| **Open Source** | | |
| Flux-Kontext-Dev | 4.09 | 6.53 |
| Step1x-Edit | 4.01 | 6.87 |
| Mammoth2 | 4.06 | 6.82 |
| VInO | 4.18 | 6.88 |
| Mammoth2.5 | 4.22 | 7.05 |
Video Editing
Mammoth2.5 achieves state-of-the-art video editing performance, ranking #1 on OpenVE-Bench, #1 on FiVE-Bench, and best overall on Reco-Bench.
| Model | OpenVE-Bench | FiVE-Acc |
| --- | --- | --- |
| **Proprietary** | | |
| Kling O1 | 3.69 | - |
| **Open Source** | | |
| VACE-14B | 1.65 | - |
| Wan-Edit | - | 46.97 |
| Omni-Video2 | - | 73.53 |
| VInO | 3.21 | - |
| Mammoth2.5 | 3.86 | 87.41 |
Acknowledgement
We are grateful to the following open-source projects:
Citation
If you find MammothModa2 useful in your research, please cite:
```bibtex
@article{shen2025mammothmoda2,
  title={MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation},
  author={Shen, Tao and Wan, Xin and Chen, Taicai and Zhang, Rui and Pan, Junwen and Lu, Dawei and Lei, Fanding and Lu, Zhilin and Yang, Yunfei and Cheng, Chen and She, Qi and Liu, Chang and Sun, Zhenbang},
  journal={arXiv preprint arXiv:2511.18262},
  year={2025},
  url={https://arxiv.org/abs/2511.18262}
}
```
🎯 Join Our Team
Moderation LLM Team @ ByteDance - We’re hiring talented individuals passionate about multimodal AI, computer vision, and MLLM development!
We develop leading MLLMs for content moderation, building infrastructure including model benchmarking, data pipelines, efficient architectures, and training methodologies.
Recent Publications (2024–2026):
Pan, J., Zhang, Q., Zhang, R., Lu, M., Wan, X., Zhang, Y., Liu, C., & She, Q. (2025). TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning. ICLR 26.
Li, Y., Wang, Y., Zhu, Y., Zhao, Z., Lu, M., She, Q., & Zhang, S. (2025). BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models. ICLR 26.
Li, Z., Qian, D., Su, K., Diao, Q., Xia, X., Liu, C., … & Yuan, Z. (2025). BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration. ICLR 26.
Zhang, Q., Cheng, A., Lu, M., Zhuo, Z., Wang, M., Cao, J., Guo, S., She, Q., & Zhang, S. (2025). Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs. ICCV 25.
Xie, R., Du, C., Song, P., & Liu, C. (2025). MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding. ICCV 25.
Zhang, Q., Liu, M., Li, L., Lu, M., Zhang, Y., Pan, J., She, Q., & Zhang, S. (2025). Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs. NeurIPS 25.
Lin, L., Shi, D., Han, A., Chen, F., Chen, Q., Li, J., … & Gao, J. (2025). ACT as human: Multimodal large language model data annotation with critical thinking. NeurIPS 25.
Yu, S., Jin, C., Wang, H., Chen, Z., Jin, S., Zuo, Z., … & Sun, Q. (2024). Frame-Voyager: Learning to Query Frames for Video Large Language Models. ICLR 25.
Pan, J., Zhang, R., Wan, X., Zhang, Y., Lu, M., & She, Q. (2025). TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-Like Long Video Understanding. arXiv preprint arXiv:2504.01407.
Liu, Z., Pan, J., She, Q., Gao, Y., & Xia, G. (2025). On the Faithfulness of Visual Thinking: Measurement and Enhancement. arXiv preprint arXiv:2510.23482.
Zhang, Y., Fan, C.-K., Huang, T., Lu, M., Yu, S., Pan, J., Cheng, K., She, Q., & Zhang, S. (2025). AutoV: Learning to Retrieve Visual Prompt for Large Vision-Language Models. arXiv preprint arXiv:2506.16112.
Zhang, Y., Lu, M., Pan, J., Huang, T., Cheng, K., Liu, C., She, Q., & Zhang, S. (2025). ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better. arXiv preprint arXiv:2511.17106.
Shi, H., Liang, J., Xie, R., Wu, X., Chen, C., & Liu, C. (2025). Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios. arXiv preprint arXiv:2505.10584.
Shen, T., Wan, X., Chen, T., Zhang, R., Pan, J., Lu, D., Lei, F., Lu, Z., Yang, Y., & Cheng, C. (2025). MammothModa2: A Unified AR-Diffusion Framework for Multimodal Understanding and Generation. arXiv preprint arXiv:2511.18262.
She, Q., Pan, J., Wan, X., Zhang, R., Lu, D., & Huang, K. (2024). MammothModa: Multi-Modal Large Language Model. arXiv preprint.
Contact: liuchang.lab@bytedance.com