ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao

Jun Song, Shiji Song, Gao Huang, Bo Zheng

Abstract 💡

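High-resolution Large Multimodal Models (LMMs) face two challenges: excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs typically address the quadratic complexity but still generate too many visual tokens. The excess of visual tokens is the more critical problem, because it leads to substantially higher compute cost. To address it, we propose ConvLLaVA, which uses ConvNeXt, a hierarchical backbone, as the visual encoder of the LMM in place of the Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, which avoids generating excessive visual tokens. To strengthen ConvLLaVA, we propose two key optimizations.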

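Because a ConvNeXt pretrained at low resolution performs poorly when applied directly to high-resolution inputs, we update it to bridge this gap.
In addition, because ConvNeXt's original compression ratio is insufficient for much higher-resolution inputs, we train an additional stage that further compresses the visual tokens and reduces redundancy.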

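These optimizations allow ConvLLaVA to support 1536x1536 inputs while generating only 576 visual tokens, and to handle images of arbitrary aspect ratio. Experimental results show that our method achieves performance competitive with state-of-the-art models on mainstream benchmarks.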
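The token counts reported for ConvLLaVA can be reproduced with simple arithmetic. The sketch below assumes a total downsampling stride of 64 per image side; that value is not stated in this README and is inferred from the reported resolution/token pairs (768 → 144, 1024 → 256, 1536 → 576). The patch size of 14 used for the ViT comparison is likewise an assumption chosen for illustration, and `visual_tokens` is a hypothetical helper, not part of the codebase.

```python
def visual_tokens(resolution: int, stride: int) -> int:
    """Tokens produced for a square image when the encoder shrinks
    each side by `stride` (floor division, no padding)."""
    side = resolution // stride
    return side * side


# Assumption: overall stride inferred from the reported token counts.
CONV_STRIDE = 64
# Assumption: illustrative ViT patch size, used only for comparison.
VIT_PATCH = 14

for res in (768, 1024, 1536):
    conv = visual_tokens(res, CONV_STRIDE)
    vit = visual_tokens(res, VIT_PATCH)
    print(f"{res}x{res}: stride-{CONV_STRIDE} encoder -> {conv} tokens, "
          f"patch-{VIT_PATCH} ViT -> {vit} tokens")
```

Under these assumptions the token count grows with the square of the resolution in both cases, but the larger stride keeps the count at 576 even for 1536x1536 inputs.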

Comparison between the architectures of LLaVA and ConvLLaVA

If you are interested in large multimodal models, or have a good idea, please contact me: Chunjiang Ge.

Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

Plan

Add LMMs-eval support.
Add VLMEvalKit support.
Add xtuner support.
Release weights.
Release inference code.

Installation

Clone this repository and navigate to the conv-llava folder

git clone https://github.com/alibaba/conv-llava
cd conv-llava

Install Package

conda create -n convllava python=3.11 -y
conda activate convllava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Install additional packages for training cases

pip install -e ".[train]"
pip install flash-attn --no-build-isolation
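After installation, a quick check that the key packages are importable can save time before launching a long job. The following is a minimal sketch rather than part of the project: `check_modules` is a hypothetical helper, and the module names `torch` and `flash_attn` are assumed to be the ones the training setup relies on.

```python
import importlib.util


def check_modules(names=("torch", "flash_attn")):
    """Return {module_name: True/False} depending on whether each
    top-level module can be found in the current environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        status = "ok" if found else "MISSING"
        print(f"{name}: {status}")
```

The check only looks the modules up without importing them, so it runs quickly and does not need a GPU.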

Model Zoo

The performance of our models on several benchmarks is as follows:

Method	Resolution	Visual Tokens	LLM	MME	MMB	SEED	RealWorldQA	MMMU	MMVet	Text	Doc	POPE
ConvLLaVA	768	144	7B	1541	68	68.8	55.9	36.3	44.8	59.1	44.8	87.3
ConvLLaVA	1024	256	7B	1553	68.8	69.3	58.8	35.1	44.4	62.5	48.5	87.7
ConvLLaVA	1536	576	7B	1575	68.7	70.2	59.9	35.8	45.9	65.8	59	87.3

Method	Resolution	Visual Tokens	LLM	RefCOCO val	RefCOCO test-A	RefCOCO test-B	RefCOCO+ val	RefCOCO+ test-A	RefCOCO+ test-B	RefCOCOg val	RefCOCOg test	Avg
ConvLLaVA	768	144	7B	84.5	89.0	79.2	77.7	84.9	69.7	79.8	79.7	80.6
ConvLLaVA	1024	256	7B	85.5	89.6	78.8	79.3	86.1	70.3	80.6	81.2	81.4
ConvLLaVA	1536	576	7B	86.5	90.6	80.5	80.0	86.8	71.5	82.0	82.4	82.3

Our Model Zoo lists the main checkpoints and how to download them, along with instructions for using them.

Dataset

The data used in our experiments is described in Data.md.

Training

The training hyperparameters are as follows:

Hyperparameters	Stage 1	Stage 2	Stage 3
Learning Rate	3e-4	2e-5	2e-5
Batch Size	256	256	128
Epochs	1	1	1
Warmup Ratio	0.03	0.03	0.03
Weight Decay	0	0	0
Optimizer	AdamW	AdamW	AdamW
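The batch sizes in the table are global batch sizes, so on a given machine they have to be split across GPUs and gradient-accumulation steps. The sketch below records the table as plain Python data and derives the accumulation steps. `StageConfig` and `grad_accum_steps` are hypothetical names for illustration, and the 8 GPUs and per-device batch size of 16 in the usage lines are assumptions, not settings taken from the project's scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    learning_rate: float
    global_batch_size: int
    epochs: int = 1
    warmup_ratio: float = 0.03
    weight_decay: float = 0.0
    optimizer: str = "AdamW"


# Values copied from the hyperparameter table.
STAGES = {
    "stage1": StageConfig(learning_rate=3e-4, global_batch_size=256),
    "stage2": StageConfig(learning_rate=2e-5, global_batch_size=256),
    "stage3": StageConfig(learning_rate=2e-5, global_batch_size=128),
}


def grad_accum_steps(global_batch_size: int, num_gpus: int,
                     per_device_batch_size: int) -> int:
    """Accumulation steps needed so that
    num_gpus * per_device_batch_size * steps == global_batch_size."""
    per_step = num_gpus * per_device_batch_size
    if global_batch_size % per_step != 0:
        raise ValueError(
            f"global batch {global_batch_size} is not divisible by {per_step}"
        )
    return global_batch_size // per_step


# Hypothetical hardware: 8 GPUs, 16 samples per GPU per step.
for name, cfg in STAGES.items():
    steps = grad_accum_steps(cfg.global_batch_size, num_gpus=8,
                             per_device_batch_size=16)
    print(f"{name}: lr={cfg.learning_rate}, accumulation steps={steps}")
```

Raising an error on a non-divisible split is a deliberate choice here: silently rounding would change the effective batch size and make results harder to compare with the table.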

The training scripts are in the scripts folder:

Projector Initialization: stage1
Vision Language Pretraining: stage2
Instruction Tuning: stage3

Evaluation

We currently support VLMEvalKit and lmms-eval for evaluating the models. See Evaluation.md for more details.

Citation

If you find our work helpful, please cite it using the following BibTeX:

@misc{ge2024convllava,
    title={ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models},
    author={Chunjiang Ge and Sijie Cheng and Ziming Wang and Jiale Yuan and Yuan Gao and Jun Song and Shiji Song and Gao Huang and Bo Zheng},
    year={2024},
    eprint={2405.15738},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Acknowledgement

Vicuna: the codebase LLaVA is built upon, and our base model Vicuna-13B, which has amazing language capabilities!
LLaVA: the codebase we built upon.