Groma: Grounded Multimodal Assistant
Groma is an MLLM with exceptional region understanding and visual grounding capabilities. It can take user-defined region inputs (boxes) and generate long-form responses that are grounded in the visual context.
Groma presents a novel paradigm for grounded MLLMs: (a) LLM for localization (e.g., Kosmos-2, Shikra); (b) external modules for localization (e.g., LISA); and (c) visual tokenizer for localization (Groma).
Contents
Performance
Groma achieves state-of-the-art performance among multimodal large language models on referring expression comprehension (REC) benchmarks.
Installation
Clone the repository
Create the conda environment and install dependencies
Install flash-attention for training
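The steps above might look like the following. The repository URL, environment name, and Python version are assumptions, so check the repo for the exact setup.

```shell
# Hypothetical setup commands; repo URL, env name, and versions are assumptions.
git clone https://github.com/FoundationVision/Groma.git
cd Groma

# Create and activate a conda environment, then install dependencies
conda create -n groma python=3.9 -y
conda activate groma
pip install -e .

# flash-attention accelerates training; build flags may vary with your CUDA setup
pip install flash-attn --no-build-isolation
```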
Model Weights
To play with Groma, please download the model weights from Hugging Face.
We additionally provide pretrained checkpoints from intermediate training stages. You can start from any point to customize training.
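As a sketch, the weights can be fetched with the `huggingface_hub` client; the repo id below is an assumption based on the project name, so substitute the id from the model card linked in this README.

```python
# Sketch: download Groma checkpoints from the Hugging Face Hub.
# The repo id "FoundationVision/groma-7b-finetune" is an assumption, not the
# confirmed model name; check the README's Hugging Face link for the real id.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="FoundationVision/groma-7b-finetune",  # hypothetical repo id
    local_dir="checkpoints/groma-7b-finetune",
)
print(local_dir)
```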
Prepare Data
We provide instructions to download the datasets used at different training stages of Groma, including Groma Instruct, a 30k visually grounded conversation dataset constructed with GPT-4V. You don’t need to download all of them unless you want to train Groma from scratch. Please follow the instructions in DATA.md to prepare the datasets.
Training
For detection pretraining, please run
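a command along these lines (the script name is an assumption; see the repo's scripts directory for the real entry point):

```shell
# Hypothetical launch command for stage-1 detection pretraining.
bash scripts/pretrain_det.sh
```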
For alignment pretraining, please run
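a command such as (script name assumed, not confirmed by the repo):

```shell
# Hypothetical launch command for stage-2 alignment pretraining.
bash scripts/pretrain_align.sh
```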
For instruction finetuning, please run
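a command like the following (again, the script name is a placeholder):

```shell
# Hypothetical launch command for stage-3 instruction finetuning.
bash scripts/finetune.sh
```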
Inference
To test on a single image, you can run
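an invocation roughly like this; the module path, flag names, and file paths are assumptions for illustration, so consult the repo for the actual inference entry point:

```shell
# Hypothetical single-image inference command; names and flags are assumptions.
python -m groma.eval.run_groma \
    --model-name checkpoints/groma-7b-finetune \
    --image-file examples/demo.jpg \
    --query "Describe the image and ground the key objects."
```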
Evaluation
For evaluation, please refer to EVAL.md for more details.
Citation
If you find this repo useful for your research, feel free to give us a star ⭐ or cite our paper:
@article{ma2024groma,
  title={Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models},
  author={Ma, Chuofan and Jiang, Yi and Wu, Jiannan and Yuan, Zehuan and Qi, Xiaojuan},
  journal={arXiv preprint arXiv:2404.13013},
  year={2024}
}
Acknowledgement
Groma is built upon the awesome works LLaVA and GPT4ROI.
LICENSE
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.