BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
A multi-modal LLM capable of jointly understanding text, vision and audio, and grounding knowledge into visual objects.
[Project Page] [Arxiv] [Demo Video] [Gradio] [Data] [Model]
BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
Yang Zhao*, Zhijie Lin*, Daquan Zhou, Zilong Huang, Jiashi Feng and Bingyi Kang† (*Equal Contribution, †Project Lead)
Bytedance Inc.
News🔥
2023/07/21 - Huggingface demo released!
Setup
Clone this repository and navigate to the current folder.
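A minimal sketch of this step; the repository URL is an assumption (it is not stated in this README), so adjust it to wherever you obtained the code:
# clone the code and enter the project root (URL assumed)
git clone https://github.com/magic-research/bubogpt.git
cd bubogpt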
Environment
Our code is based on Python 3.9, CUDA 11.7 and PyTorch 2.0.1.
pip3 install -r pre-requirements.txt
pip3 install -r requirements.txt
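If you prefer an isolated environment for the install above, one possible setup is sketched below; conda is an assumption here, and any Python 3.9 environment works:
# create and activate an isolated Python 3.9 environment (environment name is arbitrary)
conda create -n bubogpt python=3.9 -y
conda activate bubogpt
# then run the two pip3 install commands above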
Models
Follow the instructions to prepare the pretrained Vicuna weights, and update llama_model in bubogpt/configs/models/mmgpt4.yaml accordingly.
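As an illustration, the config can be edited by hand or with a one-liner; the Vicuna path below is a placeholder and the exact key layout should be checked against the yaml itself:
# hypothetical: point llama_model at your local Vicuna weights (path is a placeholder)
sed -i 's|llama_model:.*|llama_model: "path/to/vicuna-7b"|' bubogpt/configs/models/mmgpt4.yaml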
## get pre-trained checkpoints
mkdir checkpoints && cd checkpoints
wget https://huggingface.co/spaces/Vision-CAIR/minigpt4/resolve/main/blip2_pretrained_flant5xxl.pth
wget https://huggingface.co/spaces/xinyu1205/recognize-anything/resolve/main/ram_swin_large_14m.pth
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://huggingface.co/spaces/abhishek/StableSAM/resolve/main/sam_vit_h_4b8939.pth
wget https://huggingface.co/magicr/BuboGPT-ckpt/resolve/main/bubogpt_7b.pth
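Optionally, a quick sanity check that all five checkpoints landed where the demo and training configs expect them:
# confirm the five downloaded checkpoints are present
ls -lh checkpoints/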
For training, download the MiniGPT-4 checkpoint to checkpoints.
Data
Stage1
Stage2
Usage
Gradio demo
Run the gradio demo with:
python3 app.py --cfg-path eval_configs/mmgpt4_eval.yaml --gpu-id 0
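The --gpu-id flag selects which GPU serves the demo; for example, to run it on the second GPU of a multi-GPU machine:
python3 app.py --cfg-path eval_configs/mmgpt4_eval.yaml --gpu-id 1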
Training
Browse the dataset config folder and replace the storage item with path/to/your/data for each dataset.
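To check that nothing was missed, a simple search over the dataset configs lists every storage entry; the config folder path below is an assumption about the repository layout:
# list every dataset config whose storage field still needs a real path (folder path is assumed)
grep -rn "storage:" bubogpt/configs/datasets/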
Stage 1: Audio pre-training
bash dist_train.sh train_configs/mmgpt4_stage1_audio.yaml
Stage 2: Multi-modal instruct tuning
Set ckpt in train_configs/mmgpt4_stage2_mm.yaml to path/to/stage1/ckpt, then run:
bash dist_train.sh train_configs/mmgpt4_stage2_mm.yaml
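As an illustrative sketch only, the checkpoint substitution can be scripted before launching; the stage-1 checkpoint path below is the README's placeholder, use the checkpoint that your stage-1 run actually wrote:
# hypothetical: write the stage-1 checkpoint path into the stage-2 config (path is a placeholder)
sed -i 's|ckpt:.*|ckpt: "path/to/stage1/ckpt"|' train_configs/mmgpt4_stage2_mm.yaml
# then launch stage 2 with the bash dist_train.sh command above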
Demo
1. Image Understanding with Grounding
2. Audio Understanding
3. Aligned Audio-Image Understanding
4. Arbitrary Audio-Image Understanding
For more demonstrations, please refer to the examples.
Acknowledgement
This codebase is mainly developed based on the following repos: