The Open-VQA annotation files are data/Open_VQA_images.jsonl and data/Open_VQA_videos.jsonl. Here is an example record:
{
"dataset": "Open_VQA_images", # the dataset name of your data
"question": "What is in the image?",
"answer": ["platform or tunnel"], # list
"index": 1,
"image": "images/places365/val_256/Places365_val_00000698.jpg", # relative path of image
"origin_dataset": "places365",
"class": "Place" # one of the eight image VQA types or two video VQA types in the Open_VQA dataset
}
You can also convert your own data to JSONL format; the keys origin_dataset and class are optional.
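A minimal sketch of converting your own data into this layout. The helper names `to_record` and `write_jsonl` are hypothetical and not part of the repository, and treating every key other than origin_dataset and class as required is an assumption drawn from the sentence above. Note that a real JSONL file holds one JSON object per line, without the `#` comments used in the example.

```python
import json
from pathlib import Path


def to_record(index, question, answers, vision_path, dataset="my_dataset",
              vision_key="image", origin_dataset=None, vqa_class=None):
    """Build one annotation record shaped like the example above (hypothetical helper)."""
    record = {
        "dataset": dataset,
        "question": question,
        # "answer" is a list, even when there is only one reference answer
        "answer": [answers] if isinstance(answers, str) else list(answers),
        "index": index,
        # "image" for image data; "video" is assumed for video data
        vision_key: vision_path,
    }
    # The two optional keys are written only when they are provided.
    if origin_dataset is not None:
        record["origin_dataset"] = origin_dataset
    if vqa_class is not None:
        record["class"] = vqa_class
    return record


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, `write_jsonl([to_record(1, "What is in the image?", "platform or tunnel", "images/my_image.jpg")], "data/my_data.jsonl")` would produce a one-line file; the file name and image path here are placeholders.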
You need to check some important settings in the config file configs/LYNX.yaml, for example:
# change this prompt for different task, this is the default prompt
prompt: "User: {question}\nBot:"
# the key must match the vision key in test_files
# when testing Open_VQA_videos.jsonl, change this to "video"
vision_prompt_dict: "image"
output_prompt_dict: "answer"
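To illustrate what these three settings imply, here is a rough sketch of how one annotation record could be mapped to a model input. The `build_sample` helper is hypothetical; how the repository actually consumes these keys is not shown here, so treat this as an assumption based on the comments in the config.

```python
# Values copied from the configs/LYNX.yaml excerpt above.
config = {
    "prompt": "User: {question}\nBot:",
    "vision_prompt_dict": "image",   # use "video" for Open_VQA_videos.jsonl
    "output_prompt_dict": "answer",
}


def build_sample(record, config):
    """Hypothetical helper: pick the prompt, vision path and references for one record."""
    vision_key = config["vision_prompt_dict"]
    if vision_key not in record:
        # This is the mismatch the config comment warns about.
        raise KeyError(
            f"record has no '{vision_key}' key; "
            "vision_prompt_dict must match the vision key in the test file"
        )
    return {
        # Fill the {question} placeholder of the prompt template.
        "text": config["prompt"].format(question=record["question"]),
        "vision_path": record[vision_key],
        "references": record[config["output_prompt_dict"]],
    }
```

With the default settings, a video record (which is assumed to carry a "video" key instead of "image") raises the `KeyError`, which is why the vision key has to be switched when testing on the video file.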
prepare checkpoint
step 1: download eva_vit_1b from the official website, put it under data/, and rename it to eva_vit_g.pth
step 2: prepare vicuna-7b and put it under data/
run python -m fastchat.model.apply_delta --base /path/to/llama-7b-hf/ --target ./data/vicuna-7b/ --delta /path/to/vicuna-7b-delta-v1.1/
step 3: download pretrain_lynx.pt or finetune_lynx.pt and put it under data/ (check that the checkpoint setting in the config matches the file you downloaded)
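Before running inference, it can help to confirm that the three steps above left the expected files in place. The following is a small sketch, not part of the repository: `missing_checkpoints` is a hypothetical helper, and the file names are taken from the steps above.

```python
from pathlib import Path


def missing_checkpoints(data_dir="data", lynx_checkpoint="finetune_lynx.pt"):
    """Return the names of expected checkpoint files or folders that are absent.

    lynx_checkpoint should be whichever of pretrain_lynx.pt / finetune_lynx.pt
    the config points to.
    """
    data_dir = Path(data_dir)
    expected = [
        data_dir / "eva_vit_g.pth",   # renamed eva_vit_1b weights (step 1)
        data_dir / "vicuna-7b",       # output folder of apply_delta (step 2)
        data_dir / lynx_checkpoint,   # Lynx checkpoint (step 3)
    ]
    return [p.name for p in expected if not p.exists()]
```

An empty list from `missing_checkpoints()` means all three items were found under data/; it does not verify that the files themselves are valid.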
If you find this repository useful, please consider giving it a ⭐ or citing:
@article{zeng2023matters,
title={What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?},
author={Zeng, Yan and Zhang, Hanbo and Zheng, Jiani and Xia, Jiangnan and Wei, Guoqiang and Wei, Yang and Zhang, Yuchen and Kong, Tao},
journal={arXiv preprint arXiv:2307.02469},
year={2023}
}
Contact
For issues using this code, please submit a GitHub issue.
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Yan Zeng*, Hanbo Zhang*, Jiani Zheng*, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong
*Equal Contribution
update
Lynx (8B parameters):
results on Open-VQA image testsets
results on Open-VQA video testsets && OwlEval human eval && MME benchmark
ablation result
Quick Start
environment
prepare data
step 1: prepare annotation file
The Open-VQA annotation files are data/Open_VQA_images.jsonl and data/Open_VQA_videos.jsonl. You can also convert your own data to JSONL format; the keys origin_dataset and class are optional.
step 2: prepare images
Download the raw images and videos from the corresponding websites: Places365 (256x256), VQAv2, OCRVQA, Something-Something-v2, MSVD-QA, NeXT-QA and MSRVTT-QA.
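Once the downloads are in place, a quick check that every path in an annotation file resolves can save a failed run. The sketch below is an illustration only: `find_missing_media` is a hypothetical helper, and the assumption that the relative paths are resolved against a single root folder (`media_root`) is not confirmed by this README.

```python
import json
from pathlib import Path


def find_missing_media(annotation_file, media_root, vision_key="image"):
    """Return (line number, relative path) for every record whose file is absent.

    media_root is the folder the relative paths are assumed to be resolved against.
    Use vision_key="video" for the video annotation file.
    """
    missing = []
    with open(annotation_file, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if not (Path(media_root) / record[vision_key]).exists():
                missing.append((line_no, record[vision_key]))
    return missing
```

For instance, `find_missing_media("data/Open_VQA_images.jsonl", "data")` would list the records whose images are not found under data/, if that is where the images were placed.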
step 3: modify the default setting in the code
You need to check some important settings in the config file configs/LYNX.yaml, such as prompt, vision_prompt_dict and output_prompt_dict.
prepare checkpoint
step 1: download eva_vit_1b from the official website, put it under data/, and rename it to eva_vit_g.pth
step 2: prepare vicuna-7b and put it under data/
download LLaMA-7b from here or from the Internet.
install FastChat: pip install git+https://github.com/lm-sys/FastChat.git
run python -m fastchat.model.apply_delta --base /path/to/llama-7b-hf/ --target ./data/vicuna-7b/ --delta /path/to/vicuna-7b-delta-v1.1/
step 3: download pretrain_lynx.pt or finetune_lynx.pt and put it under data/ (check that the checkpoint setting in the config matches the file you downloaded)
organize the files like this:
infer
Citation
License
This project is licensed under the Apache-2.0 License.