AICITY2024_Track2_AliOpenTrek_CityLLaVA
🏆 The 1st Place Solution to The 8th NVIDIA AI City Challenge (CVPR 2024 workshop) Track 2: CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario.
Leaderboard
Prepare
structures
Data Preparation
Firstly, change the directory to data_preprocess and create the data directory. Please download the wts-dataset, then put the datasets under ./data. After unzipping the datasets, the directory structure should look like this:

Then run the following script to process the test data:
After this script is executed, all the test data is prepared. You can download the fine-tuned model and run the inference step directly.
Run the following script to process the train data:
Note that an OpenAI or Qwen API key is required in prepare_data_train.sh. You should modify the API_KEY in this script.
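Rather than hard-coding the key into the script, one option is to read it from the environment. A minimal sketch, assuming an OPENAI_API_KEY environment variable; the helper name and variable name are illustrative, not part of the repo:

```python
import os

def get_api_key(env_var="OPENAI_API_KEY"):
    """Hypothetical helper: fetch the API key from the environment
    so it never lands in version control. The env var name is an
    assumption, not something prepare_data_train.sh defines."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Set {env_var} before running prepare_data_train.sh")
    return key
```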
After the execution, the folder structure should be like this:
Then the processed annotations could be found under ./processed_anno, and the train json is:

Block-Expansion
We use block expansion to fine-tune the VLMs; 8~16 blocks are suggested to balance performance and efficiency. We add 12 blocks to the original llava-1.6-34b. The llava-1.6-34b-12block model can be created by these steps:

1. Download the llava-1.6-34b model to ./models, and add blocks with this script: python block_expansion_llava_1_6.py
2. Copy the *.json and tokenizer.model files from ./models/llava-v1.6-34b to ./models/llava-v1.6-34b-12block;
3. Modify num_hidden_layers=72 (new_layer_nums = original_layer_nums + block_layer_nums) in the config.json of the llava-1.6-34b-12block model.
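The steps above follow the LLaMA-Pro recipe: each added block is a copy of an existing block whose residual-writing projections are zero-initialized, so the expanded model starts out functionally identical to the original (llava-v1.6-34b has 60 decoder layers, so 60 + 12 = 72 matches num_hidden_layers=72). A minimal sketch on a plain state dict; the key names follow the usual LLaMA layout, and the helper name and list-based "weights" are illustrative assumptions, not the repo's actual block_expansion_llava_1_6.py:

```python
# Hypothetical sketch of LLaMA-Pro-style block expansion on a state dict.
# Real weights are tensors; plain lists stand in for them here.

def expand_blocks(state_dict, num_layers, num_new):
    """Insert `num_new` copied blocks evenly among `num_layers` originals.

    Each inserted block duplicates the block it follows, with its
    self_attn.o_proj and mlp.down_proj weights zeroed so the new block
    initially acts as an identity mapping (the residual passes through).
    """
    assert num_layers % num_new == 0, "new blocks must divide evenly"
    group = num_layers // num_new  # copy one block after every `group` originals
    new_sd, new_idx = {}, 0
    for i in range(num_layers):
        layer = {k: v for k, v in state_dict.items()
                 if k.startswith(f"model.layers.{i}.")}
        # re-number the original block into its new position
        for k, v in layer.items():
            new_sd[k.replace(f"layers.{i}.", f"layers.{new_idx}.")] = v
        new_idx += 1
        if (i + 1) % group == 0:
            for k, v in layer.items():
                nk = k.replace(f"layers.{i}.", f"layers.{new_idx}.")
                # zero the projections that write into the residual stream
                if "o_proj" in nk or "down_proj" in nk:
                    v = [0.0 for _ in v]
                new_sd[nk] = v
            new_idx += 1
    # keep non-layer weights (embeddings, norms, lm_head) unchanged
    for k, v in state_dict.items():
        if not k.startswith("model.layers."):
            new_sd[k] = v
    return new_sd
```

For the 34B model this would be called with num_layers=60 and num_new=12, giving one copied block after every five originals.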
Train

We use 8xA100 GPUs for fine-tuning. The training process takes approximately 8 hours with this script:
The fine-tuned model can be downloaded here.
Inference
Firstly, you should check the parameters defined in ./scripts/inference.sh and ensure that all essential files and models exist. Now you can run inference on the WTS_TEST_SET:
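A quick pre-flight check of the paths can save a failed run. A hypothetical helper, since the exact parameter names inside inference.sh are not shown here:

```python
import os

def check_paths(paths):
    """Return the subset of `paths` that do not exist on disk.
    An empty result means inference.sh should find everything it needs."""
    return [p for p in paths if not os.path.exists(p)]
```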
Evaluation
We use the wts-dataset for evaluation.
Citation
If you find CityLLaVA useful for your research and applications, please cite using this BibTeX:

@misc{duan2024cityllava,
      title={CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario},
      author={Zhizhao Duan and Hao Cheng and Duo Xu and Xi Wu and Xiangxie Zhang and Xi Ye and Zhen Xie},
      year={2024},
      eprint={2405.03194},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://github.com/qingchunlizhi/AICITY2024_Track2_AliOpenTrek_CityLLaVA}
}
Acknowledgement

CityLLaVA is built with reference to the code of the following projects: LLaVA and LLaMA-Pro. Thanks for their awesome work!