ImageRef-VL
The code of our work “ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models”.
Overview
Vision-Language Models (VLMs) are increasingly used in Retrieval-Augmented Generation (RAG) systems for multimodal conversations. While these models can reference textual sources, they often fail to incorporate contextually relevant images into their responses. This repository addresses this gap by introducing ImageRef-VL, the ability to reference images based on conversation context, along with a comprehensive evaluation framework featuring a curated dataset and metrics. We include implementation of ImageRef-VL in this repository.
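As a minimal illustration of what contextual image referencing produces at inference time, the sketch below replaces placeholder markers in a model response with markdown image links. The `<image_k>` marker format, the `render_image_refs` function, and the image map are hypothetical, chosen for illustration; they are not the repository's actual interface.

```python
import re

def render_image_refs(response: str, images: dict[int, str]) -> str:
    """Replace hypothetical <image_k> markers with markdown image links.

    Markers without a matching entry in `images` are left untouched.
    """
    def repl(match: re.Match) -> str:
        k = int(match.group(1))
        url = images.get(k)
        return f"![image {k}]({url})" if url else match.group(0)

    return re.sub(r"<image_(\d+)>", repl, response)
```

For example, `render_image_refs("See <image_1>.", {1: "a.png"})` yields `"See ![image 1](a.png)."`.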
Requirements
Install the required Python packages.
We recommend training the models on nodes equipped with 16 NVIDIA A100 GPUs.
How to use
Construct training dataset
1. Put your instruction dataset (with crawled documents) under `data/path`.
2. Generate textual responses.
3. Generate image captions:

```bash
python inference/caption/run.py \
    --root_model_path root/model/path \
    --model_name model_name \
    --method_name caption-in-context \
    --data_dir data/dir \
    --output_dir output/dir \
    --event_file event/file/path \
    --example_file example/file/path \
    --output_caption_file output/caption/file/path \
    --label_file label/file/path \
    --batch_news_size 100 \
    --num_ic_examples 2 \
    --load_type [api|vllm|hg]
```
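The "caption-in-context" method above conditions each caption on surrounding document text plus a few in-context examples (the `--num_ic_examples` flag). A rough sketch of how such a prompt could be assembled is shown below; the `build_caption_prompt` function and the `Context:`/`Caption:` template are assumptions for illustration, not the script's actual prompt format.

```python
def build_caption_prompt(
    context: str,
    examples: list[tuple[str, str]],
    num_ic_examples: int = 2,
) -> str:
    """Assemble a captioning prompt from document context and in-context
    (context, caption) example pairs. Template format is hypothetical."""
    parts = []
    # Prepend up to num_ic_examples worked examples.
    for ex_context, ex_caption in examples[:num_ic_examples]:
        parts.append(f"Context: {ex_context}\nCaption: {ex_caption}")
    # End with the target context, leaving the caption for the model.
    parts.append(f"Context: {context}\nCaption:")
    return "\n\n".join(parts)
```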
Train
Run the following commands to train the ImageRef-VL models.
Evaluation
Text Evaluation
Generation
LLM-as-judge:

```bash
# Generate LLM-as-judge evaluation
python test/imageref/llm_as_judge.py \
    --eval_model eval_llm_name \
    --annotation_file test/annotation/file/path \
    --response_file response/file/path \
    --wcap \
    --output_file evaluation/file/path

# Calculate score
python test/imageref/judge_rslt.py --rslt_file evaluation/file/path
```
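To show what the score-calculation step conceptually does, here is a small sketch that averages judge scores over a JSONL results file. The `score` field and the JSONL layout are assumed for illustration; the actual schema written by `llm_as_judge.py` and read by `judge_rslt.py` may differ.

```python
import json

def average_judge_score(jsonl_lines: list[str]) -> float:
    """Average the 'score' field over JSONL judge results (assumed schema)."""
    scores = [json.loads(line)["score"] for line in jsonl_lines if line.strip()]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, two results with scores 4 and 2 average to 3.0.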
License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the InternVL project.