OmniScient-Model (ECCV 2024)
This repo contains the code for our paper Towards Open-Ended Visual Recognition with Large Language Model.
We propose the OmniScient Model (OSM) for open-ended visual recognition: identifying diverse real-world entities without the constraint of a user-defined vocabulary. Unlike closed-vocabulary and open-vocabulary recognition frameworks, OSM operates without any predefined vocabulary.
Features
A simple strategy to adapt multi-modal LLMs to high-resolution images at 1120x1120, leading to more precise recognition.
A brand-new task, open-ended visual recognition, which predicts beyond the limits of a given vocabulary.
A strong model that recognizes novel concepts in the real world, e.g., semantic parts, even when trained only on object-level data.
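For intuition on the 1120x1120 input size: 1120 is exactly 5 x 224, so a high-resolution image divides evenly into a grid of standard 224x224 ViT crops. Whether OSM adapts its visual encoder exactly this way is an assumption; the sketch below only shows the tiling arithmetic (tile_grid is a hypothetical helper, not from the OSM code):

```python
# Tiling a high-resolution image into ViT-sized crops (illustrative only;
# whether OSM's adaptation works exactly like this is an assumption).

def tile_grid(image_size, patch_size):
    """Return the top-left corners of non-overlapping tiles covering the image."""
    assert image_size % patch_size == 0, "image must divide evenly into tiles"
    n = image_size // patch_size
    return [(r * patch_size, c * patch_size) for r in range(n) for c in range(n)]

corners = tile_grid(1120, 224)
print(len(corners))  # 25 tiles: a 5x5 grid of 224x224 crops
```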
Installation
Getting Started
We provide examples applying OSM on top of an off-the-shelf segmenter (e.g., SAM): a segment-and-recognize-anything mode in demo_with_sam.py, and an interactive mode in interactive_demo.ipynb.
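The segment-and-recognize-anything mode boils down to: get class-agnostic masks from the segmenter, then ask OSM for a free-form name for each masked region. The sketch below mimics that pipeline shape with stub functions (segment_anything and recognize are placeholders, not the real SAM/OSM APIs):

```python
# Shape of the segment-then-recognize pipeline, with stand-in functions.

def segment_anything(image):
    """Stand-in for SAM's automatic mask generator: returns binary masks."""
    h, w = len(image), len(image[0])
    # Pretend the segmenter split the image into a left half and a right half.
    left = [[x < w // 2 for x in range(w)] for _ in range(h)]
    right = [[x >= w // 2 for x in range(w)] for _ in range(h)]
    return [left, right]

def recognize(image, mask):
    """Stand-in for OSM: returns a free-form label for the masked region,
    with no fixed vocabulary to choose from."""
    area = sum(sum(row) for row in mask)
    return f"region of {area} pixels"

def segment_and_recognize(image):
    return [recognize(image, m) for m in segment_anything(image)]

image = [[0] * 4 for _ in range(2)]  # a dummy 2x4 "image"
print(segment_and_recognize(image))  # one free-form label per mask
```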
Data Preparation
Please refer to Preparing Datasets for OSM.
Training
After finishing the data preparation, you can use the following commands to train the OSM model with 8 A100 GPUs in 2 days. Adjust gradient accumulation, FSDP, and gradient checkpointing settings to fit your computational resources.
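Gradient accumulation, mentioned above as a knob for fitting the run on smaller hardware, sums gradients over several micro-batches and steps the optimizer once, emulating a larger batch. A framework-free sketch of the bookkeeping (train_with_accumulation is illustrative, not from the OSM code):

```python
def train_with_accumulation(micro_batch_grads, accum_steps):
    """Sum gradients over `accum_steps` micro-batches, then 'step' once.
    Returns the averaged gradient applied at each optimizer step."""
    steps, running = [], 0.0
    for i, g in enumerate(micro_batch_grads, start=1):
        running += g
        if i % accum_steps == 0:
            steps.append(running / accum_steps)  # average, as one big batch would
            running = 0.0
    return steps

# 8 micro-batches with accumulation of 4 -> only 2 optimizer steps
print(train_with_accumulation([1, 2, 3, 4, 5, 6, 7, 8], 4))  # [2.5, 6.5]
```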
To train OSM-final w/o part segmentation or detection data:
To train OSM-final w/ part segmentation and detection data:
Testing
Update the data path in test/generate_pred.py, then run the following script for testing:

GPU_COUNT=8  # Set your GPU count here
CKPT_PATH="./osm_final.pt"  # Set your checkpoint path here
RESULT_SAVE_PATH="osm_final"  # Set your result save path here
for (( i=0; i<GPU_COUNT; i++ )); do
  CUDA_VISIBLE_DEVICES=$i python3 test/generate_pred.py $i $GPU_COUNT $CKPT_PATH $RESULT_SAVE_PATH &
done
wait  # Wait for all background jobs to finish
python3 test/evaluate_pred.py $RESULT_SAVE_PATH $GPU_COUNT
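generate_pred.py receives a worker index and the total GPU count; a common way to split a test set across such workers (an assumption about its internals, not confirmed by the script itself) is strided slicing, which keeps the shards disjoint and near-equal in size:

```python
def shard(items, worker_id, num_workers):
    """Give worker `worker_id` every `num_workers`-th item, starting at its index."""
    return items[worker_id::num_workers]

items = list(range(10))
shards = [shard(items, i, 4) for i in range(4)]
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
# Every item lands in exactly one shard:
assert sorted(x for s in shards for x in s) == items
```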
Model Zoo
Visual Results
Citing OSM
If you use OSM in your research, please use the following BibTeX entry.

@inproceedings{yu2023towards,
  title={Towards Open-Ended Visual Recognition with Large Language Model},
  author={Qihang Yu and Xiaohui Shen and Liang-Chieh Chen},
  booktitle={ECCV},
  year={2024}
}
Acknowledgement
Segment Anything
OpenFlamingo