Contributions
🔥 We identify the foreground-bias issue in existing VLMs and address it with region-text alignment that incorporates explicit semantic structure through category guidance.
🔥 We propose DenseVLM, a region-language alignment framework that uses a strong VLM to retrieve categories for unlabeled regions and decouples foreground and background features to reduce bias.
🔥 Extensive experiments on dense prediction benchmarks show that our DenseVLM outperforms previous methods and exhibits promising scalability.
Overview
DenseVLM is an annotation-free fine-tuning framework for open-vocabulary dense prediction. It retrieves region-level semantics from a powerful vision-language model and decouples foreground and background features, yielding unbiased region-language alignment and improved open-vocabulary dense prediction.
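The core idea above can be sketched in a few lines: pooled region features are matched against category text embeddings retrieved from a strong VLM, and foreground and background regions are then handled as separate groups. A minimal NumPy sketch, assuming precomputed region features and text embeddings; all names here are illustrative, not the actual DenseVLM API:

```python
import numpy as np

def retrieve_region_categories(region_feats, text_embeds, fg_mask):
    """Assign each unlabeled region a pseudo category by cosine similarity
    to category text embeddings, then split regions into foreground and
    background groups for decoupled alignment.

    region_feats: (R, D) pooled region features from the image encoder
    text_embeds:  (C, D) category text embeddings from the text encoder
    fg_mask:      (R,) boolean, True for foreground regions
    (Illustrative sketch only, not the actual DenseVLM implementation.)
    """
    # L2-normalize so the dot product equals cosine similarity
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sim = r @ t.T                       # (R, C) region-category similarities
    pseudo_labels = sim.argmax(axis=1)  # retrieved category per region
    # Decouple foreground and background: each group is aligned only
    # against its own retrieved categories to reduce foreground bias
    fg_pairs = [(i, pseudo_labels[i]) for i in range(len(r)) if fg_mask[i]]
    bg_pairs = [(i, pseudo_labels[i]) for i in range(len(r)) if not fg_mask[i]]
    return pseudo_labels, fg_pairs, bg_pairs
```

In the paper's framing, the retrieved (region, category) pairs would then supervise region-text alignment with separate objectives for the two groups; this sketch only covers the retrieval-and-split step.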
TODO
Release the training and inference code of DenseVLM.
Support training and inference for RegionCLIP and CLIPSelf.
Release the code to integrate DenseVLM into CAT-Seg.
Release the code to integrate DenseVLM into F-ViT.
Quick Start
🚀 Linux system with CUDA 11.8
🚀 At least one RTX 3090 GPU (training defaults to 4 GPUs, ~23 min/epoch)
1. Create Conda Environment
The provided environment is recommended for reproducing our results; similar configurations may also work.
If you find this project useful, please consider citing:
@article{li2024densevlm,
title={Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction},
author={Li, Yunheng and Li, Yuxuan and Zeng, Quansheng and Wang, Wenhai and Hou, Qibin and Cheng, Ming-Ming},
journal={arXiv preprint arXiv:2412.06244},
year={2024}
}
@InProceedings{li2024cascadeclip,
title={Cascade-{CLIP}: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation},
author={Li, Yunheng and Li, Zhong-Yu and Zeng, Quan-Sheng and Hou, Qibin and Cheng, Ming-Ming},
booktitle={Proceedings of the 41st International Conference on Machine Learning},
pages={28243--28258},
year={2024},
volume={235},
month={21--27 Jul},
publisher={PMLR}
}
Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction
Yunheng Li · Yuxuan Li · Quansheng Zeng · Wenhai Wang · Qibin Hou† · Ming-Ming Cheng
Accepted by ICCV 2025!
[Paper] [Github] [Pretrained models]
2. Data Preparation
The main experiments use images from the COCO and ADE20k datasets. Please prepare the datasets and organize them as follows:
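Before launching training, a small script can confirm the datasets are in place. A hedged sketch; the paths below are illustrative placeholders, not the repository's actual layout, so substitute the directory tree shown in the repository:

```python
from pathlib import Path

# Illustrative placeholder paths; replace with the layout from the repository.
EXPECTED = [
    "data/coco/train2017",
    "data/coco/annotations",
    "data/ade20k/images",
]

def check_datasets(root="."):
    """Return the expected dataset directories that are missing under root."""
    return [p for p in EXPECTED if not (Path(root) / p).is_dir()]
```

Running `check_datasets()` from the project root returns an empty list when everything is in place, and the missing paths otherwise.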
3. Checkpoints
Please download the pretrained weights from Hugging Face and organize them as follows:
If you already have a fine-tuned CLIP, you can use it directly. For example:
4. Training and Testing
To fine-tune the CLIP model with DenseVLM, run:
To evaluate the CLIP model fine-tuned with DenseVLM, run:
License
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license, for non-commercial use only. Any commercial use requires formal permission first.