MaskCLIP++: High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation
News
Introduction
This repo contains the code for our paper.
Abstract: Open-vocabulary image segmentation has been advanced through the synergy between mask generators and vision-language models like Contrastive Language-Image Pre-training (CLIP). Previous approaches focus on generating masks while aligning mask features with text embeddings during training. In this paper, we observe that relying on generated low-quality masks can weaken the alignment of vision and language in regional representations. This motivates us to present a new fine-tuning framework, named MaskCLIP++, which uses ground-truth masks instead of generated masks to enhance the mask classification capability of CLIP. Due to the limited diversity of image segmentation datasets with mask annotations, we propose incorporating a consistency alignment principle during fine-tuning, which alleviates categorical bias toward the fine-tuning dataset. After low-cost fine-tuning, MaskCLIP++ significantly improves the mask classification performance on multi-domain datasets. Combined with the mask generators of previous state-of-the-art mask-based open-vocabulary segmentation methods, we achieve performance improvements of +1.7, +2.3, +2.1, +3.1, and +0.3 mIoU on the A-847, PC-459, A-150, PC-59, and PAS-20 datasets, respectively.
Installation
See installation instructions.
Preparations
Datasets
See Preparing Datasets for MaskCLIP++.
Pretrained CLIP models
The pre-trained CLIP models can be downloaded automatically from Hugging Face.
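If the runtime environment cannot reach Hugging Face directly, the standard huggingface_hub environment variables can redirect the automatic download to a local cache or a mirror. A minimal sketch (the variable names are huggingface_hub's own; the mirror URL is a placeholder, not a real endpoint):

```shell
# Keep the download cache inside the project (writable, easy to clean up)
export HF_HOME=./hf_cache
mkdir -p "$HF_HOME"
# Optional: route downloads through a mirror (placeholder URL -- set your own)
# export HF_ENDPOINT=https://your-mirror.example.com
```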
Mask generators
All models can be automatically downloaded during runtime. If there are network issues in the runtime environment, you can expand the following table and download the models in the url column to the path location.

Except for the asterisk-marked (*) urls, all the other urls are from the original repositories.

The expected local paths are:

output/ckpts/mask2former/coco/pan/maskformer2_swin_tiny_bs16_50ep_final_9fd0ae.pkl
output/ckpts/mask2former/coco/pan/maskformer2_swin_large_IN21k_384_bs16_100ep_final_f07440.pkl
output/ckpts/fcclip/fcclip_coco-pan_clip-convnext-base.pth
output/ckpts/fcclip/fcclip_coco-pan_clip-convnext-large.pth
output/ckpts/maftp/maftp_b.pth
output/ckpts/maftp/maftp_l.pth
output/ckpts/maftp/maftp_l_pano.pth

MaskCLIP++ models
Our model can be combined with the mask generators above to achieve better open-vocabulary image segmentation performance. The following checkpoint should be manually downloaded to a local path.
Our best checkpoint: EVA02 CLIP-L-14, with both CLIP-V and CLIP-T fine-tuned, on COCO Stuff
(i) Sem. Seg
(ii) Panopt. Seg and Inst. Seg
Other models
Usage
Demo
Use the demo of MaskCLIP++.
Evaluation
Mask Classification Evaluation
source eval_mask_acc.sh
eval_mask_acc_xxx $config "\"\"" $ngpu $tag 0   # evaluate the original CLIP (no checkpoint)
eval_mask_acc_xxx $config $ckpt $ngpu $tag 1    # evaluate a fine-tuned checkpoint
# $ngpu is an integer representing the number of GPUs in use.
# $tag is the name of a run.
# More options can be found in eval_mask_acc.sh.

OVS Evaluation

source eval_all.sh
eval_ade150 $config $ckpt $ngpu $tag
# $ngpu is an integer representing the number of GPUs in use.
# $tag is the name of a run.
# Other options include: eval_ade847, eval_ctx459, eval_ctx59, eval_pc20
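For intuition, the quantity behind mask classification evaluation can be sketched in a few lines: average-pool dense image features inside each ground-truth mask, then score the pooled vector against class text embeddings by cosine similarity. This is an illustrative toy reconstruction with made-up shapes and names, not the repository's implementation:

```python
import numpy as np

def mask_pooled_logits(feat_map, masks, text_emb):
    """Score each mask against each class by cosine similarity.
    feat_map: (H, W, D) dense image features
    masks:    (N, H, W) boolean ground-truth masks
    text_emb: (C, D) L2-normalized class text embeddings
    Returns (N, C) logits."""
    logits = []
    for m in masks:
        v = feat_map[m].mean(axis=0)        # average-pool features inside the mask
        v = v / (np.linalg.norm(v) + 1e-8)  # L2-normalize the pooled vector
        logits.append(text_emb @ v)         # cosine similarity per class
    return np.stack(logits)

# Tiny example: 2 classes, features that match class prototypes exactly
text = np.eye(2, 4)                         # (C=2, D=4), rows are unit vectors
fm = np.zeros((2, 2, 4))
fm[0, :, 0] = 1.0                           # top row looks like class 0
fm[1, :, 1] = 1.0                           # bottom row looks like class 1
masks = np.array([[[True, True], [False, False]],
                  [[False, False], [True, True]]])
pred = mask_pooled_logits(fm, masks, text).argmax(axis=1)  # -> [0, 1]
```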
Fine-tuning
For base- and large-sized CLIPs, fine-tuning takes about 2-4 hours on 2x NVIDIA RTX 3090 (24 GB) GPUs.
Citing MaskCLIP++

@misc{zeng2025maskclippp,
  title={High-Quality Mask Tuning Matters for Open-Vocabulary Segmentation},
  author={Quan-Sheng Zeng and Yunheng Li and Daquan Zhou and Guanbin Li and Qibin Hou and Ming-Ming Cheng},
  year={2025},
  eprint={2412.11464},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2412.11464},
}
Acknowledgement
Thanks to the following open source code and models: