Generative Region-Language Pretraining for Open-Ended Object Detection
⭐ If GenerateU is helpful to your projects, please help star this repo. Thanks! 🤗
Highlight
GenerateU is accepted by CVPR 2024.
We introduce generative open-ended object detection, a more general and practical setting in which categorical information is not explicitly defined. This setting is especially meaningful for scenarios where users lack precise knowledge of object categories during inference.
Our GenerateU achieves results comparable to the open-vocabulary object detection method GLIP, even though category names are not seen by GenerateU during inference.
Results
Zero-shot domain transfer to LVIS
Visualizations
👨🏻‍🎨 Pseudo-label Examples
🎨 Zero-shot LVIS
Overview
Dependencies and Installation
Clone Repo
Create Conda Environment and Install Dependencies
Install the dependencies listed in requirements.txt.
Get Started
Prepare pretrained models
Download our pretrained models from here to the weights folder. For training, prepare the backbone weights Swin-Tiny and Swin-Large following the instructions in tools/convert-pretrained-swin-model-to-d2.py. The directory structure will be arranged as:
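The backbone conversion step can be sketched roughly as follows. This is a simplified illustration, not the actual tools/convert-pretrained-swin-model-to-d2.py (which loads a torch .pth checkpoint); the function name and checkpoint layout here are assumptions based on common detectron2-style conversion scripts:

```python
import pickle

def convert_to_d2(state_dict, out_path):
    """Wrap a backbone state dict in a detectron2-style pickled checkpoint.

    NOTE: simplified sketch; the real script loads a torch .pth file and
    may remap parameter names before pickling.
    """
    ckpt = {"model": state_dict, "matching_heuristics": True}
    with open(out_path, "wb") as f:
        pickle.dump(ckpt, f)

# Example with a dummy state dict standing in for the Swin weights:
convert_to_d2({"backbone.layer0.weight": [0.0, 1.0]}, "swin_tiny_d2.pkl")
```

The resulting .pkl can then be pointed to from the training config in place of the raw .pth file.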
Dataset preparation
VG Dataset
LVIS Dataset
(Optional) GrIT-20M Dataset
The dataset structure should look like:
Training
By default, we train GenerateU using 16 A100 GPUs. You can also train on a single node, but this might prevent you from reproducing the results presented in the paper.
Single-Node Training
When pretraining with VG, a single node is enough. On a single node with 8 GPUs, run
Multiple-Node Training
<MASTER_ADDRESS> should be the IP address of node 0. <PORT> should be the same across all nodes. If <PORT> is not specified, the program will generate a random number as <PORT>.
Evaluation
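The random-port fallback described above can be approximated with the sketch below (not the repo's actual implementation; binding to port 0 lets the OS hand back a currently unused port, which is a common way to pick one):

```python
import socket

def pick_free_port():
    """Ask the OS for a currently free TCP port (sketch of the <PORT> fallback)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 -> the OS assigns an unused port
        return s.getsockname()[1]

port = pick_free_port()
```

Whatever value is used, it must be passed identically to every node so they can rendezvous with node 0.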
To evaluate with a trained or pretrained model, run
Citation
If you find our repo useful for your research, please consider citing our paper:
@inproceedings{lin2024generateu,
  title={Generative Region-Language Pretraining for Open-Ended Object Detection},
  author={Lin, Chuang and Jiang, Yi and Qu, Lizhen and Yuan, Zehuan and Cai, Jianfei},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}
Contact
If you have any questions, please feel free to reach out to me at chuang.lin@monash.edu.
Acknowledgement
This code is based on UNINEXT. Some code is borrowed from FlanT5. Thanks for their awesome work.
Special thanks to Bin Yan and Junfeng Wu for their valuable contributions.