Image BERT Pre-Training with iBOT
Official PyTorch implementation and pre-trained models for the paper iBOT: Image BERT Pre-Training with Online Tokenizer.
[arXiv] [Colab] [BibTex]
iBOT is a novel self-supervised pre-training framework that performs masked image modeling with self-distillation. An iBOT pre-trained model learns local semantic features, which helps it transfer well to downstream tasks at both a global and a local scale. For example, iBOT achieves strong performance on COCO object detection (51.2 box AP and 44.2 mask AP) and ADE20K semantic segmentation (50.0 mIoU) with a vanilla ViT-B/16. iBOT can also extract semantically meaningful local parts, such as a dog's ear.
News 🎉
January 2022 - The paper is accepted by ICLR 2022.
Update - ViT-L/16 with ImageNet-1K pre-training achieves 81.0% linear probing accuracy. ViT-L/16 with ImageNet-22K pre-training achieves 87.8% fine-tuning accuracy at 512x resolution.
Update - Random masking with a relatively larger prediction ratio performs slightly better than block-wise masking. For example, ViT-B/16 achieves 84.1% fine-tuning accuracy and a 51.5 box AP in object detection.
December 2021 - Release the code and pre-trained models.
Installation
See the installation instructions for details.
One-Line Command by Using run.sh
We provide run.sh, with which you can complete the pre-training + fine-tuning experiment cycle with a one-line command.
Arguments
TYPE is named by the rule dataset_task. For example, pre-training on ImageNet-1K has the TYPE imagenet_pretrain, and linear probing evaluation on ImageNet-1K has the TYPE imagenet_linear. Multiple tasks can be appended in one command.
JOB_NAME is a customized job name used to distinguish between different groups of experiments.
ARCH is the architecture of the pre-trained model.
KEY chooses which pre-trained model to evaluate and can be set to either teacher (generally better) or student.
GPUS is the number of GPUs needed on each node and will be clamped by MAX_GPUS (default: 8).
Other optional arguments can be appended directly after these required ones, e.g., --lr 0.001.
For example, the following commands will automatically evaluate the models on the k-NN and linear probing benchmarks after pre-training, with the student and teacher models distributed across 2 nodes:
TOTAL_NODES=2 NODE_ID=0 ./run.sh imagenet_pretrain+imagenet_knn+imagenet_linear vit_small student,teacher 16  # on the first node
TOTAL_NODES=2 NODE_ID=1 ./run.sh imagenet_pretrain+imagenet_knn+imagenet_linear vit_small student,teacher 16  # on the second node
Training
For a glimpse of the full documentation of iBOT pre-training, please run:
python main_ibot.py --help
iBOT Pre-Training with ViTs
To start iBOT pre-training with a Vision Transformer (ViT), simply run the following commands. JOB_NAME is a customized argument used to distinguish different experiments, and checkpoints will automatically be saved into separate folders.
The exact arguments to reproduce the models presented in our paper can be found in the args column of the pre-trained models. We also provide the logs for pre-training to help reproducibility.
For example, run iBOT with the ViT-S/16 network on two nodes with 8 GPUs each for 800 epochs with the following command. The resulting checkpoint should reach 75.2% k-NN accuracy, 77.9% linear probing accuracy, and 82.3% fine-tuning accuracy.
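The exact command is not reproduced here; the following is a minimal sketch based on the run.sh convention documented above (the --epochs flag is an assumption, not a verbatim argument from the paper's args column):
TOTAL_NODES=2 NODE_ID=0 ./run.sh imagenet_pretrain vit_small teacher 8 --epochs 800  # on the first node
TOTAL_NODES=2 NODE_ID=1 ./run.sh imagenet_pretrain vit_small teacher 8 --epochs 800  # on the second node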
iBOT Pre-Training with Swins
This code also works for training iBOT with the Swin Transformer (Swin). In the paper, we only conduct experiments on Swin-T with different window sizes.
For example, run iBOT with the Swin-T/14 network on five nodes with 8 GPUs each for 300 epochs with the following command. The resulting checkpoint should reach 76.2% k-NN accuracy and 79.3% linear probing accuracy.
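As with the ViT example above, a minimal sketch following the same run.sh convention (the swin_tiny ARCH value and the --epochs flag are assumptions):
TOTAL_NODES=5 NODE_ID=0 ./run.sh imagenet_pretrain swin_tiny teacher 8 --epochs 300  # repeat with NODE_ID=1..4 on the other nodes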
Pre-Trained Models
You can choose to download either only the weights of the pre-trained backbone used for downstream tasks, or the full ckpt, which contains the backbone and projection head weights of both the student and teacher networks. For the backbone, s denotes that the student network is selected, while t denotes that the teacher network is selected. PS denotes prediction shape.
We also provide ViT-{B,L}/16 models pre-trained on the ImageNet-22K dataset.
To extract the backbone from the full checkpoint yourself, run the following command with KEY set to either student or teacher.
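The repository's own extraction command is not reproduced here; below is a minimal Python sketch of the idea, assuming the full ckpt stores separate "student" and "teacher" entries whose parameter names carry module./backbone. prefixes and include projection-head weights (file names and key layout are assumptions):
import torch

# Load the full checkpoint, which is assumed to contain both networks.
ckpt = torch.load("full_ckpt.pth", map_location="cpu")
key = "teacher"  # or "student"

# Strip DDP/backbone prefixes and drop the projection-head weights,
# keeping only the backbone parameters for downstream use.
backbone = {}
for name, param in ckpt[key].items():
    name = name.replace("module.", "").replace("backbone.", "")
    if not name.startswith("head."):
        backbone[name] = param

torch.save(backbone, f"{key}_backbone.pth")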
Downstream Evaluation
See Evaluating iBOT on Downstream Tasks for details.
Property Analysis
See Analyzing iBOT's Properties for the robustness test and for visualizing self-attention maps:
or for extracting sparse correspondence pairs between two images:
We also provide a Colab page where you can play around with the iBOT pre-trained models.
Extracting Semantic Patterns
We extract the top-k numbered local classes based on patch tokens, together with their corresponding patches and contexts, by running the following command. We identify very diverse behaviors, such as shared low-level textures and high-level semantics.
The script also supports extracting the pattern layout on the [CLS] token, which effectively performs clustering or unsupervised classification. This property is not induced by the MIM objective, since we also observe this feature in DINO.
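To make the idea concrete, here is a conceptual Python sketch (not the repository's script) of clustering patch tokens to surface shared local patterns; the model interface and feature extraction details are assumptions:
import torch
from sklearn.cluster import KMeans

# Conceptual sketch: cluster ViT patch tokens into k "local classes" and
# rank the patches closest to each centroid. `model` is assumed to be any
# ViT whose forward_features returns [B, 1+N, D] tokens ([CLS] first).
@torch.no_grad()
def top_k_local_classes(model, images, k=10, top=5):
    tokens = model.forward_features(images)                    # [B, 1+N, D]
    patches = tokens[:, 1:, :].reshape(-1, tokens.shape[-1])   # drop [CLS]
    patches = torch.nn.functional.normalize(patches, dim=-1).numpy()
    km = KMeans(n_clusters=k, n_init=10).fit(patches)
    sims = patches @ km.cluster_centers_.T    # patch-to-centroid similarity
    # indices of the `top` most central patches for each local class
    return {c: sims[:, c].argsort()[::-1][:top] for c in range(k)}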
Acknowledgement
This repository is built using the DINO repository and the BEiT repository.
License
This repository is released under the Apache 2.0 license as found in the LICENSE file.
Citing iBOT
If you find this repository useful, please consider giving it a star and a citation:
@article{zhou2021ibot,
  title={iBOT: Image BERT Pre-Training with Online Tokenizer},
  author={Zhou, Jinghao and Wei, Chen and Wang, Huiyu and Shen, Wei and Xie, Cihang and Yuille, Alan and Kong, Tao},
  journal={International Conference on Learning Representations (ICLR)},
  year={2022}
}