@article{wang2021self,
title={Self-Supervised Learning by Estimating Twin Class Distributions},
author={Wang, Feng and Kong, Tao and Zhang, Rufeng and Liu, Huaping and Li, Hang},
journal={arXiv preprint arXiv:2110.07402},
year={2021}
}
TWIST is a novel self-supervised representation learning method that classifies large-scale unlabeled datasets in an end-to-end way. We employ a siamese network terminated by a softmax operation to produce twin class distributions of two augmented views of an image. Without supervision, we enforce the class distributions of different augmentations to be consistent. In the meantime, we regularize the class distributions to make them sharp and diverse. TWIST naturally avoids trivial solutions without specific designs such as an asymmetric network, a stop-gradient operation, or a momentum encoder.
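The three ingredients above (consistency between the twin distributions, sharpness of each per-sample distribution, diversity of the batch-mean distribution) can be sketched in a few lines of plain Python. This is a minimal illustrative sketch, not the repo's actual loss code: the function names and the equal weighting of the terms are our assumptions, and the paper's full objective includes weighting factors and batch-level details omitted here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0 and qi > 0)

def twist_loss(logits1_batch, logits2_batch):
    """Sketch of a TWIST-style objective (illustrative, unweighted terms):
    consistency + sharpness - diversity, averaged over the batch."""
    p1 = [softmax(l) for l in logits1_batch]
    p2 = [softmax(l) for l in logits2_batch]
    n, k = len(p1), len(p1[0])
    # consistency: symmetric cross-entropy between the twin distributions
    consistency = sum(0.5 * (cross_entropy(a, b) + cross_entropy(b, a))
                      for a, b in zip(p1, p2)) / n
    # sharpness: each per-sample distribution should have low entropy
    sharpness = sum(entropy(a) + entropy(b) for a, b in zip(p1, p2)) / (2 * n)
    # diversity: the batch-mean distribution should have high entropy,
    # which prevents the trivial solution of assigning everything one class
    mean_p = [sum(p[j] for p in p1 + p2) / (2 * n) for j in range(k)]
    diversity = entropy(mean_p)
    return consistency + sharpness - diversity
```

Note how a collapsed assignment (all samples mapped to one class) loses the diversity bonus, so sharp, consistent, and class-balanced assignments score strictly better.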
Updates
12/4/2022: The performance of ViT-S (DeiT-S) and ViT-B is improved (+0.7 and +1.1, respectively) by changing hyper-parameters: the batch size is reduced from 2048 to 1024, and the drop-path rate for ViT-B is changed from 0.0 to 0.1.
TWIST: Self-Supervised Learning by Estimating Twin Class Distributions
Code and pretrained models for TWIST:
Updates
Models and Results
Main Models for Representation Learning
Model for unsupervised classification
Top-3 predictions for unsupervised classification
Semi-Supervised Results
Detection Results
Single-node Training
ResNet-50 (requires 8 GPUs, Top-1 Linear 72.6%)
Multi-node Training
ResNet-50 (requires 16 GPUs split over 2 nodes for multi-crop training, Top-1 Linear 75.5%)
ResNet-50w2 (requires 32 GPUs split over 4 nodes for multi-crop training, Top-1 Linear 77.7%)
DeiT-S (requires 16 GPUs split over 2 nodes for multi-crop training, Top-1 Linear 75.6%)
ViT-B (requires 32 GPUs split over 4 nodes for multi-crop training, Top-1 Linear 77.3%)
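When scaling a run across nodes, a quick sanity check is that the per-GPU batch multiplies out to the intended global batch size (e.g. the 1024 mentioned in the update note). A trivial hypothetical helper; the per-GPU numbers below are examples, not the repo's defaults:

```python
def global_batch_size(per_gpu_batch, gpus_per_node, num_nodes):
    """Effective batch size seen by the optimizer across all workers."""
    return per_gpu_batch * gpus_per_node * num_nodes
```

For instance, 64 images per GPU on 2 nodes of 8 GPUs gives a global batch of 1024, as does 32 per GPU on 4 nodes of 8 GPUs.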
Linear Classification
For ResNet-50
For DeiT-S
For ViT-B
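Linear classification ("linear probing") trains a classifier on frozen pretrained features and measures its top-1 accuracy. The repo's evaluation scripts do this on ImageNet features; below is only a toy pure-Python sketch of the idea, with a hypothetical `linear_probe` helper: softmax regression trained by plain gradient descent on feature vectors that stand in for frozen backbone outputs.

```python
import math

def linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a linear softmax classifier on frozen features via SGD.
    `features` is a list of feature vectors; `labels` their class ids."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            # forward pass: logits -> stable softmax probabilities
            logits = [sum(w * xi for w, xi in zip(W[c], x)) + b[c]
                      for c in range(num_classes)]
            m = max(logits)
            exps = [math.exp(l - m) for l in logits]
            z = sum(exps)
            probs = [e / z for e in exps]
            # backward pass: cross-entropy gradient is (p - onehot(y))
            for c in range(num_classes):
                g = probs[c] - (1.0 if c == y else 0.0)
                for j in range(dim):
                    W[c][j] -= lr * g * x[j]
                b[c] -= lr * g
    return W, b

def predict(W, b, x):
    """Return the argmax class for one feature vector."""
    logits = [sum(w * xi for w, xi in zip(W[c], x)) + b[c]
              for c in range(len(W))]
    return max(range(len(logits)), key=lambda c: logits[c])
```

The key point is that only `W` and `b` are trained; the backbone that produced the features never receives gradients.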
Semi-supervised Learning
Command for training semi-supervised classification
1 Percent (61.5%)
10 Percent (71.7%)
100 Percent (78.4%)
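The 1% and 10% settings fine-tune on small labeled subsets of ImageNet. The official benchmarks use fixed published image lists, not random draws; the helper below is a hypothetical illustration of the per-class subsampling idea only, for readers who want to build an ad-hoc split on another dataset.

```python
import random
from collections import defaultdict

def subsample_per_class(labels, fraction, seed=0):
    """Keep roughly `fraction` of the indices of each class.
    Hypothetical helper: the official ImageNet 1%/10% benchmarks
    use fixed, published image lists instead of a random draw."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    keep = []
    for idxs in by_class.values():
        k = max(1, round(fraction * len(idxs)))  # at least one per class
        keep.extend(rng.sample(idxs, k))
    return sorted(keep)
```

Sampling per class (rather than uniformly over the dataset) keeps the label distribution of the subset close to the full set's.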
Detection
Instruction
Install detectron2.
Convert a pre-trained model to detectron2's format:
Put the dataset under the "detection/datasets" directory, following the directory structure required by detectron2.
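The conversion step renames torchvision-style ResNet checkpoint keys to detectron2's naming scheme and saves the result as a pickle that detectron2 can load. Below is an illustrative sketch of that renaming, modeled loosely on MoCo's public `convert-pretrain-to-detectron2.py`; the exact mapping rules and the `fc.` filtering are assumptions here, so check them against your own checkpoint before relying on this.

```python
import pickle

def torchvision_key_to_d2(k):
    """Rename one torchvision-style ResNet key to detectron2's convention.
    Illustrative mapping only; verify against your checkpoint."""
    k = k.replace("module.", "")                      # strip DataParallel prefix
    for i in range(1, 5):                             # layer1..layer4 -> res2..res5
        k = k.replace(f"layer{i}.", f"res{i + 1}.")
    for i in range(1, 4):                             # BN params follow their conv
        k = k.replace(f"bn{i}.", f"conv{i}.norm.")
    k = k.replace("downsample.0.", "shortcut.")       # projection conv
    k = k.replace("downsample.1.", "shortcut.norm.")  # its batch norm
    if k.startswith("conv1."):                        # stem conv and its norm
        k = "stem." + k
    return k

def convert_checkpoint(state_dict, out_path):
    """Write a detectron2-loadable pickle with renamed backbone weights,
    dropping the classification head (assumed to live under 'fc.')."""
    model = {torchvision_key_to_d2(k): v for k, v in state_dict.items()
             if not k.startswith("fc.")}
    with open(out_path, "wb") as f:
        pickle.dump({"model": model, "matching_heuristics": True}, f)
```

With `matching_heuristics` set, detectron2 tolerates small remaining naming differences when loading the weights.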
Training: VOC
COCO