DQ generates small condensed datasets that can train unseen network architectures with no loss in performance, at state-of-the-art compression ratios.
We support both vision and language dataset compression:
Vision tasks: with 60% of the ImageNet data, models can be trained with no performance drop on classification, semantic segmentation, and object detection.
Language tasks: with 20% of Alpaca's instruction-tuning data, models can be trained with a negligible performance drop on BBH, DROP, MMLU, and HumanEval.
TODO List
ImageNet selected indices
Getting Started
Download the repo:
git clone https://github.com/magic-research/Dataset_Quantization.git
cd Dataset_Quantization
Dataset Quantization is conducted in the following steps:
Dataset bin generation. First, we iteratively select non-overlapping dataset bins according to the submodular function.
Bin sampling. Then, we uniformly sample a certain portion (the required data keep ratio) from each bin to form the final compact set.
Pixel quantization and reconstruction. We employ a GradCAM module to select informative image patches; storing only these patches further reduces the required storage. An MAE model is adopted for image reconstruction. For simplicity, here we directly conduct the reconstruction when evaluating our full method.
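The first two steps can be sketched as follows. This is a minimal illustration, not the repo's implementation: it assumes a precomputed pairwise-similarity matrix, uses a facility-location submodular function as a stand-in for the paper's gain function, and omits the pixel-quantization step.

```python
import random

def marginal_gain(candidate, selected, sims):
    # Facility-location gain: total increase, over all points, of the
    # best similarity to the selected set when `candidate` is added.
    gain = 0.0
    for row in sims:
        best = max((row[s] for s in selected), default=0.0)
        gain += max(row[candidate] - best, 0.0)
    return gain

def quantize(sims, num_bins, keep_ratio, seed=0):
    """Step 1: split indices into non-overlapping bins by greedy
    submodular maximization. Step 2: uniformly sample `keep_ratio`
    of each bin to form the compact set."""
    n = len(sims)
    remaining = set(range(n))
    bin_size = n // num_bins
    bins = []
    for _ in range(num_bins):
        selected = []
        while len(selected) < bin_size and remaining:
            best = max(remaining,
                       key=lambda c: marginal_gain(c, selected, sims))
            selected.append(best)
            remaining.discard(best)
        bins.append(selected)
    rng = random.Random(seed)
    kept = sorted(i for b in bins
                  for i in rng.sample(b, max(1, int(len(b) * keep_ratio))))
    return bins, kept
```

Because the bins are selected without replacement, the union of the sampled subsets never contains duplicates.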
We use timm for evaluating the quantized ImageNet data.
For more instructions you can refer to the README inside the pytorch_image_models folder or the official timm repo.
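The selected indices define the training subset. As a sketch (assuming the index files store one integer per line, which may not match the repo's actual format), loading and applying them could look like:

```python
def load_indices(path):
    # Assumed file format: one integer index per line.
    with open(path) as f:
        return [int(line) for line in f if line.strip()]

def take_subset(dataset, indices):
    # Works with any indexable dataset (list, torch-style Dataset, ...).
    return [dataset[i] for i in indices]
```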
Here we provide the code of DQ for compressing the instruction fine-tuning dataset Alpaca, which consists of 52K instructions. To compress the dataset, we first extract an embedding for each instruction-response pair with the OpenAI Embedding API, and then use DQ to sample a fraction of the dataset.
Embedding Extraction
The extracted embeddings can be downloaded from this link. Generating them costs less than $1.
Optionally, you can generate the embeddings yourself with your OpenAI API key; the --index and --nums arguments can be used for parallelization.
Then, you can merge the embeddings with the following command:
python alpaca_embed.py --merge
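If each parallel worker writes its own shard, the merge step can be sketched as below. The shard filename pattern and the {"index", "embedding"} record layout are assumptions for illustration, not the script's actual format:

```python
import glob
import json

def merge_shards(pattern, out_path):
    # Hypothetical shard layout: each worker dumps a JSON list of
    # {"index": ..., "embedding": [...]} records.
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            records.extend(json.load(f))
    records.sort(key=lambda r: r["index"])  # restore dataset order
    with open(out_path, "w") as f:
        json.dump(records, f)
    return len(records)
```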
DQ Sampling
To generate the sampled dataset, you can run the following command:
python alpaca_sample.py --ratio 0.1 --k 10
For your reference, we provide some sampled results in the data/alpaca folder.
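Given the sampled indices, the corresponding Alpaca subset could be exported along these lines. This is a sketch: the Alpaca data is the standard single JSON list of instruction records, but the function name and file paths are hypothetical.

```python
import json

def export_subset(alpaca_path, indices, out_path):
    # Alpaca data is one JSON list of instruction records; keep only
    # the sampled ones, preserving the original order.
    with open(alpaca_path) as f:
        data = json.load(f)
    subset = [data[i] for i in sorted(set(indices))]
    with open(out_path, "w") as f:
        json.dump(subset, f, indent=2)
    return len(subset)
```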
Training
We use stanford_alpaca to finetune the 7B LLaMA model. Please follow the instructions in that repo to run the finetuning. For the 1K sampled dataset, we use the following command to finetune the model; the hyper-parameters come from the LIMA paper.
We would like to especially thank Zangwei Zheng for his help on the implementation of DQ in language tasks and Ge Yan for his advice on the mathematical proof of the submodular part.
Citation
If you find this work helpful, please cite:
@article{zhou2023dataset,
title={Dataset Quantization},
author={Zhou, Daquan and Wang, Kai and Gu, Jianyang and Peng, Xiangyu and Lian, Dongze and Zhang, Yifan and You, Yang and Feng, Jiashi},
journal={arXiv preprint arXiv:2308.10524},
year={2023}
}
Dataset Quantization for both Vision and Language Tasks
Official implementation of “Dataset Quantization”.
Dataset Quantization
Daquan Zhou*, Kai Wang*, Jianyang Gu*, Xiangyu Peng, Dongze Lian, Yifan Zhang, Yang You+, Jiashi Feng+ (*Equal Contribution, +Corresponding Author)
Highlight ✨
Getting Started
Set up the environment:
Prepare the pretrained MAE model for the image reconstruction.
DQ for Image Classification
Overview
Data Preparation
Place the CIFAR-10 data under ~/data_cifar and the ImageNet data under ~/data_imagenet.
Quantization
CIFAR-10
We provide the selected 12.5% indices in the data/cifar10 folder, which can be directly used for evaluation.
ImageNet
Training
For data keep ratios higher than 10%, we use a batch size of 128; otherwise, we use a batch size of 16 for sufficient training.
Note that the final data keep ratio is the product of the fraction used in bin sampling and the patch keep ratio used in image reconstruction. For example, sampling 20% of the data and keeping 50% of the patches yields a final keep ratio of 10%.
CIFAR-10
ImageNet
DQ for Instruction Fine-tuning
Evaluation
We use the instruct-eval repo to evaluate the finetuned model. Please follow the instructions in that repo to run the evaluation.
Acknowledgement
This project is mainly developed based on the following repos: