This is the package for Mutual Information Co-training (MICO), an end-to-end method for topic sharding. MICO uses BERT to generate sentence representations and uses them to perform query routing and document assignment. The document assignment module outputs clusters of almost equal size, and the query routing module routes each query to the cluster containing most (if not all) of its relevant documents. MICO achieves very high performance for topic sharding.
This package can be tested through the example usage below.
Usage
You can save the command below as a bash file and run it in the current folder. You can also find and run it in ./example/scripts/run_mico.sh. It will take less than 5 minutes to finish running.
The results will be saved in ./results/. For this example experiment, the folder example_pair_BERT-finetune_layer-1_CLS_TOKEN_maxlen64_bs64_lr-bert5e-6_lr2e-4_warmup1000_entropy5_seed1 contains the final evaluation metrics in metrics.json. The documents assigned to each cluster are saved as a dictionary in clustered_docs.json. The log files for training and evaluation are *.log. The model is saved as *.pt. The folder ./log contains Tensorboard results for visualization.
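As a quick check on the "almost equal-sized clusters" property, the sketch below summarizes cluster sizes from a dictionary shaped like clustered_docs.json. The layout assumed here (cluster ID mapped to a list of documents) is an assumption that this README does not confirm, and the function name and toy data are made up for illustration; adjust the parsing to what the file actually contains.

```python
import json


def cluster_size_summary(clusters):
    """Summarize how evenly documents are spread over clusters.

    ASSUMPTION: `clusters` maps a cluster ID to a list of documents.
    The real layout of clustered_docs.json may differ.
    """
    sizes = {cid: len(docs) for cid, docs in clusters.items()}
    smallest = min(sizes.values())
    largest = max(sizes.values())
    mean = sum(sizes.values()) / len(sizes)
    return {
        "num_clusters": len(sizes),
        "min_size": smallest,
        "max_size": largest,
        # 1.0 means perfectly balanced clusters.
        "max_over_mean": largest / mean,
    }


# Toy stand-in for the parsed contents of clustered_docs.json (made-up data).
toy_clusters = json.loads('{"0": ["d1", "d2"], "1": ["d3", "d4"], "2": ["d5"]}')
summary = cluster_size_summary(toy_clusters)
print(summary)
```

A `max_over_mean` value close to 1.0 would indicate that the largest cluster is not much bigger than the average one.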
The dataset_name in the training command is set as example since we have an example dataset saved in ../example/data/example_dataset/. You can change the train_folder_path and test_folder_path according to your needs.
During training, batch_size is the batch size per GPU. If a given batch_size works well on a machine with one GPU, there is no need to change it when switching to a machine with several GPUs (each with the same amount of GPU memory). This is because we use DistributedDataParallel in PyTorch for multi-GPU training: we launch one subprocess per GPU, and each subprocess maintains its own dataloader and counts its own epochs (which is why it is more common to track the iteration number than the epoch number). On a 4-GPU machine, finishing one epoch in every process means training the model for 4 epochs in total. For a GPU with 16GB of memory, batch_size=64 is a good first choice.
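The arithmetic behind the per-GPU batch size and the epoch count can be sketched as follows. This is only an illustration of the description above, not code from the package: the function name and the dataset size of 6400 are hypothetical, and the iteration count assumes the last partial batch is kept.

```python
def ddp_training_totals(per_gpu_batch_size, num_gpus, dataset_size, epochs_per_process):
    """Arithmetic for one-process-per-GPU training as described above.

    Each process iterates over the whole dataset with its own dataloader,
    so passes over the data add up across processes.
    """
    # Iterations one process needs for one of its own epochs (ceiling division,
    # ASSUMING the last partial batch is not dropped).
    iters_per_epoch = -(-dataset_size // per_gpu_batch_size)
    return {
        # Examples consumed per optimization step, summed over all GPUs.
        "examples_per_step": per_gpu_batch_size * num_gpus,
        "iterations_per_process": iters_per_epoch * epochs_per_process,
        "total_passes_over_data": num_gpus * epochs_per_process,
    }


# batch_size=64 per GPU on a 4-GPU machine; dataset size is a made-up number.
totals = ddp_training_totals(
    per_gpu_batch_size=64, num_gpus=4, dataset_size=6400, epochs_per_process=1
)
print(totals)
```

With these numbers, each process runs 100 iterations for its single epoch, while the model as a whole sees the data 4 times.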
During testing, we use DataParallel in PyTorch for better efficiency: the dataset is traversed only once across all GPUs, rather than once per process as with DistributedDataParallel. Here batch_size is the total across all GPUs. For testing you can usually set a much larger batch_size than for training; for example, with four GPUs (each with 16GB of memory), batch_size=2048 works. You can also test a trained model directly by setting --eval_only.
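To see what a total batch_size means per GPU at test time, the sketch below splits one batch across GPUs. It approximates the way a batch is chunked along its first dimension (equal chunks of ceiling size, with a smaller last chunk); the function name is hypothetical and the exact splitting inside PyTorch is an assumption here.

```python
def data_parallel_chunks(batch_size, num_gpus):
    """Per-GPU chunk sizes when one batch is split along dimension 0.

    ASSUMPTION: chunks have size ceil(batch_size / num_gpus), and the last
    chunk takes whatever remains.
    """
    chunk = -(-batch_size // num_gpus)  # ceiling division
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


even_split = data_parallel_chunks(2048, 4)
uneven_split = data_parallel_chunks(10, 4)
print(even_split, uneven_split)
```

So batch_size=2048 on four GPUs corresponds to 512 examples per GPU, which is why a much larger value than the per-GPU training batch size is feasible.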
Visualize results with Tensorboard
To visualize the curves of the metrics calculated during training and evaluation, please use Tensorboard (for PyTorch we use TensorboardX, which is installed in the setup section).
The results of each experiment are saved in the folder specified by --model_path in the bash commands. That folder also contains log files in text format. After running the following command, you can open your browser and go to localhost:14095 to view the training results.
Memory profiling
Although we have adopted several techniques to reduce memory usage, you may still run into memory problems when running on large-scale datasets. You can use the memory profiling method below to estimate how much memory you will need to run MICO.
Some tips:
Setting num_worker=0 is a good way to save memory, and it has almost no effect on training speed.
Running MICO on more GPUs automatically creates more subprocesses, and each subprocess may consume a lot of memory, so memory usage grows linearly with the number of GPUs. If needed, you can set export CUDA_VISIBLE_DEVICES=0 to train on only 1 GPU and save memory.
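The linear relationship in the second tip can be turned into a rough estimate. The sketch below reads the GPU list from a CUDA_VISIBLE_DEVICES-style setting and multiplies a per-process memory figure by the number of visible GPUs. Both helper names and the 12 GiB per-process figure are hypothetical; measure the per-process value on your own data.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU IDs listed in CUDA_VISIBLE_DEVICES, or None if unset."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # Unset: no restriction is applied.
    return [part.strip() for part in value.split(",") if part.strip()]


def estimated_host_memory_gib(per_process_gib, num_gpus):
    """Linear estimate: one subprocess per GPU, each using similar memory."""
    return per_process_gib * num_gpus


# Pass a plain dict instead of os.environ to keep the example side-effect free.
one_gpu = visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0"})
four_gpus = visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1,2,3"})

# 12 GiB per process is a made-up figure for illustration only.
print(estimated_host_memory_gib(12, len(one_gpu)))
print(estimated_host_memory_gib(12, len(four_gpus)))
```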
To use the memory profiling method below, please make sure that the Python package memory_profiler is installed. (If not, you can install it with pip install memory_profiler.) It tracks the memory usage of Python code. For more details, please see https://pypi.org/project/memory-profiler/.
To track memory usage, you can try the command below.
mprof run --interval=10 --multiprocess --include-children './your_bash_file.sh'
While the bash script is running, you can plot the memory usage over time with the command below. Please replace mprofile_***.dat with the name of the profile results you want to plot (the latest .dat file is used if no file is specified). The figure will be saved as memory_profile_result.png.
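Besides plotting, the peak value in a profile can be read directly from the .dat file, which gives a single number for the memory estimate. The sketch below assumes that samples are stored as lines of the form `MEM <MiB> <timestamp>` and that other line types can be skipped; that file layout is an assumption about memory_profiler's output, and the sample contents are made up.

```python
def peak_memory_mib(dat_text):
    """Return the largest MEM sample in an mprof .dat file, in MiB.

    ASSUMPTION: samples are lines of the form `MEM <MiB> <timestamp>`;
    any other lines (for example CMDLINE or CHLD) are ignored.
    Returns None if no MEM sample is found.
    """
    peak = None
    for line in dat_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "MEM":
            value = float(fields[1])
            peak = value if peak is None else max(peak, value)
    return peak


# Made-up sample standing in for the contents of a mprofile_***.dat file.
sample = """CMDLINE ./your_bash_file.sh
MEM 512.250000 1660000000.0000
CHLD 4242 300.000000 1660000000.0000
MEM 2048.500000 1660000010.0000
MEM 1024.000000 1660000020.0000
"""
peak = peak_memory_mib(sample)
print(peak)
```

For a real profile, read the file contents with `open(path).read()` and pass them to the function.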
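Setting up a new EC2 machine
To set up a new EC2 machine to run the scripts, please use the commands below.
After downloading the data, you can replace the two folders (for training and testing data) in ./example/data/ with the two large-scale datasets. Then, you can modify and run the script ./example/scripts/run_mico.sh.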
License
This project is licensed under the Apache-2.0 License.
Mutual Information Co-training
This repository is the source code for the paper:
MICO: Selective Search with Mutual Information Co-training
In Proceedings of the International Conference on Computational Linguistics (COLING), 2022
Zhanyu Wang, Xiao Zhang, Hyokun Yun, Choon Hui Teo and Trishul Chilimbi