If you use X-modaler in your research, please use the following BibTeX entry.
@inproceedings{Xmodaler2021,
author = {Yehao Li and Yingwei Pan and Jingwen Chen and Ting Yao and Tao Mei},
title = {X-modaler: A Versatile and High-performance Codebase for Cross-modal Analytics},
booktitle = {Proceedings of the 29th ACM international conference on Multimedia},
year = {2021}
}
X-modaler
X-modaler is a versatile and high-performance codebase for cross-modal analytics (e.g., image captioning, video captioning, vision-language pre-training, visual question answering, visual commonsense reasoning, and cross-modal retrieval). This codebase unifies comprehensive high-quality modules in state-of-the-art vision-language techniques, which are organized in a standardized and user-friendly fashion.
The original paper can be found here.
Installation
See installation instructions.
Requirements
Getting Started
See Getting Started with X-modaler.
Training & Evaluation in Command Line
We provide a script, “train_net.py”, that can train every config provided in X-modaler. You may also want to use it as a reference when writing your own training script.
To train a model (e.g., UpDown) with “train_net.py”, first set up the corresponding datasets following the datasets instructions, then run the script with that model's config file.
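As a rough illustration of launching “train_net.py” from Python, the sketch below assembles a command line for it. The `--num-gpus` and `--config-file` flags and the `configs/image_caption/updown.yaml` path are assumptions that are not confirmed by the text here, and `build_train_command` is a hypothetical helper, so check `python train_net.py --help` and the repository's config directory before relying on them.

```python
import shlex
import sys


def build_train_command(config_file, num_gpus=1, script="train_net.py"):
    """Build the argument list for a training run (hypothetical helper).

    The --num-gpus and --config-file flags are assumed, not confirmed;
    verify them against the script's own --help output.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        sys.executable, script,
        "--num-gpus", str(num_gpus),
        "--config-file", config_file,
    ]


# Assumed config path for the UpDown image captioning model.
cmd = build_train_command("configs/image_caption/updown.yaml", num_gpus=4)

# Print the command in a shell-safe form so it can be inspected first.
print(shlex.join(cmd))

# To actually launch training, pass the list to subprocess:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when a config path contains spaces.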
Model Zoo and Baselines
A large set of baseline results and trained models are available here.
Image Captioning on MSCOCO (Cross-Entropy Loss)
Image Captioning on MSCOCO (CIDEr Score Optimization)
Video Captioning on MSVD
Video Captioning on MSR-VTT
Visual Question Answering
Caption-based Image Retrieval on Flickr30k
Visual Commonsense Reasoning
License
X-modaler is released under the Apache License, Version 2.0.
Citing X-modaler
If you use X-modaler in your research, please cite it using the BibTeX entry given at the top of this document.