Towards Video Text Visual Question Answering: Benchmark and Baseline [Paper] Accepted by NeurIPS 2022.
Description
A novel task named Video Text Visual Question Answering (ViteVQA for short), which aims to answer questions by jointly reasoning over textual and visual information in a given video.
The first ViteVQA benchmark dataset, named the Multi-category Multi-frame Multi-resolution Multi-modal benchmark for ViteVQA (M4-ViteVQA for short).
Some examples from our M4-ViteVQA dataset: ‘A’ indicates the answers returned by TextVQA models, and wrong answers are colored in red. Leveraging temporal, textual, and visual information in the video is the only way to answer the question correctly.
Major features for ViteVQA:
Challenging: Solving ViteVQA requires jointly exploiting textual and visual information as well as temporal logic across video frames and events.
General: As an extension of the TextVQA task, it is more general and has wider applications, since it can answer temporal questions and handle video streams.
Major features for M4-ViteVQA:
Multi-category: M4-ViteVQA consists of 7,620 video clips from nine categories (shopping, traveling, driving, vlog, sport, advertisement, movie, game, and talking).
Multi-resolution: M4-ViteVQA covers three resolutions (720p, 1080p, and 1176×664).
Three settings: two generic QA settings and one domain adaptation setting that requires handling unseen content and entirely different category-specific questions.
Tasks and Metrics
We define two tasks with three settings for the ViteVQA problem: a regular QA task (Task1) and a domain adaptation task (Task2).
In Task1, the model is trained and tested on all nine categories of M4-ViteVQA, which is the regular setting. To meet different requirements on model robustness, we provide two data splits for Task1: Task1Split1, which is split at the level of the 7,620 cropped video clips, and Task1Split2, which is split at the level of the 1,150 raw videos.
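The intuition behind the two splits can be sketched as follows: splitting at the raw-video level guarantees that clips cropped from the same source video never land on both sides of the split. This is a minimal illustration, assuming (hypothetically) that each clip record carries the ID of its raw source video; the helper below is not part of the released toolkit:

```python
import random

def split_by_raw_video(clips, train_ratio=0.8, seed=0):
    """Split (clip_id, raw_video_id) pairs so that all clips from one
    raw video stay on the same side of the train/test split."""
    raw_ids = sorted({raw for _, raw in clips})
    rng = random.Random(seed)
    rng.shuffle(raw_ids)
    cut = int(len(raw_ids) * train_ratio)
    train_raw = set(raw_ids[:cut])
    train = [c for c in clips if c[1] in train_raw]
    test = [c for c in clips if c[1] not in train_raw]
    return train, test
```

A clip-level split, by contrast, would shuffle the clips directly, so near-duplicate frames from one raw video could leak between train and test.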
In Task2, the model is trained on seven categories and tested on the remaining two. Task2 requires the model to handle unseen content and entirely different category-specific questions, which is very challenging. Note that for Task2 we also provide an extra set, which can be used in different ways (e.g., semi-supervised or weakly supervised learning) to improve the adaptation ability of the model.
Two metrics are used to evaluate model performance on this benchmark: the accuracy metric used in existing TextVQA benchmarks, and the average normalized Levenshtein similarity (ANLS).
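As a reference, here is a minimal sketch of per-question ANLS, assuming the usual definition from the ST-VQA benchmark (case-insensitive comparison, similarity threshold tau = 0.5); the function and variable names are our own, not part of the official evaluation code:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls_score(prediction: str, ground_truths, tau: float = 0.5) -> float:
    """Best normalized similarity over all ground truths, zeroed below tau."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        denom = max(len(p), len(g))
        sim = 1.0 if denom == 0 else 1.0 - levenshtein(p, g) / denom
        best = max(best, sim)
    return best if best >= tau else 0.0

def average_anls(predictions, gt_lists) -> float:
    """Mean per-question ANLS over the whole evaluation set."""
    return sum(anls_score(p, g) for p, g in zip(predictions, gt_lists)) / len(predictions)
```

For example, a prediction of "helo" against the ground truth "hello" has one edit over a length of five, so its score is 0.8; a completely different answer falls below the threshold and scores 0.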
Citation
@article{zhao2022towards,
title={Towards video text visual question answering: benchmark and baseline},
author={Zhao, Minyi and Li, Bingjia and Wang, Jie and Li, Wanqing and Zhou, Wenjing and Zhang, Lan and Xuyang, Shijie and Yu, Zhihang and Yu, Xinkun and Li, Guangze and others},
journal={Advances in Neural Information Processing Systems},
volume={35},
pages={35549--35562},
year={2022}
}
License
The researcher shall use the M4-ViteVQA dataset only for non-commercial algorithm research and educational purposes. The researcher cannot use the M4-ViteVQA dataset for any other purpose, including but not limited to distribution and commercial usage.
The researcher takes full responsibility for his or her use of the M4-ViteVQA dataset and shall defend and indemnify the dataset authors, including their affiliates, employees, trustees, officers, and agents, against any and all claims arising from the researcher’s use of the M4-ViteVQA dataset.
The researcher agrees and confirms that authors reserve the right to terminate the researcher’s access to the M4-ViteVQA dataset at any time.
If the researcher is employed by a for-profit business entity, the researcher’s employer shall also be bound by these terms and conditions, and the researcher hereby shall represent that he or she is fully authorized to enter into this agreement on behalf of such employer.
Download Links and Accessibility
The download links of annotations are given as follows (v0.0.2):
Task1Split1Train | Task1Split1Val | Task1Split1Test
Task1Split2Train | Task1Split2Val | Task1Split2Test
Task2Train | Task2Val | Task2Test | Task2Extra
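Once downloaded, the annotation files can be inspected with a few lines of Python. The snippet below is a hypothetical sketch only: the actual schema is defined by the released files, and the field names used here (`question`, `answers`) are assumptions for illustration, not the documented format.

```python
import json

def load_annotations(path):
    """Load a M4-ViteVQA annotation file (assumed JSON list of dicts) and
    keep only entries that carry both a question and at least one answer.
    Field names are hypothetical; check the released files for the schema."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [d for d in data if d.get("question") and d.get("answers")]
```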
The following features are provided:
Instructions:
Baseline
We provide T5-ViteVQA as a baseline for ViteVQA.
Table Ranking
TodoList
Organization
Affiliations:
Fudan University, ByteDance
Authors:
Minyi Zhao, Bingjia Li, Jie Wang, Wanqing Li, Wenjing Zhou, Lan Zhang, Shijie Xuyang, Zhihang Yu, Xinkun Yu, Guangze Li, Aobotao Dai, Shuigeng Zhou
Contact Us
Feel free to contact us if you have any problems! zhaomy20@fudan.edu.cn or bjli20@fudan.edu.cn