Towards Video Text Visual Question Answering: Benchmark and Baseline [Paper] Accepted by NeurIPS 2022.
Description
A novel task named Video Text Visual Question Answering (ViteVQA for short), which aims to answer questions by jointly reasoning over textual and visual information in a given video.
The first ViteVQA benchmark dataset, named the Multi-category Multi-frame Multi-resolution Multi-modal benchmark for ViteVQA (M4-ViteVQA for short).
Some examples from our M4-ViteVQA dataset: ‘A’ indicates the answers returned by TextVQA models, and wrong answers are colored in red. Leveraging temporal, textual, and visual information in the video is the only way to answer the question correctly.
Major features for ViteVQA:
Challenging: Solving ViteVQA requires jointly exploiting textual and visual information as well as temporal logic across video frames and events.
General: As an extension of the TextVQA task, it is more general and has wider applications, since it can answer temporal questions and handle video streams.
Major features for M4-ViteVQA:
Multi-category: M4-ViteVQA consists of 7,620 video clips from nine categories (shopping, traveling, driving, vlog, sport, advertisement, movie, game, and talking).
Multi-resolution: M4-ViteVQA covers three resolutions (720p, 1080p, and 1176×664).
Three settings: two generic QA settings and one domain adaptation setting that requires handling unseen content and entirely different category-specific questions.
Tasks and Metrics
We define two tasks with three settings for the ViteVQA problem: a regular QA task (Task1) and a domain adaptation task (Task2).
In Task1, the model is trained and tested on all nine categories of M4-ViteVQA, which is the regular setting. To meet different requirements on model robustness, we provide two data splits for Task1: Task1Split1, which is split at the level of the 7,620 cropped video clips, and Task1Split2, which is split at the level of the 1,150 raw videos.
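The intuition behind the two splits can be sketched as follows: splitting at the raw-video level guarantees that clips cropped from the same source video never land on both sides of the split. This is a minimal illustration, assuming (hypothetically) that each clip record carries the ID of its raw source video; the helper below is not part of the released toolkit:

```python
import random

def split_by_raw_video(clips, train_ratio=0.8, seed=0):
    """Split (clip_id, raw_video_id) pairs so that all clips from one
    raw video stay on the same side of the train/test split."""
    raw_ids = sorted({raw for _, raw in clips})
    rng = random.Random(seed)
    rng.shuffle(raw_ids)
    cut = int(len(raw_ids) * train_ratio)
    train_raw = set(raw_ids[:cut])
    train = [c for c in clips if c[1] in train_raw]
    test = [c for c in clips if c[1] not in train_raw]
    return train, test
```

A clip-level split, by contrast, would shuffle the clips directly, so near-duplicate frames from one raw video could leak between train and test.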
In Task2, the model is trained on seven categories and tested on the remaining two. Task2 requires the model to handle unseen content and entirely different category-specific questions, which is very challenging. Note that for Task2 we also provide an extra set, which can be used in different ways (e.g., semi-supervised or weakly supervised learning) to improve the adaptation ability of the model.
Two metrics are used to evaluate model performance on this benchmark: the accuracy metric used in existing TextVQA benchmarks, and the average normalized Levenshtein similarity (ANLS).
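As a reference, here is a minimal sketch of per-question ANLS, assuming the usual definition from the ST-VQA benchmark (case-insensitive comparison, similarity threshold tau = 0.5); the function and variable names are our own, not part of the official evaluation code:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls_score(prediction: str, ground_truths, tau: float = 0.5) -> float:
    """Best normalized similarity over all ground truths, zeroed below tau."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        denom = max(len(p), len(g))
        sim = 1.0 if denom == 0 else 1.0 - levenshtein(p, g) / denom
        best = max(best, sim)
    return best if best >= tau else 0.0

def average_anls(predictions, gt_lists) -> float:
    """Mean per-question ANLS over the whole evaluation set."""
    return sum(anls_score(p, g) for p, g in zip(predictions, gt_lists)) / len(predictions)
```

For example, a prediction of "helo" against the ground truth "hello" has one edit over a length of five, so its score is 0.8; a completely different answer falls below the threshold and scores 0.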
Citation
@article{zhao2022towards,
title={Towards video text visual question answering: benchmark and baseline},
author={Zhao, Minyi and Li, Bingjia and Wang, Jie and Li, Wanqing and Zhou, Wenjing and Zhang, Lan and Xuyang, Shijie and Yu, Zhihang and Yu, Xinkun and Li, Guangze and others},
journal={Advances in Neural Information Processing Systems},
volume={35},
pages={35549--35562},
year={2022}
}
License
The researcher shall use the M4-ViteVQA dataset only for non-commercial algorithm research and educational purposes. The researcher cannot use the M4-ViteVQA dataset for any other purpose, including but not limited to distribution and commercial usage.
The researcher takes full responsibility for his or her use of the M4-ViteVQA dataset and shall defend and indemnify the dataset authors, including their affiliates, employees, trustees, officers, and agents, against any and all claims arising from the researcher’s use of the M4-ViteVQA dataset.
The researcher agrees and confirms that authors reserve the right to terminate the researcher’s access to the M4-ViteVQA dataset at any time.
If the researcher is employed by a for-profit business entity, the researcher’s employer shall also be bound by these terms and conditions, and the researcher hereby shall represent that he or she is fully authorized to enter into this agreement on behalf of such employer.
Download Links and Accessibility
The download links of annotations are given as follows (v0.0.2):
Task1Split1Train | Task1Split1Val | Task1Split1Test
Task1Split2Train | Task1Split2Val | Task1Split2Test
Task2Train | Task2Val | Task2Test | Task2Extra
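Once downloaded, the annotation files can be inspected with a few lines of Python. The snippet below is a hypothetical sketch only: the actual schema is defined by the released files, and the field names used here (`question`, `answers`) are assumptions for illustration, not the documented format.

```python
import json

def load_annotations(path):
    """Load a M4-ViteVQA annotation file (assumed JSON list of dicts) and
    keep only entries that carry both a question and at least one answer.
    Field names are hypothetical; check the released files for the schema."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [d for d in data if d.get("question") and d.get("answers")]
```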
The following features are provided:
Instructions:
Baseline
We provide T5-ViteVQA as a baseline for ViteVQA.
Table Ranking
TodoList
Organization
Affiliations:
Fudan University, ByteDance
Authors:
Minyi Zhao, Bingjia Li, Jie Wang, Wanqing Li, Wenjing Zhou, Lan Zhang, Shijie Xuyang, Zhihang Yu, Xinkun Yu, Guangze Li, Aobotao Dai, Shuigeng Zhou
Contact Us
Feel free to contact us if you have any problems! zhaomy20@fudan.edu.cn or bjli20@fudan.edu.cn