
Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization

⚠️ Note: This is a reference implementation and will be maintained for one year from the release date. For long-term updates and continued development, please refer to the official repository: https://github.com/yutao1024/RACLO.

This project introduces RVMS-Bench, a novel benchmark for evaluating open-domain video retrieval, and RACLO, an agentic framework designed to mimic the human cognitive process of searching and localizing video memories in the real world.

💡 Project Overview

Traditional video retrieval typically relies on a closed-set pool of pre-downloaded candidates with limited query dimensions. To break this assumption, we introduce the RVMS paradigm — a novel pipeline that leverages rich memory contexts and multi-hop reasoning to operate directly on the open web, targeting real-world long videos and specific key frames.

1. The RACLO Agent Framework

RACLO is built on a “Recall-Search-Verify” architecture that tackles open-world video retrieval in three progressive stages:

  • Abductive Query Reasoning: Instead of standard keyword matching, the agent’s “brain” translates fragmented, multi-dimensional human memory cues into optimized search queries and semantic associations.
  • ReAct-Driven Candidate Sourcing: Using an “Observe-Think-Act” loop, the agent interacts with search engines (e.g., YouTube via SerpApi) to discover, filter, and fetch candidate audio-visual streams.
  • Dual-Granularity Verification: A parallel mechanism that authenticates the video at a macro level by matching the overall global impression while simultaneously grounding the exact frame at a micro level using key moments, temporal context, and audio cues.

2. The RVMS-Bench Dataset

To properly evaluate real-world retrieval, we constructed RVMS-Bench:

  • Scale & Diversity: 1,440 rigorously verified, high-quality samples sourced from 20 diverse web video categories (Animation, Tech, Variety Shows, etc.).
  • Cognitive Dimension Tasks: The dataset features 9 distinct retrieval tasks built around 4 human memory cues: Global Impression (G), Key Moment (K), Temporal Context (T), and Auditory Memory (A).
  • Bias-Free Distribution: Strictly balanced across task types, video topics, and duration intervals (from under 3 minutes to 1 hour).
  • Rigorous Pipeline: Generated via Gemini 3 Pro and subjected to strict human verification to guarantee semantic uniqueness and eliminate model hallucinations.
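
For intuition, a sample might combine two of the four cues as sketched below. All field names and cue texts here are hypothetical placeholders; the released schema lives in data_json/video_dataset.json.

```python
# Hypothetical RVMS-Bench entry (field names and values are illustrative
# only; consult data_json/video_dataset.json for the released schema).
sample = {
    "video_id": "00EojgxB-8g",
    "url": "https://www.youtube.com/watch?v=00EojgxB-8g",
    "task": "G+K",                      # which memory cues this task combines
    "cues": {
        "G": "a studio tech review of a foldable phone",
        "K": "the host drops the phone and the screen cracks",
        "T": None,                      # cues unused by this task are empty
        "A": None,
    },
    "duration_bucket": "3-10 min",
    "gt_frame": "images/youtube_00EojgxB-8g.jpg",
}

def active_cues(entry: dict) -> list[str]:
    """List which of the G/K/T/A memory cues this sample actually uses."""
    return [k for k, v in entry["cues"].items() if v]
```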

🛠️ Developer Guide

Follow these steps to set up the environment, configure your agents, and run the pipeline.

Step 1: Environment Setup

Clone the repository and spin up an isolated Conda environment.

git clone https://github.com/yutao1024/RACLO.git
cd RACLO

conda create -n RACLO python=3.11.5 -y
conda activate RACLO
pip install -r requirements.txt

Note: You must also install system-level dependencies for video and audio processing.

sudo apt update
sudo apt install -y ffmpeg nodejs npm
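
As a quick sanity check after installation, the snippet below verifies that the required binaries are reachable on PATH. It uses only the Python standard library and does not attempt any installation itself.

```python
import shutil

def check_system_deps(tools=("ffmpeg", "node", "npm")) -> dict:
    """Report which required system binaries are resolvable on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}

if __name__ == "__main__":
    for tool, found in check_system_deps().items():
        print(f"{tool}: {'OK' if found else 'MISSING'}")
```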

Step 2: Acquire the Data

Download the RVMS-Bench dataset from RVMS-Bench Data. Once extracted, organize your project directory to match the following schema:

RACLO/
├── data_json/ 
│   └── video_dataset.json        # Core metadata
├── images/ 
│   ├── youtube_00EojgxB-8g.jpg   # Ground truth frames mapping to video IDs
│   └── ...
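
A small layout check can catch a mis-extracted archive before a long run. The sketch below assumes each metadata entry exposes a "video_id" field that maps to images/youtube_<id>.jpg, mirroring the tree above; that field name is an assumption, so adjust it to the released JSON.

```python
import json
from pathlib import Path

def check_layout(root: str) -> list[str]:
    """Return a list of problems found in a RACLO data directory.

    Assumes entries in video_dataset.json carry a "video_id" that maps
    to images/youtube_<id>.jpg (hypothetical schema; adapt as needed).
    """
    base = Path(root)
    meta = base / "data_json" / "video_dataset.json"
    if not meta.exists():
        return [f"missing {meta}"]
    problems = []
    for entry in json.loads(meta.read_text(encoding="utf-8")):
        frame = base / "images" / f"youtube_{entry.get('video_id', '')}.jpg"
        if not frame.exists():
            problems.append(f"missing ground-truth frame: {frame.name}")
    return problems
```

Running `check_layout("./RACLO")` after extraction should return an empty list when every ground-truth frame is in place.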

Step 3: System Configuration

RACLO requires API keys for reasoning and search, as well as cookies for stable video extraction. Create or edit ./config/config.yaml:

# VLM's API
GEMINI_API_KEY: "your_gemini_key"
OPENAI_API_KEY: "your_openai_key"
ANTHROPIC_API_KEY: "your_anthropic_key"
DASHSCOPE_API_KEY: "your_qwen_key"
HUNYUAN_API_KEY: "your_hunyuan_key"
# Search Tooling
SERPAPI_KEY: "your_serpapi_key"
# yt-dlp Configuration
COOKIES_FILE: "./config/cookies.txt" # Export YouTube cookies here
NODE_PATH: "/usr/bin/node"
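
Before launching a long run, it helps to confirm no key is still a placeholder. Since the config above is flat `KEY: "value"` pairs, the stdlib sketch below parses it without PyYAML; which keys are mandatory depends on the model you choose, so the REQUIRED list and the `your_`-prefix placeholder check are assumptions to adapt.

```python
import re

# Assumed-minimal key set; extend for the providers you actually use.
REQUIRED = ["GEMINI_API_KEY", "SERPAPI_KEY", "COOKIES_FILE", "NODE_PATH"]

def load_flat_yaml(text: str) -> dict:
    """Parse flat `KEY: "value"` lines, ignoring comments and blanks.

    Sufficient for the flat config shown above; use PyYAML for
    anything nested.
    """
    cfg = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline comments
        m = re.match(r'([A-Z_]+):\s*"?([^"]*)"?$', line)
        if m:
            cfg[m.group(1)] = m.group(2)
    return cfg

def missing_keys(cfg: dict) -> list[str]:
    """Flag required keys that are absent or still template placeholders."""
    return [k for k in REQUIRED
            if not cfg.get(k) or cfg[k].startswith("your_")]
```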

Tip: Prompt configuration. Modify config/prompt.yaml to adjust the SEARCH_PROMPT. In particular, update the instruction on keyword quantity to control how many search queries are generated per attempt, and change the number of examples in the === EXAMPLES === section to match your requirements.


🚀 Running the Pipeline

Full Inference Execution

Launch run_inference.py to trigger the end-to-end RACLO pipeline (Reasoning → Web Search → Preprocessing → Verification → Grounding).

python run_inference.py \
    --input_path ./data_json/video_dataset.json \
    --config_path ./config/config.yaml \
    --prompt_path ./config/prompt.yaml \
    --model_name gemini-3-pro \
    --model_support_audio True

Supported Models: gemini-3-pro, gemini-2.5-pro, gpt-5.2, gpt-5-mini, gpt-4o, claude-4.0-Sonnet, hunyuan-video, qwen2.5-vl-72b, qwen3-vl-235b.

Evaluation

Once inference is complete, compute Video Match Accuracy (rule-based or model-based) and VLM Grounding Accuracy across the G, K, T, and A memory-cue tasks.

python evaluation.py \
  --model_name gemini-3-pro \
  --dataset_path ./data/rvms_test_data.json \
  --result_dir ./results/vlm_test \
  --output_file ./results/evaluation_report.txt
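
Conceptually, per-cue accuracy is just hits over totals within each cue group. The sketch below assumes each result record carries a cue label and a boolean match flag; both field names are hypothetical, so map them onto the real result JSON.

```python
from collections import defaultdict

def accuracy_by_cue(results: list[dict]) -> dict:
    """Aggregate video-match accuracy per memory cue (G/K/T/A).

    Each result is assumed to look like {"cue": "G", "video_match": True}
    (hypothetical fields; adapt to the files under --result_dir).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["cue"]] += 1
        hits[r["cue"]] += bool(r["video_match"])
    return {cue: hits[cue] / totals[cue] for cue in totals}
```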

📚 Citation

If you find this work useful for your research or applications, please cite it with the following BibTeX:

@misc{yu2026closedpoolvideoretrievalbenchmark,
      title={Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization}, 
      author={Tao Yu and Yujia Yang and Haopeng Jin and Junhao Gong and Xinlong Chen and Yuxuan Zhou and Shanbin Zhang and Jiabing Yang and Xinming Wang and Hongzhu Yi and Ping Nie and Kai Zou and Zhang Zhang and Yan Huang and Liang Wang and Yeshani and Ruiwen Tao and Jin Ma and Haijin Liang and Jinwen Luo},
      year={2026},
      eprint={2602.10159},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.10159}, 
}

⚖️ Disclaimer & License

  • This project is intended for academic research purposes only. All video content is sourced from publicly available YouTube videos; copyrights belong to the original creators.
  • The RVMS-Bench dataset contains only video URLs, text descriptions, and ground-truth keyframe annotations. No original video files are distributed. This follows the standard practice adopted by established video benchmarks such as Kinetics (CVPR 2017), ActivityNet Captions (ICCV 2017), and QVHighlights (NeurIPS 2021).
  • Video downloading occurs only at runtime on the user’s local machine for evaluation purposes and is governed by the user’s own compliance with applicable terms of service and copyright laws.
  • This project is released under the Apache License 2.0. Use of the dataset and code for any commercial purpose is not endorsed or supported.