MVP: Multi-View Prediction for Stable GUI Grounding

🎯 Overview

MVP (Multi-View Prediction) is a training-free framework that addresses coordinate prediction instability in GUI grounding models. It aggregates predictions from multiple attention-guided cropped views and separates stable coordinates from outliers. On ScreenSpot-Pro, this raises overall accuracy by 10.3 to 18.7 points across the four models reported in the Performance section.

Figure: The MVP framework consists of two main components: (1) Attention-Guided View Proposal that generates diverse cropped views based on instruction-to-image attention, and (2) Multi-Coordinates Clustering that ensembles predictions by selecting the centroid of the densest spatial cluster.

🚀 Quick Start

Installation

git clone https://github.com/ZJUSCL/MVP.git
cd MVP
pip install -r requirements.txt

Download Datasets from Hugging Face

# Install huggingface_hub for dataset download
pip install huggingface_hub

# Download UI-Vision dataset (dataset repos need --repo-type dataset; the CLI defaults to model repos)
huggingface-cli download ServiceNow/ui-vision --repo-type dataset --local-dir ./data/ui-vision

# Download ScreenSpot-Pro dataset
huggingface-cli download likaixin/ScreenSpot-Pro --repo-type dataset --local-dir ./data/screenspot-pro

# Download OSWorld-G dataset
huggingface-cli download MMInstruction/OSWorld-G --repo-type dataset --local-dir ./data/osworld-g
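
The same dataset downloads can also be scripted from Python with `huggingface_hub.snapshot_download`. The following is a minimal sketch, not part of this repository: the helper name `download_datasets` and its `dry_run` flag are hypothetical, and it assumes the three repositories are hosted as Hugging Face dataset repos (hence `repo_type="dataset"`).

```python
# Repo IDs and target directories copied from the CLI commands above.
DATASETS = {
    "ServiceNow/ui-vision": "./data/ui-vision",
    "likaixin/ScreenSpot-Pro": "./data/screenspot-pro",
    "MMInstruction/OSWorld-G": "./data/osworld-g",
}


def download_datasets(datasets=DATASETS, dry_run=False):
    """Download each dataset repo; with dry_run=True only return the planned calls.

    Hypothetical helper for illustration; it is not shipped with MVP.
    """
    plan = [
        {"repo_id": repo_id, "repo_type": "dataset", "local_dir": local_dir}
        for repo_id, local_dir in datasets.items()
    ]
    if not dry_run:
        # Imported lazily so a dry run works without the package installed.
        from huggingface_hub import snapshot_download

        for kwargs in plan:
            snapshot_download(**kwargs)
    return plan
```

Calling `download_datasets(dry_run=True)` lists what would be fetched without touching the network; calling `download_datasets()` performs the downloads.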

Download Models from Hugging Face

# Download UI-TARS-1.5-7B model
huggingface-cli download ByteDance-Seed/UI-TARS-1.5-7B --local-dir ./models/UI-TARS-1.5-7B

# Download GTA1-7B model
huggingface-cli download HelloKKMe/GTA1-7B --local-dir ./models/GTA1-7B

# Download Qwen3VL-8B model
huggingface-cli download Qwen/Qwen3-VL-8B-Instruct --local-dir ./models/Qwen3-VL-8B-Instruct

# Download Qwen3VL-32B model
huggingface-cli download Qwen/Qwen3-VL-32B-Instruct --local-dir ./models/Qwen3-VL-32B-Instruct

Alternative: Using Git LFS

# For large models, you can also use Git LFS
git lfs install
git clone https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B ./models/UI-TARS-1.5-7B
git clone https://huggingface.co/HelloKKMe/GTA1-7B ./models/GTA1-7B
git clone https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct ./models/Qwen3-VL-8B-Instruct
git clone https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct ./models/Qwen3-VL-32B-Instruct

📊 Performance

ScreenSpot-Pro Benchmark Results

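| Model | Development | Creative | CAD | Scientific | Office | OS | Overall |
|---|---|---|---|---|---|---|---|
| UI-TARS-1.5-7B | 36.4 | 38.1 | 20.5 | 49.6 | 68.7 | 31.5 | 41.9 |
| + MVP | 51.8 (↑15.4) | 50.0 (↑11.9) | 53.3 (↑32.8) | 57.9 (↑8.3) | 73.0 (↑4.3) | 54.6 (↑23.1) | 56.1 (↑14.2) |
| GTA1-7B | 43.4 | 44.8 | 44.4 | 55.9 | 74.8 | 35.2 | 49.8 |
| + MVP | 58.9 (↑15.5) | 52.6 (↑7.8) | 60.2 (↑15.8) | 63.0 (↑7.1) | 79.1 (↑4.3) | 56.1 (↑20.9) | 61.7 (↑11.9) |
| Qwen3VL-8B | 52.8 | 49.1 | 49.0 | 56.7 | 75.2 | 50.5 | 55.0 |
| + MVP | 61.5 (↑8.7) | 60.2 (↑11.1) | 61.3 (↑12.3) | 67.3 (↑10.6) | 82.6 (↑7.4) | 62.8 (↑12.3) | 65.3 (↑10.3) |
| Qwen3VL-32B | 43.1 | 54.4 | 57.5 | 62.6 | 73.0 | 42.3 | 55.3 |
| + MVP | 71.6 (↑28.5) | 69.3 (↑14.9) | 74.7 (↑17.2) | 70.5 (↑7.9) | 87.4 (↑14.4) | 73.5 (↑31.2) | 74.0 (↑18.7) |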

🛠️ Evaluation Scripts

Run All Experiments

We provide four evaluation scripts, one per model:

# Run experiments for GTA1-7B
./eval_gta1.sh

# Run experiments for Qwen3VL-8B
./eval_qwen3vl8b.sh

# Run experiments for UI-TARS-1.5-7B
./eval_uitars_1_5.sh

# Run experiments for Qwen3VL-32B
./eval_qwen3vl32b.sh

🔧 Core Components

1. Attention-Guided View Proposal

  • Generates multiple cropped views based on instruction-to-image attention
  • Focuses on relevant regions while maintaining context
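
To make the view-proposal idea concrete, here is a minimal sketch of how crop boxes might be derived from an attention map. It is an illustration under stated assumptions, not the repository's implementation: the function names `propose_views` and `to_full_image`, the parameters `num_views` and `crop_frac`, and the "top-scoring patch cell" heuristic are all hypothetical. Each crop is larger than a single patch, so the surrounding context is preserved.

```python
def propose_views(attention, image_size, num_views=3, crop_frac=0.5):
    """Propose crop boxes centred on the highest-attention patch cells.

    attention:  2D list of scores, one per image patch (rows x cols).
    image_size: (width, height) of the screenshot in pixels.
    num_views, crop_frac: hypothetical knobs, not values from the paper.
    Returns a list of (left, top, right, bottom) boxes.
    """
    width, height = image_size
    rows, cols = len(attention), len(attention[0])
    crop_w, crop_h = int(width * crop_frac), int(height * crop_frac)

    # Rank patch cells by attention score, highest first.
    cells = sorted(
        ((attention[r][c], r, c) for r in range(rows) for c in range(cols)),
        reverse=True,
    )

    boxes = []
    for _, r, c in cells:
        # Centre of this patch cell in pixel coordinates.
        cx = (c + 0.5) * width / cols
        cy = (r + 0.5) * height / rows
        # Clamp so the crop stays inside the image.
        left = int(min(max(cx - crop_w / 2, 0), width - crop_w))
        top = int(min(max(cy - crop_h / 2, 0), height - crop_h))
        box = (left, top, left + crop_w, top + crop_h)
        if box not in boxes:  # skip duplicate crops from neighbouring cells
            boxes.append(box)
        if len(boxes) == num_views:
            break
    return boxes


def to_full_image(point, box):
    """Map a crop-local (x, y) prediction back to full-image coordinates."""
    return point[0] + box[0], point[1] + box[1]
```

Predictions made on each crop have to be shifted back with something like `to_full_image` before they can be compared across views.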

2. Multi-Coordinates Clustering

  • Aggregates predictions from multiple views
  • Uses density-based clustering to identify stable coordinates
  • Selects centroid of densest cluster as final prediction
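
The clustering step can be sketched as follows. This README does not specify the exact clustering algorithm or its thresholds, so the sketch swaps in a simple stand-in: points within a radius `eps` of each other are linked into groups, and the largest group is treated as the densest cluster. The function name `densest_cluster_centroid` and the default `eps=30.0` pixels are assumptions made for illustration.

```python
from math import dist


def densest_cluster_centroid(points, eps=30.0):
    """Return the centroid of the largest group of eps-linked points.

    points: iterable of (x, y) predictions in full-image pixel coordinates.
    eps:    hypothetical linking radius in pixels (not a value from the paper).
    """
    pts = [(float(x), float(y)) for x, y in points]
    if not pts:
        raise ValueError("need at least one predicted point")

    cluster_of = [None] * len(pts)
    clusters = []
    for seed in range(len(pts)):
        if cluster_of[seed] is not None:
            continue
        # Grow a new cluster from this seed by following eps-links.
        cluster_of[seed] = len(clusters)
        members, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            for j in range(len(pts)):
                if cluster_of[j] is None and dist(pts[i], pts[j]) <= eps:
                    cluster_of[j] = len(clusters)
                    members.append(j)
                    frontier.append(j)
        clusters.append(members)

    # Treat the cluster with the most members as the densest one.
    best = max(clusters, key=len)
    cx = sum(pts[i][0] for i in best) / len(best)
    cy = sum(pts[i][1] for i in best) / len(best)
    return cx, cy
```

For example, with four predictions near (100, 100) and one stray prediction at (500, 400), the stray point forms its own single-member cluster and does not influence the returned centroid.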

📄 Citation

If you find our work useful, please cite our paper:

@article{mvp2024,
  title={MVP: Multiple View Prediction Improves GUI Grounding},
  author={Yunzhu Zhang and Zeyu Pan and Zhengwen Zeng and Shuheng Shen and Changhua Meng and Linchao Zhu},
  journal={arXiv preprint arXiv:2512.08529},
  year={2025},
  url={https://arxiv.org/abs/2512.08529}
}

📜 License

This project is licensed under the Apache License 2.0.

📧 Contact

For questions about this work, please open an issue or contact yunzhuzhang0918@gmail.com.