We introduce YOLOE (ye), a highly efficient, unified, and open object detection and segmentation model that, like the human eye, can see anything under different prompt mechanisms, such as texts, visual inputs, and a prompt-free paradigm, with zero inference and transfer overhead compared with closed-set YOLOs.
Abstract
Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or a prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose the Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present the Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For the prompt-free scenario, we introduce the Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with $3\times$ less training cost and $1.4\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 $AP^b$ and 0.4 $AP^m$ gains over closed-set YOLOv8-L with nearly $4\times$ less training time.
Performance
Zero-shot detection evaluation
Fixed AP is reported on LVIS minival set with text (T) / visual (V) prompts.
Training time is measured for text prompts with detection on 8 Nvidia RTX 4090 GPUs.
FPS is measured on an Nvidia T4 GPU with TensorRT and on an iPhone 12 with CoreML, respectively.
For training data, OG denotes Objects365v1 and GoldG.
After re-parameterization, YOLOE becomes a standard YOLO model, with zero inference and transfer overhead.
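For intuition, here is a minimal PyTorch sketch (not the actual implementation) of the re-parameterization idea: the refined prompt embeddings are computed once offline and folded into an ordinary 1x1 convolution, so the deployed head has the same cost as a closed-set YOLO head. All shapes, the auxiliary network, and variable names below are illustrative assumptions.

import torch
import torch.nn as nn

# Hypothetical shapes: C = feature channels, K = number of prompts (classes).
C, K = 256, 80
text_embeddings = torch.randn(K, C)  # pretrained textual embeddings, one per prompt
aux_net = nn.Sequential(nn.Linear(C, C), nn.SiLU(), nn.Linear(C, C))  # lightweight auxiliary network (training-time only)

# Offline: refine the embeddings once and fold them into a fixed 1x1 conv.
with torch.no_grad():
    refined = aux_net(text_embeddings)                      # (K, C)
    cls_conv = nn.Conv2d(C, K, kernel_size=1, bias=False)
    cls_conv.weight.copy_(refined.view(K, C, 1, 1))

# Online: the deployed classification head is a plain conv, i.e., identical cost to a closed-set YOLO head.
features = torch.randn(1, C, 80, 80)                        # dummy feature map
class_logits = cls_conv(features)                           # (1, K, 80, 80)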
conda create -n yoloe python=3.10 -y
conda activate yoloe
# If you clone this repo, please use this
pip install -r requirements.txt
# Or you can also directly install the repo by this
pip install git+https://github.com/THU-MIG/yoloe.git#subdirectory=third_party/CLIP
pip install git+https://github.com/THU-MIG/yoloe.git#subdirectory=third_party/ml-mobileclip
pip install git+https://github.com/THU-MIG/yoloe.git#subdirectory=third_party/lvis-api
pip install git+https://github.com/THU-MIG/yoloe.git
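# Download the MobileCLIP-B(LT) checkpoint used as the text encoder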
wget https://docs-assets.developer.apple.com/ml-research/datasets/mobileclip/mobileclip_blt.pt
Demo
If desired objects are not identified, please set a smaller confidence threshold, e.g., for visual prompts with handcrafted shapes or cross-image prompts.
Text prompt
# For models with l scale, please change the initialization by referring to the comments at Line 549 in ultralytics/nn/modules/head.py
# If you want to train YOLOE only for detection, you can use `train.py`
python train_seg.py
Visual prompt
# For visual prompt, because only SAVPE is trained, we can adopt the detection pipeline with less training time
# First, obtain the detection model
python tools/convert_segm2det.py
# Then, train the SAVPE module
python train_vp.py
# After training, please use tools/get_vp_segm.py to add the segmentation head
# python tools/get_vp_segm.py
Prompt free
# Generate LVIS with single class for evaluation during training
python tools/generate_lvis_sc.py
# Similar to visual prompt, because only the specialized prompt embedding is trained, we can adopt the detection pipeline with less training time
python tools/convert_segm2det.py
python train_pe_free.py
# After training, please use tools/get_pf_free_segm.py to add the segmentation head
# python tools/get_pf_free_segm.py
Export
After re-parameterization, YOLOE-v8 / YOLOE-11 can be exported in the same format as YOLOv8 / YOLO11, with zero inference overhead.
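For reference, a re-parameterized checkpoint can typically be exported with the standard ultralytics export API; the sketch below is illustrative and the checkpoint filename is a placeholder.

from ultralytics import YOLO

# Placeholder path: a YOLOE checkpoint already re-parameterized into plain YOLOv8 / YOLO11 form.
model = YOLO("yoloe-v8l-seg-reparam.pt")

# Export to ONNX; other formats (e.g., TensorRT engine, CoreML) use the same API.
model.export(format="onnx")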
Citation
If our code or models help your work, please cite our paper:
@misc{wang2025yoloerealtimeseeing,
title={YOLOE: Real-Time Seeing Anything},
author={Ao Wang and Lihao Liu and Hui Chen and Zijia Lin and Jungong Han and Guiguang Ding},
year={2025},
eprint={2503.07465},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.07465},
}
YOLOE: Real-Time Seeing Anything
Official PyTorch implementation of YOLOE.
Comparison of performance, training cost, and inference efficiency between YOLOE (Ours) and YOLO-Worldv2 in terms of open text prompts.
YOLOE: Real-Time Seeing Anything.

Ao Wang*, Lihao Liu*, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding
Performance
Zero-shot detection evaluation
Fixed AP is reported on the LVIS minival set with text (T) / visual (V) prompts.
Zero-shot segmentation evaluation
Results are reported on the LVIS val set with text (T) / visual (V) prompts.
Prompt-free evaluation
Results are reported on the LVIS minival set, and FPS is measured on an Nvidia T4 GPU with PyTorch.
Downstream transfer on COCO
Installation
You could also quickly try YOLOE for prediction and transferring using colab notebooks.
A conda virtual environment is recommended.
Demo
Prediction
For yoloe-(v8s/m/l)/(11s/m/l)-seg, models can also be automatically downloaded using from_pretrained.
Text prompt
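As a starting point, text-prompted inference might look like the sketch below, assuming the ultralytics-style YOLOE interface with set_classes and get_text_pe; the checkpoint and image paths are placeholders, and the exact API may differ between repo versions.

from ultralytics import YOLOE

model = YOLOE("yoloe-v8l-seg.pt")  # placeholder checkpoint name

# Set free-form text prompts as the open-vocabulary classes.
names = ["person", "bus", "traffic light"]
model.set_classes(names, model.get_text_pe(names))

results = model.predict("image.jpg", conf=0.25)  # lower conf if desired objects are missed
results[0].show()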
Visual prompt
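For visual prompts, the sketch below assumes the ultralytics-style interface where box prompts are passed via a visual_prompts dict and a dedicated predictor; the import path, argument names, and coordinates are assumptions and may not match this repo exactly.

import numpy as np
from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOEVPSegPredictor  # assumed import path

model = YOLOE("yoloe-v8l-seg.pt")  # placeholder checkpoint name

# One box prompt (x1, y1, x2, y2) and its prompt class index; values are illustrative.
visual_prompts = dict(bboxes=np.array([[100.0, 150.0, 400.0, 500.0]]), cls=np.array([0]))

results = model.predict("image.jpg", visual_prompts=visual_prompts, predictor=YOLOEVPSegPredictor)
results[0].show()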
Prompt free
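In the prompt-free setting, a prompt-free checkpoint predicts over the built-in vocabulary directly; the sketch below uses a placeholder checkpoint name.

from ultralytics import YOLOE

model = YOLOE("yoloe-v8l-seg-pf.pt")  # placeholder name for a prompt-free checkpoint

results = model.predict("image.jpg")
results[0].show()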
Transferring
After pretraining, YOLOE-v8 / YOLOE-11 can be re-parameterized into the same architecture as YOLOv8 / YOLO11, with zero overhead for transferring.
Linear probing
Only the last conv, i.e., the prompt embedding, is trainable.
Full tuning
All parameters are trainable, for better performance.
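For intuition, here is a generic PyTorch sketch of the two transfer modes (not the repo's training scripts; the attribute path to the final classification conv is hypothetical):

import torch.nn as nn

def set_transfer_mode(model: nn.Module, mode: str = "linear_probe") -> None:
    """Freeze everything except the prompt-embedding conv (linear probing), or unfreeze all (full tuning)."""
    if mode == "linear_probe":
        for p in model.parameters():
            p.requires_grad = False
        # Hypothetical attribute: the last conv holding the prompt embedding.
        for p in model.head.cls_conv.parameters():
            p.requires_grad = True
    elif mode == "full_tune":
        for p in model.parameters():
            p.requires_grad = True
    else:
        raise ValueError(f"unknown mode: {mode}")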
Validation
Data
Please use minival.txt with background images for evaluation.
Zero-shot evaluation on LVIS
Please run python val.py for text prompts and python val_vp.py for visual prompts.
For Fixed AP, please refer to the comments in val.py and val_vp.py, and use tools/eval_fixed_ap.py for evaluation.
Prompt-free evaluation
Downstream transfer on COCO
Training
The training includes three stages:
Data
For annotations, you can directly use our preprocessed ones or use the following script to obtain the processed annotations with segmentation masks.
Then, please generate the data and embedding cache for training.
Finally, please download MobileCLIP-B(LT) for the text encoder.
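To illustrate what the text-embedding cache holds, prompts can be encoded offline with MobileCLIP. This is a hedged sketch assuming ml-mobileclip's create_model_and_transforms / get_tokenizer API; the 'mobileclip_b' config name for the B(LT) checkpoint is an assumption, and the repo's own cache-generation scripts should be preferred.

import torch
import mobileclip  # provided by the vendored third_party/ml-mobileclip package

# 'mobileclip_b' is an assumed config name for the B(LT) checkpoint downloaded above.
model, _, _ = mobileclip.create_model_and_transforms("mobileclip_b", pretrained="mobileclip_blt.pt")
tokenizer = mobileclip.get_tokenizer("mobileclip_b")

prompts = ["person", "bus"]
with torch.no_grad():
    text_features = model.encode_text(tokenizer(prompts))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)  # L2-normalize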
Text prompt
Visual prompt
Prompt free
Benchmark
Please refer to benchmark.sh. For the prompt-free setting, please refer to tools/benchmark_pf.py.
Acknowledgement
The code base is built with ultralytics, YOLO-World, MobileCLIP, lvis-api, CLIP, and GenerateU.
Thanks for the great implementations!