ESOD: Efficient Small Object Detection on High-Resolution Images
This repository is the official implementation of Efficient Small Object Detection on High-Resolution Images.
Installation
Python>=3.6.0 is required, with all dependencies in requirements.txt installed, including PyTorch>=1.7:
Data Preparation
We currently support VisDrone, UAVDT, and TinyPerson datasets. Follow the instructions below to prepare datasets.
Data Prepare

Darknet Format: The Darknet framework locates the label file for each image by replacing the last instance of /images/ in the image path with /labels/ (and the image extension with .txt). For example:

dataset/images/im0.jpg
dataset/labels/im0.txt

The images and labels from VisDrone, UAVDT, and TinyPerson are all organized in this format.
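The pairing rule above can be sketched in a few lines of Python (YOLOv5-style code calls this img2label_paths; the function name here is illustrative):

```python
import os

def img2label_path(img_path):
    """Map a Darknet-style image path to its label path: replace the last
    '/images/' segment with '/labels/' and swap the extension for '.txt'."""
    sa = os.sep + "images" + os.sep
    sb = os.sep + "labels" + os.sep
    head, sep, tail = img_path.rpartition(sa)
    label = head + sb + tail if sep else img_path
    return os.path.splitext(label)[0] + ".txt"
```

For example, img2label_path("dataset/images/im0.jpg") yields "dataset/labels/im0.txt". Note that only the last /images/ segment is replaced, so nested image directories are handled correctly.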
Ground-Truth Heatmap: We recommend leveraging the Segment Anything Model (SAM) to introduce a precise shape prior into the GT heatmaps for training. Install SAM first:

cd third_party/segment-anything
pip install -e .

Dataset - VisDrone: Download the data, and ensure the subsets under the /path/to/visdrone directory are as follows:

Then make a soft link to your directory and run the scripts/data_prepare.py script to reorganize the images and labels:

Training
Run the commands below to reproduce results on the datasets, e.g., VisDrone. First download the pretrained weights (e.g., YOLOv5m) and put them in the weights/pretrained/ directory.

Training on a Single GPU

Here is the default setting to adapt YOLOv5m to VisDrone using our ESOD framework:
When multiple GPUs are available, the DistributedDataParallel mode can be applied to speed up training. Simply set GPUS according to your devices, e.g., GPUS=0,1,2,3.

We support the YOLOv5 series (YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x). After downloading them to weights/, simply change MODEL (as well as IMAGE_SIZE) to apply a different model:

Besides, we also support the RetinaNet, RTMDet, YOLOv8, and GPViT models. You can download the pre-trained weights, convert them to ESOD initialization, and train the models on specific datasets:
Feel free to set MODEL as retinanet, rtmdet, or gpvit (yolov8m does not require model conversion). Detailed instructions will come soon.

Besides VisDrone, we also support model training on the UAVDT and TinyPerson datasets:
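As a sketch of how the GPUS, MODEL, and IMAGE_SIZE variables fit together, the snippet below derives one DDP process per listed GPU; the torchrun launcher line is an assumption based on PyTorch's standard tooling, not this repo's actual script:

```shell
# Illustrative values only -- adjust to your hardware and chosen model.
GPUS=0,1,2,3
MODEL=yolov5l
IMAGE_SIZE=1920
# One DDP process per GPU; torchrun is PyTorch's standard multi-process launcher.
NPROC=$(echo "$GPUS" | awk -F, '{print NF}')
echo "Launching $NPROC processes on GPUs $GPUS with $MODEL at image size $IMAGE_SIZE"
```

Counting the comma-separated entries in GPUS keeps the process count and the visible devices in sync automatically.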
Testing
Vanilla Evaluation
Run the commands below to compute evaluation results (AP, AP50, empty rate, missing rate) with the integrated utils/metrics.py.

For computational analysis (including GFLOPs and FPS), use the following command:
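For intuition, FPS is simply iterations divided by wall-clock time. A minimal timing harness (with a stand-in workload, not the repo's profiler) looks like:

```python
import time

def measure_fps(run_once, n_iters=50):
    """Time n_iters calls of run_once and return frames per second."""
    start = time.perf_counter()
    for _ in range(n_iters):
        run_once()
    elapsed = time.perf_counter() - start
    return n_iters / elapsed

# Stand-in workload; replace with a real model forward pass (and GPU
# synchronization, e.g. torch.cuda.synchronize) for meaningful numbers.
fps = measure_fps(lambda: sum(i * i for i in range(10000)))
```

When benchmarking on GPU, remember that CUDA kernels launch asynchronously, so timing without synchronization underestimates latency.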
Official Evaluation
The data organization for the official evaluation tools differs from the Darknet format, so an intermediate data conversion is required. Run the command below to get the results in Darknet format.
Then run the specified script data_convert.py for the corresponding data format and perform the official evaluation.

UAVDT: coming soon.
TinyPerson: coming soon.
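The core of such a conversion is a coordinate change: Darknet labels store class cx cy w h normalized to [0, 1], while VisDrone's evaluation format expects absolute bbox_left, bbox_top, width, height values in pixels. A sketch of that step (the real data_convert.py may handle additional fields such as score and category):

```python
def darknet_to_visdrone(cx, cy, w, h, img_w, img_h):
    """Convert a normalized center-format box to absolute left-top-width-height."""
    bw, bh = w * img_w, h * img_h     # scale to pixel units
    left = cx * img_w - bw / 2        # center x minus half-width
    top = cy * img_h - bh / 2         # center y minus half-height
    return left, top, bw, bh
```

For example, a box (0.5, 0.5, 0.25, 0.5) in a 1000x800 image maps to (375.0, 200.0, 250.0, 400.0).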
Inference
The script detect.py runs inference and saves the results to runs/detect. --view-cluster will draw the generated patches in green boxes and save the heat maps from both prediction and ground truth.

Pretrained Weights

Please find the pre-trained weights on Google Drive.
Acknowledgment
A large part of the code is borrowed from YOLO. Many thanks for this wonderful work.
Citation

If you find this work useful in your research, please kindly cite the paper:

@article{liu2025esod,
  title={ESOD: Efficient Small Object Detection on High-Resolution Images},
  author={Liu, Kai and Fu, Zhihang and Jin, Sheng and Chen, Ze and Zhou, Fan and Jiang, Rongxin and Chen, Yaowu and Ye, Jieping},
  journal={IEEE Transactions on Image Processing},
  volume={34},
  pages={183--195},
  year={2025}
}