CST Anti-UAV: A Thermal Infrared Benchmark for Tiny UAV Tracking in Complex Scenes

Bin Xie¹, Congxuan Zhang¹,🖂, Fagan Wang¹, Peng Liu², Feng Lu¹, Zhen Chen¹, Weiming Hu³
¹Nanchang Hangkong University, Nanchang, China
²Beihang University, Beijing, China
³Chinese Academy of Sciences, Beijing, China
zcxdsg@nchu.edu.cn

Abstract

🖂 Corresponding author.

The widespread application of Unmanned Aerial Vehicles (UAVs) has raised serious public safety and privacy concerns, making UAV perception crucial for anti-UAV tasks. However, existing UAV tracking datasets predominantly feature conspicuous objects and lack diversity in scene complexity and attribute representation, limiting their applicability to real-world scenarios. To overcome these limitations, we present CST Anti-UAV, a new thermal infrared dataset specifically designed for Single Object Tracking (SOT) in Complex Scenes with Tiny UAVs (CST). It contains 220 video sequences with over 240k high-quality bounding box annotations and highlights two key properties: a large number of tiny UAV targets and diverse, complex scenes. To the best of our knowledge, CST Anti-UAV is the first dataset to incorporate complete manual frame-level attribute annotations, enabling precise evaluations under varied challenges. To conduct an in-depth performance analysis for CST Anti-UAV, we evaluate 20 existing SOT methods on the proposed dataset. Experimental results demonstrate that tracking tiny UAVs in complex environments remains a challenge, as the state-of-the-art method achieves only 35.92% state accuracy, much lower than the 67.69% observed on the Anti-UAV410 dataset. These findings underscore the limitations of existing benchmarks and the need for further advancements in UAV tracking research. The CST Anti-UAV benchmark will be publicly released to foster the development of more robust SOT methods and to drive innovation in anti-UAV systems.

Figure 1: Examples of sequence clips from the CST Anti-UAV dataset. The UAV is annotated with the red bounding box. The most notable feature of CST Anti-UAV is complex scenes, as shown at the bottom, including the occlusion (C), complex dynamic background (D), scale variation (S), thermal crossover (T), and out-of-view (V). The background covers buildings (B), and urban areas (U).

Table 1: Statistical comparison of CST Anti-UAV with current popular datasets for thermal infrared single object tracking in terms of the number of video sequences, bounding boxes, object size, and attributes. The number of tiny-sized objects in CST Anti-UAV is much larger than in the other thermal infrared tracking datasets. In addition, CST Anti-UAV is specifically designed for tracking UAVs in complex scenes and provides complete manual frame-level attribute annotations.

Benchmark	Sequences	BBoxes	Sequence-level attributes	Sequence-level Sequences/Totals	Frame-level attributes	Frame-level Frames/Totals	Tiny objects	Small objects	Normal objects	Large objects
PTB-TIR [35]	60	30k	9	60/60	-	-	0	1883	6325	73925
LSOTB-TIR [36]	1400	606k	12	120/1400	-	-	0	7625	30910	568294
Anti-UAV [29]	318	293k	7	318/318	-	-	858	6419	88181	198317
Anti-UAV410 [26]	410	438k	6	410/410	6	701k/2628k	17071	126649	97942	196735
CST Anti-UAV	220	240k	6	220/220	6	1440k/1440k	78224	155202	11208	837

1 Introduction

UAVs have rapidly proliferated due to their low cost and versatile applications in domains such as surveillance, delivery, and imaging [12, 30, 25, 62]. However, incidents of unauthorized or malicious drone flights have infringed on privacy and endangered public safety, raising serious security and privacy concerns [44]. Consequently, there is a growing need for anti-UAV systems capable of real-time detection and tracking of unauthorized UAVs. UAV tracking is a crucial component of such systems, providing the location and trajectory of the target drone for interception or other countermeasures.

Most existing SOT datasets [50, 51, 45, 33] are based on visible light images, leading to unreliable tracking results in low-light or foggy conditions. Although some Thermal Infrared (TIR) datasets [21, 20, 42, 52] have been proposed, they focus on general objects and may not be optimal for tiny UAV tracking. Anti-UAV [29] is the first dataset specifically designed for UAV tracking. Anti-UAV410 [26] was further expanded based on the Anti-UAV dataset by collecting some tiny UAVs and wild scenes. However, the existing datasets for tracking UAVs still have limitations as follows: 1) They lack a sufficient number of tiny UAV targets for effective training, thus limiting real-world applications. 2) The backgrounds of the existing datasets are relatively clean and quite simple, failing to capture the complexity of real-world tracking scenarios. 3) They have incomplete frame-level attribute annotations, hampering thorough evaluations of tracking methods.

To tackle the above problems, we propose a large-scale dataset for tracking UAVs in complex scenes, referred to as CST Anti-UAV. We use popular TIR single-object tracking datasets, including PTB-TIR [35], LSOTB-TIR [36], Anti-UAV [29], and Anti-UAV410 [26], as references to analyze the data statistics of the new CST Anti-UAV dataset. As shown in Table 1, the CST Anti-UAV dataset contains 220 sequences and over 240k high-quality bounding boxes, surpassing existing datasets in three aspects:

(1) A larger number of tiny-sized objects. Following Anti-UAV410, we classify objects into four size categories based on the diagonal length of their bounding boxes: tiny (0, 10], small (10, 30], normal (30, 50], and large (50, ∞). Our dataset contains 78,224 tiny objects, about 4.5 times as many as Anti-UAV410 (17,071), which has the most tiny objects among the existing datasets in Table 1. This sample size supports more robust training on tiny targets.

(2) The diversity of complex tracking scenes. In the real world, salient objects are rare, while complex and dynamic scenes are common. We are the first to introduce the Complex Dynamic Background (CDB) attribute, which covers backgrounds containing numerous dynamic distractors. Furthermore, real-world tracking scenarios often involve multiple challenges at once. Existing datasets lack diversity in this respect, which motivated us to collect sequences with complex challenge settings. Figure 1 shows examples of complex scenarios featuring multiple attributes and tiny-size objects.

(3) The completeness of frame-level annotations. We provide manual frame-level annotations for 6 attributes across all frames, resulting in a total of 1440k annotations. Unlike existing sequence-level annotations, frame-level annotations precisely reveal tracker responses to diverse challenges, offering more accurate guidance for future research.
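The size binning described in item (1) above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `size_category` and the (width, height) input are assumptions, and the thresholds are the bin edges quoted in the text.

```python
import math

# Inclusive upper bounds of the diagonal-length bins quoted in the text:
# tiny (0, 10], small (10, 30], normal (30, 50]; anything longer is large.
SIZE_BINS = [(10.0, "tiny"), (30.0, "small"), (50.0, "normal")]


def size_category(width: float, height: float) -> str:
    """Return the size category of a box from its diagonal length.

    `size_category` is a hypothetical helper, not part of any released toolkit.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    diagonal = math.hypot(width, height)
    for upper, name in SIZE_BINS:
        if diagonal <= upper:
            return name
    return "large"


if __name__ == "__main__":
    for w, h in [(6, 8), (12, 9), (30, 40), (40, 40)]:
        print((w, h), size_category(w, h))
```

Because the bins are closed on the right, a box whose diagonal is exactly 10 falls in the tiny category and one whose diagonal is exactly 50 falls in the normal category.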

To analyze the proposed dataset, we retrained 20 recent SOT methods on it and conducted extensive evaluations. The experimental results demonstrate that complex tracking scenes and tiny-size objects substantially degrade the performance of existing SOT methods. For example, the State Accuracy (SA) of SiamDT [26] drops from 67.69% on Anti-UAV410 to 35.84% on CST Anti-UAV, while that of GlobalTrack [27] declines from 66.42% to 35.92%, consistently revealing the challenges posed by the CST Anti-UAV dataset.

In summary, our main contributions are as follows:

• We present CST Anti-UAV, a large-scale thermal infrared dataset specifically tailored for UAV tracking. Characterized by tiny-sized objects and complex scenes, it aims to drive progress in anti-UAV research.

• To the best of our knowledge, CST Anti-UAV is the first dataset to provide complete manual frame-level attribute annotations, enabling fine-grained evaluations of existing methods under various challenging conditions.

• To conduct a comprehensive experimental analysis on CST Anti-UAV, we benchmarked a range of current tracking methods, providing important insights for future SOT direction.

2 Related Work

Single object tracking methods. Based on the latest developments in visual tracking, tracking methods can be classified into three groups. Transformer-based methods [9, 7, 57, 53, 16, 54, 34, 46, 8] leverage the self-attention mechanism to model long-range dependencies and capture global context. Siamese methods [60, 22, 61, 58, 56, 2, 11, 18, 32] employ twin networks to compare template and search features, balancing speed and accuracy. Other deep learning methods [4, 3, 37, 16, 15, 47, 40, 41] integrate diverse architectures, such as convolutional neural networks or attention mechanisms, to address specific tracking challenges.

Single object tracking benchmarks. TrackingNet [39] is a visible object tracking benchmark dataset based on YouTube-BoundingBoxes [43], containing over 30,000 video clips with 27 target categories. LaSOT [19] is a high-quality benchmark that includes 1,400 visible videos, with an average of around 2,500 frames per video. GOT10K [28] is a visible object tracking dataset that covers 563 types of objects, containing more than 1.5 million manually annotated bounding boxes. PTB-TIR [35] is a thermal infrared pedestrian tracking dataset that contains 60 sequences collected from various devices. LSOTB-TIR [36] is currently the largest TIR object tracking dataset with 1,400 sequences, defining 12 challenge attributes.

Benchmarks for anti-UAV. Although several anti-UAV datasets have been proposed, they are still far from being practically applicable. Anti-UAV [29] is a dataset with visible and infrared dual-mode information, which consists of 318 fully labeled videos, making a significant contribution to maintaining airspace safety. However, most objects in the dataset are relatively large, and the backgrounds are clean and simple. DUT Anti-UAV [63] covers two tasks, detection and tracking, with 10,000 images for detection and 20 sequences for tracking. However, this small number of tracking sequences is insufficient for training tracking methods. Anti-UAV410 [26] extends Anti-UAV to track drones in wild scenes. It collects 150k drone objects and merges them with the infrared portion of the Anti-UAV dataset, ultimately containing 410 sequences with 438k objects. The collected data includes some tiny objects and wild scenes. Unfortunately, the number of tiny objects is too limited, and the scenes are predominantly in wilderness areas, lacking truly complex urban areas with dynamic distractors such as pedestrians and vehicles. Sequences that exhibit only a single attribute fail to capture the complexity of real tracking scenes. Furthermore, the frame-level annotations are incomplete, hindering comprehensive evaluation of tracker performance across various attributes.

Given the limitations in existing datasets, we develop CST Anti-UAV, a comprehensive benchmark providing extensive tiny objects, complex real-world tracking scenarios, and full attribute annotations. The aim is to advance UAV tracking in complex scenes and bridge the gap toward real-world applications.

Figure 2: Main features and statistics of the proposed dataset. (a) Distribution of scene (outer) and object size (inner). (b) A comparison of the sequence-level attributes of existing anti-UAV tracking datasets and our proposed dataset, emphasizing their significance and applicability for practical anti-UAV tasks in terms of video count. The numbers from Zero to Five in the table represent the number of attributes contained in a sequence. (c) Comparative analysis of the relative size distributions of objects in Anti-UAV, Anti-UAV410, and CST Anti-UAV. (d) Statistics of frame-level attributes in CST Anti-UAV. The area of each circle denotes the number of total frames.

3 CST Anti-UAV Benchmark

3.1 Data Acquisition and Analysis

Large-scale sequences with high diversity. Existing datasets [29] are typically recorded with a rotating platform equipped with a static camera, so the acquisition scenes are limited to a few selected locations. Therefore, to ensure diversity, we collect data with both rotating platforms and professional drones equipped with infrared cameras (25 fps, 640 $\times$ 512). We carefully captured 220 sequences comprising over 240k high-quality bounding boxes. Sequence lengths range from 600 to 2,062 frames (1,132 frames on average), covering both short-term and extended tracking scenarios. Furthermore, we classified UAV objects into four size categories, with their proportional distribution illustrated in the inner ring of Figure 2 (a). Notably, small and tiny objects dominate, reflecting real-world tracking scenarios where UAVs are typically inconspicuous. Tiny objects are more difficult to track, while larger objects require fewer samples to achieve comparable performance. A key highlight is that our dataset contains about 4.5 times as many tiny objects as Anti-UAV410, which has the most among the existing datasets in Table 1. We further compare the object size distribution with the Anti-UAV [29] and Anti-UAV410 [26] datasets in Figure 2 (c).
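As a quick check of the size statistics quoted above, the following sketch recomputes the share of each size category and the tiny-object ratio from the counts in Table 1. The helper names `size_shares` and `tiny_ratio` are illustrative assumptions; only the counts come from the paper.

```python
# Object counts per size category, copied from Table 1.
SIZE_COUNTS = {
    "Anti-UAV": {"tiny": 858, "small": 6419, "normal": 88181, "large": 198317},
    "Anti-UAV410": {"tiny": 17071, "small": 126649, "normal": 97942, "large": 196735},
    "CST Anti-UAV": {"tiny": 78224, "small": 155202, "normal": 11208, "large": 837},
}


def size_shares(counts: dict) -> dict:
    """Return each category's share of all boxes, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def tiny_ratio(dataset: str, reference: str) -> float:
    """Return how many times more tiny objects `dataset` has than `reference`."""
    return SIZE_COUNTS[dataset]["tiny"] / SIZE_COUNTS[reference]["tiny"]


if __name__ == "__main__":
    for name, counts in SIZE_COUNTS.items():
        shares = size_shares(counts)
        print(name, {k: round(v, 1) for k, v in shares.items()})
    print(round(tiny_ratio("CST Anti-UAV", "Anti-UAV410"), 2))
```

With these counts, tiny and small objects together account for more than 95% of the CST Anti-UAV boxes, and the tiny-object ratio against Anti-UAV410 comes out at about 4.58, in line with the roughly 4.5 times stated in the text.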

Diverse and complex real-world scenarios. Existing datasets typically feature simple and clean backgrounds, whereas our dataset captures rich and complex real-world scenes through a year-long data acquisition process. We comprehensively cover movement patterns including close-range, long-range, approaching, and receding trajectories, as well as diverse scenes such as urban areas, buildings, mountains, and skies, as quantified in the outer ring of Figure 2 (a). The dataset further incorporates seasonal variations (summer, autumn, winter), lighting changes (day/night, sunny/cloudy), and extreme weather conditions such as wind and fog. The background includes multiple dynamic distractors, such as birds, airplanes, and, introduced here for the first time, moving pedestrians and vehicles. The complex backgrounds and significant weather or temperature variations in our dataset are crucial for training robust UAV tracking methods.

Table 2: Illustration of attribute annotation in CST Anti-UAV.

Attribute	Description
OC	Occlusion: the target is partially or fully occluded.
OV	Out-of-View: the target leaves the view.
SV	Scale Variation: the ratio of the bounding boxes of the first frame and the current frame is out of the range [0.66,1.5].
TC	Thermal Crossover: the target has a similar temperature to other objects or background surroundings.
DBC	Dynamic Background Clutter: there are dynamic changes in the background around the target.
CDB	Complex Dynamic Background: there are multiple dynamic non-target objects in the background.

Coarse-to-fine attribute annotation. The challenges in our dataset are summarized as six attributes: Occlusion (OC), Out-of-View (OV), Scale Variation (SV), Thermal Crossover (TC), Dynamic Background Clutter (DBC), and Complex Dynamic Background (CDB). Their detailed definitions are given in Table 2. In previous datasets, the backgrounds were mostly clean and simple. However, in real-world operational environments, including aerial surveying and traffic monitoring, backgrounds inevitably contain complex dynamic objects such as pedestrians and vehicles. We therefore propose the novel CDB attribute, which covers these dynamic objects. Figure 2 (b) compares the sequence-level attributes of CST Anti-UAV with those of existing anti-UAV datasets [29, 26]: the number of sequences for almost all attributes is significantly higher than in existing datasets. Notably, the CST Anti-UAV dataset contains 184 sequences with more than three attributes, compared to only 60 sequences in the Anti-UAV410 dataset. Furthermore, existing infrared tracking datasets annotate attributes only at the sequence level or, as in Anti-UAV410, only partially at the frame level (Table 1). In fact, only a few frames in a sequence may correspond to a specific challenge, so evaluating multiple challenges coarsely over an entire sequence leads to flawed analysis. We are the first to introduce complete frame-level attribute annotations. Figure 2 (d) shows the number of frames for each challenge attribute.
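Of the six attributes in Table 2, Scale Variation is the only one defined by a numeric rule, so it can be illustrated with a small per-frame check. The sketch below is an assumption-laden illustration: the helper `has_scale_variation` is hypothetical, boxes are taken as (x, y, width, height), and the ratio is computed on box areas, which Table 2 does not specify. In the dataset itself, all attributes were labeled manually.

```python
def has_scale_variation(first_box, current_box, low=0.66, high=1.5):
    """Flag a frame as SV when the first-to-current box ratio leaves [low, high].

    Boxes are (x, y, width, height). Comparing box *areas* is an assumption;
    Table 2 only speaks of "the ratio of the bounding boxes".
    """
    _, _, w0, h0 = first_box
    _, _, w, h = current_box
    if min(w0, h0, w, h) <= 0:
        raise ValueError("boxes must have positive width and height")
    ratio = (w0 * h0) / (w * h)
    return not (low <= ratio <= high)


if __name__ == "__main__":
    first = (100, 100, 10, 10)
    for current in [(102, 98, 10, 8), (90, 95, 20, 10), (120, 110, 5, 5)]:
        print(current, has_scale_variation(first, current))
```

A frame-level label like this changes from frame to frame within one sequence, which is the behavior that sequence-level annotation cannot express.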

3.2 High-quality Human Annotation

After collecting all the videos for the CST Anti-UAV dataset, we reviewed and edited them to reduce the number of inappropriate frames, resulting in over 240k frames. To ensure the quality of the annotations, we invested over 2,000 hours in the annotation process, which was divided into two tracks: 1) Bounding box annotation. Unlike existing datasets [29, 36] that rely on sparse annotation or algorithmic pre-labeling with manual correction, our annotation team manually obtained high-quality bounding boxes for each frame. Our verification team conducted continuous frame-by-frame checks until no issues remained. 2) Attribute annotation. To ensure consistency in evaluating the six proposed challenge attributes across frames, we assigned a single expert annotator to handle all attribute labeling, given that different annotators may exhibit varying tendencies. In total, we annotated 1440k attributes across all frames and provided 220 $\times$ 6 sequence-level attribute annotations.

3.3 Dataset Splitting

To enable fair comparisons in the evaluation of tracking methods and prevent overfitting, we split the CST Anti-UAV dataset into three sets: 120 sequences for training, 40 for validation, and 60 for testing, according to the following criteria: 1) Each subset encompasses all scene categories and object size ranges. 2) Challenges are distributed evenly across the three sets. 3) The test set remains strictly independent, while the training and validation sets derive from non-overlapping clips of shared sequences. The attribute distributions are illustrated in Figure 3. The distribution trends of the attributes are consistent across all subsets, which helps the training set capture the challenges of UAV tracking in real-world scenes.

Table 3: The performance of the state-of-the-art trackers on CST Anti-UAV and Anti-UAV410 datasets, evaluated by the metrics mSA, the Area Under the Curve (AUC) of Success Plots (S) and Precision Plots (P). All three metrics are reported in percentage form. 410 denotes the Anti-UAV410 dataset, while CST refers to the CST Anti-UAV dataset. The top three are highlighted in red, blue, and green, with bold text for emphasis. The asterisk (*) denotes an infrared method.

Methods	Source	mSA (train and test on 410)	P(AUC) (train and test on 410)	S(AUC) (train and test on 410)	mSA (train and test on CST)	P(AUC) (train and test on CST)	S(AUC) (train and test on CST)
GlobalTrack [27]	AAAI20	66.42	85.25	65.11	35.92	58.72	35.38
KYS [4]	ECCV20	45.12	61.31	44.30	26.28	40.20	25.93
PrDiMP [17]	CVPR20	59.31	76.77	59.09	29.76	42.55	29.28
Stark [57]	ICCV21	57.09	76.64	55.99	23.27	40.54	23.10
Stark-ST [57]	ICCV21	58.32	78.10	57.22	25.63	44.99	25.44
OSTrack-256 [59]	ECCV22	52.47	70.39	51.44	26.34	46.95	26.16
OSTrack-384 [59]	ECCV22	57.41	76.73	56.29	25.25	45.99	25.10
AiATrack [23]	ECCV22	58.31	78.15	57.19	26.83	47.65	26.64
ToMP [38]	CVPR22	56.44	73.70	55.28	22.84	39.81	22.66
Mixformer [13]	CVPR22	58.49	78.07	57.35	26.46	47.14	26.19
Mixformer-V2 [14]	NIPS23	61.68	81.41	60.50	29.39	52.37	29.09
DropTrack [49]	CVPR23	61.00	81.11	59.81	26.89	49.08	26.65
ARTrack-256 [48]	CVPR23	48.14	63.75	47.17	25.14	45.39	24.96
ARTrack-384 [48]	CVPR23	52.63	69.51	51.54	25.19	46.56	25.04
SeqTrack [10]	CVPR23	50.50	67.26	49.49	26.53	47.67	26.38
GRM [24]	CVPR23	50.79	68.36	49.80	26.96	48.70	26.80
ROMTrack [6]	ICCV23	55.08	72.62	53.98	27.43	48.20	27.23
HIPTrack [5]	CVPR24	57.78	77.59	56.64	25.62	47.48	25.42
AQATrack [55]	CVPR24	56.64	75.33	55.52	27.60	50.43	27.45
ARTrackV2 [1]	CVPR24	52.51	70.41	51.49	22.14	39.73	22.03
SiamDT* [26]	PAMI24	67.69	87.63	66.34	35.84	57.96	35.28
Refocus-TIR* [31]	TNNLS24	58.24	76.31	57.09	26.51	47.49	26.26

4 Experiments

To thoroughly analyze the proposed CST Anti-UAV dataset, we retrained 20 of the latest methods on both CST Anti-UAV and Anti-UAV410, followed by comprehensive performance evaluations.

4.1 Experimental Settings

Evaluation metrics. We use three common evaluation metrics, namely precision plot, success plot, and state accuracy, to assess tracker performance through One-Pass Evaluation (OPE). Precision plots and success plots are computed as in [50]. State accuracy (SA) follows the definition in [29] and is given in Eq. (1), where $IoU_{t}$ is the Intersection over Union (IoU) between the predicted bounding box and its corresponding ground truth in frame $t$, $v_{t}$ is the visibility flag of the ground-truth annotation, $\delta(\cdot)$ is an indicator that equals 1 when its condition holds and 0 otherwise, $p_{t}$ is the tracker's prediction for frame $t$, which contributes to the score only on frames where the target is not visible, and $T$ is the total number of frames. The mean state accuracy (mSA) is the average SA over all video sequences.

$\displaystyle SA=\sum_{t}\frac{IoU_{t}\times\delta\left(v_{t}>0\right)+p_{t}\times\left(1-\delta\left(v_{t}>0\right)\right)}{T}. \qquad (1)$
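To make Eq. (1) concrete, the sketch below computes SA for one sequence and mSA over several. It is an illustration under stated assumptions, not the official evaluation code: boxes are (x, y, width, height), an absent target or an "absent" prediction is encoded as `None`, and $p_{t}$ is taken to be 1 when the tracker reports no target on a frame where the target is absent and 0 otherwise.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def state_accuracy(preds: Sequence[Optional[Box]],
                   gts: Sequence[Optional[Box]]) -> float:
    """SA of one sequence following Eq. (1).

    Assumption: `None` in `gts` means v_t = 0 (target absent), and `None` in
    `preds` means the tracker reports that the target is absent.
    """
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and equally long")
    total = 0.0
    for pred, gt in zip(preds, gts):
        if gt is not None:
            # Visible frame: the score is the IoU (0 if nothing was predicted).
            total += iou(pred, gt) if pred is not None else 0.0
        else:
            # Absent frame: p_t is 1 only if the tracker also reports absence.
            total += 1.0 if pred is None else 0.0
    return total / len(gts)


def mean_state_accuracy(sequences: List[Tuple[Sequence[Optional[Box]],
                                              Sequence[Optional[Box]]]]) -> float:
    """mSA: the average SA over all (preds, gts) sequence pairs."""
    scores = [state_accuracy(p, g) for p, g in sequences]
    return sum(scores) / len(scores)
```

Under this reading, a tracker is rewarded both for overlapping the target when it is visible and for correctly reporting its absence, which is why out-of-view frames can score either very high or very low.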

Baseline trackers. We conducted comprehensive evaluations of 20 single object trackers on both CST Anti-UAV and Anti-UAV410 [26] datasets. The transformer trackers include Stark [57], Stark-ST [57], OSTrack [59], AiATrack [23], MixFormer [13], MixFormerV2 [14], DropTrack [49], ARTrack [48], SeqTrack [10], GRM [24], ROMTrack [6], HIPTrack [5], AQATrack [55] and ARTrackV2 [1]. The other deep-learning trackers include GlobalTrack [27], KYS [4], PrDiMP [17], ToMP [38], and Refocus-TIR [31]. Among the Siamese trackers, we selected only SiamDT [26], which is designed for TIR UAV tracking tasks. We conducted all experiments on a server equipped with 8 NVIDIA A100-PCIE-40GB GPUs and used the implementations, pre-trained models, and default configurations provided by the official code to ensure a fair comparison among the trackers.

4.2 Overall Performance

Table 3 presents the results on both Anti-UAV410 and CST Anti-UAV datasets, including state accuracy and the Area Under the Curve (AUC) of success and precision plots. Notably, trackers that achieve superior performance on Anti-UAV410 exhibit significant degradation when evaluated on the proposed CST Anti-UAV dataset. The poor performance on CST Anti-UAV is due not only to complex backgrounds and tiny objects in individual frames but also to object disappearance and reappearance over time. Based on the evaluation, SiamDT [26] and GlobalTrack [27] take the top two positions by a clear margin on both datasets. This indicates that trackers designed for long-term tracking have the potential to achieve better performance, as they search the whole image and can therefore recover the position of an object that reappears after disappearing.

In addition to comparing different trackers, we further explored the impact of tracker versions and search region sizes on performance. Stark-ST, which adds a dynamically changing template to capture temporal variations in targets, improves over Stark on both datasets. This demonstrates that effectively utilizing temporal information can help address changes in object size and appearance. OSTrack-384 [59] and ARTrack-384 [48] show notable improvements over OSTrack-256 and ARTrack-256 on Anti-UAV410, but the larger search region has limited impact on CST Anti-UAV. We attribute this to the tiny size of the targets, which occupy fewer pixels in the image, making it harder for the network to effectively leverage its semantic representation capabilities.

Table 4: mSA (%) of current trackers on the CST Anti-UAV test set when trained separately on the Anti-UAV410 and CST Anti-UAV datasets. Values in parentheses give the relative improvement over training on Anti-UAV410. Ten of the methods are shown. 410 denotes the Anti-UAV410 dataset, while CST refers to the CST Anti-UAV dataset.

Methods	Trained with 410, test on CST	Trained with CST, test on CST
GlobalTrack [27]	30.96	35.92(+16.02%)
OSTrack-256 [59]	21.81	26.34(+20.77%)
OSTrack-384 [59]	21.29	25.76(+21.00%)
SeqTrack [10]	20.36	26.53(+30.30%)
GRM [24]	21.64	26.96(+24.58%)
ROMTrack [6]	21.13	27.43(+29.82%)
AQATrack [55]	23.12	27.60(+19.38%)
ARTrackV2 [1]	19.14	22.14(+15.67%)
SiamDT [26]	30.95	35.84(+15.80%)
Refocus-TIR [31]	24.90	26.51(+6.47%)

Table 5: mSA (%) of baseline trackers on the CST Anti-UAV test set with different training datasets, evaluated across frame-level attributes, including six challenge attributes and four size categories (TS, SS, NS, and LS denote tiny, small, normal, and large size). Definitions of the challenge attributes can be found in Table 2. 410 denotes the Anti-UAV410 dataset, while CST refers to the CST Anti-UAV dataset. The top three are highlighted in red, blue, and green respectively, with bold text for emphasis.

Methods	Training dataset	OC	OV	SV	TC	DBC	CDB	TS	SS	NS	LS	ALL
KYS [4]	410	29.45	54.55	21.49	13.56	6.92	12.77	23.81	29.43	9.25	20.26	26.28
KYS [4]	CST	37.04	61.82	21.68	13.75	7.23	10.80	26.91	29.17	13.83	19.67	26.75
AiATrack [23]	410	30.37	8.52	20.40	14.32	6.56	14.56	13.26	26.78	42.74	11.27	22.37
AiATrack [23]	CST	37.42	16.64	25.79	18.49	12.15	21.24	17.23	32.23	39.21	14.91	26.83
ROMTrack [6]	410	25.63	8.99	16.85	12.81	7.68	17.69	11.58	25.08	31.05	14.01	21.13
ROMTrack [6]	CST	39.49	18.66	25.78	17.69	10.33	19.69	16.79	32.32	33.69	14.66	27.43
AQATrack [55]	410	30.13	8.56	20.13	13.60	10.16	18.01	15.24	27.65	31.65	12.64	23.12
AQATrack [55]	CST	33.76	16.24	24.00	17.75	10.59	19.48	20.37	32.07	26.82	14.65	27.60
Mixformer-V2 [14]	410	26.76	10.01	25.78	17.07	11.31	18.86	15.17	32.32	47.31	20.44	26.32
Mixformer-V2 [14]	CST	33.75	14.53	28.98	22.10	11.65	19.81	19.36	36.18	41.48	13.50	29.39
PrDiMP [17]	410	35.92	44.86	21.97	15.30	11.16	19.36	21.96	30.48	27.13	10.57	26.25
PrDiMP [17]	CST	51.25	57.26	27.86	16.99	11.50	21.81	28.20	33.45	29.85	12.94	29.76
SiamDT [26]	410	33.07	60.43	33.60	22.12	14.62	24.23	12.89	39.32	59.11	41.25	30.95
SiamDT [26]	CST	39.31	64.54	39.24	27.22	18.00	28.40	14.98	46.51	66.10	38.34	35.84
GlobalTrack [27]	410	32.86	57.71	31.36	20.38	12.77	23.67	13.81	39.12	59.73	20.64	30.96
GlobalTrack [27]	CST	44.53	65.77	37.62	26.63	17.26	26.95	15.83	46.14	59.15	34.52	35.92

4.3 Difficulty Level of CST Anti-UAV

We conduct an experiment to further demonstrate the difficulty of the CST Anti-UAV benchmark. We trained the 20 trackers on Anti-UAV410, the most diverse existing anti-UAV tracking dataset, and evaluated them on the CST Anti-UAV test set using the SA metric. The results of ten selected methods are shown in Table 4; results for all trackers are provided in the supplementary materials. We observe a severe performance degradation when training on the previous dataset. Existing datasets are inherently limited, lacking the diversity and complexity required to address the challenges posed by the CST Anti-UAV benchmark. We further quantified the gain from training on CST Anti-UAV as a relative percentage increase in mSA. The results show that 19 of the 22 tracker configurations listed in Table 3 improved by over 10%, with the largest gain reaching 30.3%. These results suggest that current datasets are insufficient to represent tiny UAV perception and underline the need for complex, diverse datasets like CST Anti-UAV to advance the field.
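The relative increases reported in Table 4 can be reproduced from its two mSA columns. The sketch below is only a worked check on the ten methods shown there; the dictionary and helper names are illustrative assumptions, and the full 22-configuration results are in the supplementary materials, not here.

```python
# mSA (%) on the CST Anti-UAV test set, copied from Table 4:
# (trained on Anti-UAV410, trained on CST Anti-UAV).
TABLE4 = {
    "GlobalTrack": (30.96, 35.92),
    "OSTrack-256": (21.81, 26.34),
    "OSTrack-384": (21.29, 25.76),
    "SeqTrack": (20.36, 26.53),
    "GRM": (21.64, 26.96),
    "ROMTrack": (21.13, 27.43),
    "AQATrack": (23.12, 27.60),
    "ARTrackV2": (19.14, 22.14),
    "SiamDT": (30.95, 35.84),
    "Refocus-TIR": (24.90, 26.51),
}


def relative_gain(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


gains = {name: relative_gain(b, a) for name, (b, a) in TABLE4.items()}
above_10 = [name for name, g in gains.items() if g > 10.0]
best = max(gains, key=gains.get)

if __name__ == "__main__":
    for name, g in gains.items():
        print(f"{name}: +{g:.2f}%")
    print(len(above_10), "of", len(gains), "exceed 10%; largest gain:", best)
```

Among the ten methods shown, nine exceed a 10% relative gain, with SeqTrack the largest at about 30.3% and Refocus-TIR the only one below the threshold.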

4.4 Coarse-to-fine Attribute Evaluation

To analyze the different challenges faced by trackers, all trackers were evaluated at both the sequence level and the frame level on six attributes and four object sizes of the CST Anti-UAV dataset, using state accuracy. We selected the results of eight trackers for display, with Table 5 showing the frame-level results. The sequence-level results and the complete frame-level results are provided in the supplementary materials.

Experimental results show that sequence-level evaluations fluctuate little compared to the significant variations seen at the frame level, because not all frames within a sequence contain challenging conditions. To illustrate the distinction between sequence-level and frame-level attribute evaluations, we quantified the deviations of both sequence-level and frame-level attribute-specific performance from the overall performance. As shown in Figure 4, the frame-level analysis exhibits more realistic variations. For attributes such as scale variation and small size, which span the entire test set, sequence-level performance is equivalent to the overall performance, and for occlusion and normal size it can even be misleading. Thus, frame-level attribute annotations provide valuable guidance and actionable insights for improving tracker performance.
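The deviation analysis can be illustrated on a single row of Table 5. The sketch below uses the frame-level mSA of GlobalTrack trained on CST Anti-UAV and measures each attribute against the overall score. Treating the deviation as a signed difference in percentage points is an assumption, since the exact formula behind Figure 4 is not stated here.

```python
# Frame-level mSA (%) of GlobalTrack trained on CST, copied from Table 5.
ATTRIBUTES = ["OC", "OV", "SV", "TC", "DBC", "CDB", "TS", "SS", "NS", "LS"]
GLOBALTRACK_CST = [44.53, 65.77, 37.62, 26.63, 17.26, 26.95,
                   15.83, 46.14, 59.15, 34.52]
OVERALL = 35.92  # the "ALL" column of the same row


def deviations(names, scores, overall):
    """Signed deviation of each attribute score from the overall score.

    A signed difference in percentage points is assumed for illustration.
    """
    return {name: score - overall for name, score in zip(names, scores)}


dev = deviations(ATTRIBUTES, GLOBALTRACK_CST, OVERALL)
hardest = min(dev, key=dev.get)
easiest = max(dev, key=dev.get)

if __name__ == "__main__":
    for name, d in sorted(dev.items(), key=lambda item: item[1]):
        print(f"{name}: {d:+.2f}")
    print("hardest:", hardest, "easiest:", easiest)
```

For this row, tiny-size frames fall furthest below the overall score and out-of-view frames rise furthest above it, a spread that a sequence-level average would largely flatten.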

Based on the frame-level results, we observed a significant degradation in tracking performance for the attributes DBC, TC, CDB, OV, and tiny-sized objects. This indicates that tracking in complex scenarios and with tiny-sized objects remains a challenge for current trackers, demonstrating the effectiveness of the proposed CDB attribute. Among these, the OV attribute exhibits a pronounced bimodal distribution in performance, with peak performance at 65.77% and minimum performance dropping to 14.53%. This highlights the critical importance of re-localizing the target when it reappears. Occluded objects do not experience significant positional changes when reappearing, making them easier to locate than out-of-view ones. It is worth mentioning that with the help of the CST Anti-UAV training set, all evaluated trackers exhibit substantial performance improvements in complex scenarios and tiny object tracking. When addressing normal and large objects, most trackers achieve performance comparable to, or only marginally lower than that of trackers trained with Anti-UAV410.

Concretely, GlobalTrack [27] and SiamDT [26] still perform well across various attributes and object scales except for tiny objects, likely due to their global search modules. While global search ensures strong target re-localization after the target disappears, the expanded search regions introduce more background noise, leading to degraded performance on tiny-sized objects. KYS [4] uses dense localized state vectors and a prediction module to propagate scene information across frames, enabling strong performance on out-of-view frames and tiny objects. PrDiMP [17] combines probabilistic regression and DiMP [3] to provide a probabilistic distribution prediction for the target's location, which allows precise localization after occlusions and adapts to various object scales. PrDiMP shows strong performance on challenges such as occlusions, out-of-view, and complex dynamic backgrounds, and achieves state-of-the-art results for tiny-sized objects.

4.5 Impact of Fewer Large-scale Objects

To further investigate the impact of having only a limited number of large-scale UAV objects, we evaluated trackers trained on CST Anti-UAV and trackers trained on Anti-UAV410 on the Anti-UAV410 test set, with the results shown in Figure 5. Notably, training on our dataset achieved comparable success rates and accuracy scores with only 837 large-scale objects and 11,208 normal-sized objects, far fewer training samples than Anti-UAV410's 196,735 large-scale objects and 97,942 normal-sized objects. This suggests that relatively few large, prominent objects are sufficient to achieve good performance and further validates the effectiveness of the CST Anti-UAV dataset.

5 Conclusion

In this paper, we introduce the CST Anti-UAV dataset, a large-scale TIR benchmark specifically designed for UAV tracking. It consists of 220 videos with over 240k annotated bounding boxes, captured across diverse environments, times of day, and weather conditions. Notably, CST Anti-UAV is the first UAV tracking dataset to incorporate fully manual frame-level attribute annotations, addressing a critical gap in current benchmarks. The principal contributions of the CST Anti-UAV dataset include its focus on tiny UAV targets, which are underrepresented in existing datasets, and the incorporation of multifaceted tracking scenarios spanning urban and wilderness environments, which supports robust generalization. To validate its effectiveness, we retrained 20 recent SOT methods on both the CST Anti-UAV dataset and the largest existing anti-UAV tracking dataset, then conducted extensive experiments. The results show that the performance of these methods degrades significantly on our proposed dataset, and even the SOTA method struggles with tiny and inconspicuous objects in complex and dynamic backgrounds, highlighting the importance of CST Anti-UAV for improving current UAV tracking methods. Overall, we believe the CST Anti-UAV benchmark will inspire the development of more robust UAV tracking methods and accelerate the deployment of reliable vision-based anti-UAV systems in the real world.

Acknowledgements. This work was supported in part by the National Natural Science Foundation of China (62222206, U2441241, 62401243, and 62272209), and the Jiangxi Province Natural Science Foundation (20242BAB212003).

References

Bai et al. [2024] Yifan Bai, Zeyang Zhao, Yihong Gong, and Xing Wei. Artrackv2: Prompting autoregressive tracker where to look and how to describe. In CVPR, 2024.
Bertinetto et al. [2016] Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully-convolutional siamese networks for object tracking. In ECCV, pages 850–865, 2016.
Bhat et al. [2019] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In ICCV, pages 6182–6191, 2019.
Bhat et al. [2020] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Know your surroundings: Exploiting scene information for object tracking. In ECCV, 2020.
Cai et al. [2024] Wenrui Cai, Qingjie Liu, and Yunhong Wang. Hiptrack: Visual tracking with historical prompts. In CVPR, 2024.
Cai et al. [2023] Yidong Cai, Jie Liu, Jie Tang, and Gangshan Wu. Robust object modeling for visual tracking. In ICCV, pages 9589–9600, 2023.
Cao et al. [2021] Ziang Cao, Changhong Fu, Junjie Ye, Bowen Li, and Yiming Li. Hift: Hierarchical feature transformer for aerial tracking. In ICCV, 2021.
Chen et al. [2022] Boyu Chen, Peixia Li, Lei Bai, Lei Qiao, Qiuhong Shen, Bo Li, Weihao Gan, Wei Wu, and Wanli Ouyang. Backbone is all your need: A simplified architecture for visual object tracking. In ECCV, 2022.
Chen et al. [2021] Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In CVPR, 2021.
Chen et al. [2023] Xin Chen, Houwen Peng, Dong Wang, Huchuan Lu, and Han Hu. Seqtrack: Sequence to sequence learning for visual object tracking. In CVPR, 2023.
Chen et al. [2020] Zedu Chen, Bineng Zhong, Guorong Li, Shengping Zhang, and Rongrong Ji. Siamese box adaptive network for visual tracking. In CVPR, pages 6668–6677, 2020.
Cheng et al. [2017] Hui Cheng, Lishan Lin, Zhuoqi Zheng, Yuwei Guan, and Zhongchang Liu. An autonomous vision-based target tracking system for rotorcraft unmanned aerial vehicles. In IROS, 2017.
Cui et al. [2022] Yutao Cui, Cheng Jiang, Limin Wang, and Gangshan Wu. Mixformer: End-to-end tracking with iterative mixed attention. In CVPR, pages 13598–13608, 2022.
Cui et al. [2023] Yutao Cui, Tianhui Song, Gangshan Wu, and Limin Wang. Mixformerv2: Efficient fully transformer tracking. In NIPS, 2023.
Danelljan et al. [2017] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Eco: Efficient convolution operators for tracking. In CVPR, 2017.
Danelljan et al. [2019] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Atom: Accurate tracking by overlap maximization. In CVPR, 2019.
Danelljan et al. [2020] Martin Danelljan, Luc Van Gool, and Radu Timofte. Probabilistic regression for visual tracking. In CVPR, pages 7183–7192, 2020.
Du et al. [2020] Fei Du, Peng Liu, Wei Zhao, and Xianglong Tang. Correlation-guided attention for corner detection based visual tracking. In CVPR, 2020.
Fan et al. [2019] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In CVPR, pages 5369–5378, 2019.
Felsberg et al. [2015] Michael Felsberg, Amanda Berg, Gustav Hager, Jorgen Ahlberg, Matej Kristan, Jiri Matas, Ales Leonardis, Luka Cehovin, Gustavo Fernandez, Tomas Vojir, Georg Nebehay, et al. The thermal infrared visual object tracking vot-tir2015 challenge results. In ICCVW, 2015.
Felsberg et al. [2016] Michael Felsberg, Matej Kristan, Jiři Matas, Aleš Leonardis, Roman Pflugfelder, Gustav Häger, Amanda Berg, Abdelrahman Eldesokey, Jörgen Ahlberg, et al. The thermal infrared visual object tracking vot-tir2016 challenge results. In ECCVW, pages 824–849, 2016.
Fu et al. [2021] Zhihong Fu, Qingjie Liu, Zehua Fu, and Yunhong Wang. Stmtrack: Template-free visual tracking with space-time memory networks. In CVPR, pages 13774–13783, 2021.
Gao et al. [2022] Shenyuan Gao, Chunluan Zhou, Chao Ma, Xinggang Wang, and Junsong Yuan. Aiatrack: Attention in attention for transformer visual tracking. In ECCV, pages 146–164. Springer, 2022.
Gao et al. [2023] Shenyuan Gao, Chunluan Zhou, and Jun Zhang. Generalized relation modeling for transformer tracking. In CVPR, pages 18686–18695, 2023.
Han et al. [2025] Xiaoling Han, Bin Lin, Zhenyu Na, Bowen Li, Chaoyue Zhang, and Ran Zhang. Spatial crowdsourcing-based task allocation for uav-assisted maritime data collection. TVT, 74(2):3375–3388, 2025.
Huang et al. [2023] Bo Huang, Jianan Li, Junjie Chen, Gang Wang, Jian Zhao, and Tingfa Xu. Anti-uav410: A thermal infrared benchmark and customized scheme for tracking drones in the wild. PAMI, 2023.
Huang et al. [2020] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Globaltrack: A simple and strong baseline for long-term tracking. In AAAI, pages 11037–11044, 2020.
Huang et al. [2021] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. PAMI, 43(5):1562–1577, 2021.
Jiang et al. [2023] Nan Jiang, Kuiran Wang, Xiaoke Peng, Xuehui Yu, Qiang Wang, Junliang Xing, Guorong Li, Guodong Guo, Qixiang Ye, Jianbin Jiao, Jian Zhao, and Zhenjun Han. Anti-uav: A large-scale benchmark for vision-based uav tracking. TMM, 25:486–500, 2023.
Kumar et al. [2025] V. D. Ambeth Kumar, Venkatesan Ramachandran, Mamoon Rashid, Abdul Rehman Javed, Shayla Islam, and Abdullah Al Hejaili. An intelligent traffic monitoring system in congested regions with prioritization for emergency vehicle using uav networks. Tsinghua Science and Technology, 30(4):1387–1400, 2025.
Lai et al. [2024] Simiao Lai, Chang Liu, Dong Wang, and Huchuan Lu. Refocus the attention for parameter-efficient thermal infrared object tracking. TNNLS, pages 1–12, 2024.
Li et al. [2019] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In CVPR, 2019.
Liang et al. [2015] Pengpeng Liang, Erik Blasch, and Haibin Ling. Encoding color information for visual tracking: Algorithms and benchmark. TIP, pages 5630–5644, 2015.
Lin et al. [2022] Liting Lin, Heng Fan, Yong Xu, and Haibin Ling. Swintrack: A simple and strong baseline for transformer tracking. In NIPS, 2022.
Liu et al. [2020a] Qiao Liu, Zhenyu He, Xin Li, and Yuan Zheng. Ptb-tir: A thermal infrared pedestrian tracking benchmark. TMM, pages 666–675, 2020a.
Liu et al. [2020b] Qiao Liu, Xin Li, Zhenyu He, Chenglong Li, Jun Li, Zikun Zhou, Di Yuan, Jing Li, Kai Yang, Nana Fan, and Feng Zheng. Lsotb-tir: A large-scale high-diversity thermal infrared object tracking benchmark. In ACMMM, 2020b.
Mayer et al. [2021] Christoph Mayer, Martin Danelljan, Danda Pani Paudel, and Luc Van Gool. Learning target candidate association to keep track of what not to track. In ICCV, 2021.
Mayer et al. [2022] Christoph Mayer, Martin Danelljan, Goutam Bhat, Matthieu Paul, Danda Pani Paudel, Fisher Yu, and Luc Van Gool. Transforming model prediction for tracking. In CVPR, 2022.
Müller et al. [2018] Matthias Müller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In ECCV, pages 310–327, 2018.
Nam and Han [2016] Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
Park and Berg [2018] Eunbyung Park and Alexander C. Berg. Meta-tracker: Fast and robust online adaptation for visual object trackers. In ECCV, pages 587–604, 2018.
Portmann et al. [2014] Jan Portmann, Simon Lynen, Margarita Chli, and Roland Siegwart. People detection and tracking from aerial thermal views. In ICRA, 2014.
Real et al. [2017] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. arXiv, 2017.
Shi et al. [2018] Xiufang Shi, Chaoqun Yang, Weige Xie, Chao Liang, Zhiguo Shi, and Jiming Chen. Anti-drone system with multiple surveillance technologies: Architecture, implementation, and challenges. IEEE Communications Magazine, 56(4):68–74, 2018.
Smeulders et al. [2014] Arnold W. M. Smeulders, Dung M. Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah. Visual tracking: An experimental survey. PAMI, 36(7):1442–1468, 2014.
Song et al. [2022] Zikai Song, Junqing Yu, Yi-Ping Phoebe Chen, and Wei Yang. Transformer tracking with cyclic shifting window attention. In CVPR, 2022.
Wang et al. [2019] Ning Wang, Yibing Song, Chao Ma, Wengang Zhou, Wei Liu, and Houqiang Li. Unsupervised deep tracking. In CVPR, 2019.
Wei et al. [2023] Xing Wei, Yifan Bai, Yongchao Zheng, Dahu Shi, and Yihong Gong. Autoregressive visual tracking. In CVPR, pages 9697–9706, 2023.
Wu et al. [2023] Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, and Antoni B. Chan. Dropmae: Masked autoencoders with spatial-attention dropout for tracking tasks. In CVPR, 2023.
Wu et al. [2013] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In CVPR, 2013.
Wu et al. [2015] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. PAMI, pages 1834–1848, 2015.
Wu et al. [2014] Zheng Wu, Nathan Fuller, Diane Theriault, and Margrit Betke. A thermal infrared video benchmark for visual analysis. In CVPRW, 2014.
Xie et al. [2022] Fei Xie, Chunyu Wang, Guangting Wang, Yue Cao, Wankou Yang, and Wenjun Zeng. Correlation-aware deep tracking. In CVPR, 2022.
Xie et al. [2024a] Fei Xie, Wankou Yang, Chunyu Wang, Lei Chu, Yue Cao, Chao Ma, and Wenjun Zeng. Correlation-embedded transformer tracking: A single-branch framework. PAMI, 2024a.
Xie et al. [2024b] Jinxia Xie, Bineng Zhong, Zhiyi Mo, Shengping Zhang, Liangtao Shi, Shuxiang Song, and Rongrong Ji. Autoregressive queries for adaptive tracking with spatio-temporal transformers. In CVPR, pages 19300–19309, 2024b.
Xu et al. [2020] Yinda Xu, Zeyu Wang, Zuoxin Li, Ye Yuan, and Gang Yu. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. In AAAI, pages 12549–12556, 2020.
Yan et al. [2021a] Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. In ICCV, 2021a.
Yan et al. [2021b] Bin Yan, Xinyu Zhang, Dong Wang, Huchuan Lu, and Xiaoyun Yang. Alpha-refine: Boosting tracking performance by precise bounding box estimation. In CVPR, 2021b.
Ye et al. [2022] Botao Ye, Hong Chang, Bingpeng Ma, Shiguang Shan, and Xilin Chen. Joint feature learning and relation modeling for tracking: A one-stream framework. In ECCV, 2022.
Yu et al. [2020] Yuechen Yu, Yilei Xiong, Weilin Huang, and Matthew R. Scott. Deformable siamese attention networks for visual object tracking. In CVPR, 2020.
Zhang et al. [2020] Zhipeng Zhang, Houwen Peng, Jianlong Fu, Bing Li, and Weiming Hu. Ocean: Object-aware anchor-free tracking. In ECCV, 2020.
Zhang et al. [2025] Zhen Zhang, Lehao Huang, Qingwang Wang, Linhuan Jiang, Yemao Qi, Shunyuan Wang, Tao Shen, Bo-Hui Tang, and Yanfeng Gu. Uav hyperspectral remote sensing image classification: A systematic review. JSTARS, 18:3099–3124, 2025.
Zhao et al. [2022] Jie Zhao, Jingshu Zhang, Dongdong Li, and Dong Wang. Dut-antiuav. TITS, 2022.