Visual object tracking involves estimating the trajectory of an object in a video sequence and has real-world applications in fields such as autonomous vehicles, robotics, and human-computer interaction. While today’s ever-deeper and more complex neural networks have advanced the state-of-the-art in visual object tracking, relatively little work has focused on improving the efficiency of tracking architectures.
In the new paper Efficient Visual Tracking with Exemplar Transformers, an ETH Zurich research team proposes Exemplar Transformers, a novel efficient transformer layer for real-time visual object tracking that’s up to 8x faster than other transformer-based models.

The team summarizes their main contributions as:
- We introduce an efficient transformer layer based on the use of a novel Exemplar Attention.
- We incorporate the proposed transformer layer into a Siamese-based tracking architecture and, thereby, significantly increase robustness with negligible effect on run-time.
- We present the first transformer-based tracking architecture that is capable of running in real-time on a CPU.
The proposed Exemplar Transformer is built on two hypotheses: 1) a small set of exemplar values can act as a shared memory between the samples of the dataset; 2) a query representation coarser than that of the input is sufficiently descriptive to make use of these exemplar representations.

The researchers propose Exemplar Attention, inspired by a generalization of the standard scaled dot-product attention, as the key building block of the proposed Exemplar Transformer layer. While the original transformer’s self-attention scales quadratically with the image size or input sequence length, the team redesigned their module’s operands based on the aforementioned hypotheses to reduce the number of feature vectors involved and achieve significant speedups.
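The idea can be illustrated with a minimal PyTorch sketch. This is not the authors’ exact layer: the exemplar count (`num_exemplars`), the average-pooling used to form the coarse query, and the residual broadcast back onto the feature map are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExemplarAttentionSketch(nn.Module):
    """Simplified illustration of exemplar attention (not the paper's exact layer).

    A small, learned set of `num_exemplars` key/value pairs is shared across all
    inputs ("shared memory"), and the query is a single coarse vector obtained by
    average-pooling the input feature map.
    """
    def __init__(self, dim: int, num_exemplars: int = 4):
        super().__init__()
        self.scale = dim ** -0.5
        # Learned exemplar keys/values shared by every sample in the dataset.
        self.exemplar_keys = nn.Parameter(torch.randn(num_exemplars, dim))
        self.exemplar_values = nn.Parameter(torch.randn(num_exemplars, dim))
        self.to_query = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the tracker backbone.
        b, c, h, w = x.shape
        # Coarse query: one vector per image instead of one per spatial location.
        q = self.to_query(x.mean(dim=(2, 3)))                                # (B, C)
        attn = F.softmax(q @ self.exemplar_keys.t() * self.scale, dim=-1)    # (B, S)
        out = self.proj(attn @ self.exemplar_values)                         # (B, C)
        # Broadcast the aggregated exemplar information back over the feature map.
        return x + out.view(b, c, 1, 1)
```

Because the query attends to only a handful of shared exemplars rather than to every spatial location, the attention cost grows linearly with the number of pixels instead of quadratically.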

The researchers incorporated their proposed transformer layer into a Siamese-based tracking architecture, E.T.Track, replacing the convolutional layers in the tracker head with the Exemplar Transformer layer. The increased expressive power of the Exemplar Transformer layer resulted in significant improvements in performance and robustness with a negligible effect on run-time.
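A hedged sketch of how such a layer could slot into a Siamese tracker head follows. The two-branch structure, layer count, and output channels are assumptions rather than the paper’s exact configuration, and `ExemplarAttentionSketch` refers to the illustrative layer defined above.

```python
import torch
import torch.nn as nn

class ExemplarTrackerHeadSketch(nn.Module):
    """Illustrative tracker head: exemplar-attention blocks replace the usual
    stack of convolutions, with one branch for classification and one for
    bounding-box regression. Reuses ExemplarAttentionSketch from the sketch above.
    """
    def __init__(self, dim: int = 256, num_layers: int = 2):
        super().__init__()
        self.cls_branch = nn.Sequential(
            *[ExemplarAttentionSketch(dim) for _ in range(num_layers)])
        self.reg_branch = nn.Sequential(
            *[ExemplarAttentionSketch(dim) for _ in range(num_layers)])
        self.cls_out = nn.Conv2d(dim, 1, kernel_size=1)   # foreground score map
        self.reg_out = nn.Conv2d(dim, 4, kernel_size=1)   # box offsets per location

    def forward(self, fused_features: torch.Tensor):
        # fused_features: (B, C, H, W) correlation of template and search-region features.
        cls = self.cls_out(self.cls_branch(fused_features))
        reg = self.reg_out(self.reg_branch(fused_features))
        return cls, reg
```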


The researchers compared E.T.Track with current state-of-the-art methods on six benchmark datasets: OTB-100, NFS, UAV-123, LaSOT, TrackingNet, and VOT2020. The proposed model achieved impressive performance in the evaluations, reaching an AUC score of 59.1 percent on LaSOT (a 2.2 percent gain over the popular DiMP tracker) and outperforming the mobile version of LightTrack by 3.7 percent. E.T.Track also trailed the complex transformer-based tracker TrSiam by only 2.2 percent in precision, 2.32 percent in normalized precision, and 3.12 percent in AUC, while running almost 8x faster on a CPU.
The study shows that the proposed Exemplar Attention method can produce remarkable speedups and a significant decrease in cost, while the Exemplar Transformer layers can significantly improve the robustness of visual tracking models. The team believes E.T.Track is also the first transformer-based tracking architecture capable of running in real-time on computationally limited devices such as standard CPUs.
The paper Efficient Visual Tracking with Exemplar Transformers is on arXiv.
Author: Hecate He | Editor: Michael Sarazen

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.