Neural network-based approaches have made significant progress on video compression over the last several years, reaching performance on par with classical codec-based methods. These neural approaches are challenging to implement, however, as they tend to require complex hand-crafted connections between their many sub-components, and they struggle when the input data does not match their architectural biases and priors.
In the new paper VCT: A Video Compression Transformer, a Google Research team presents an “elegantly simple” but powerful video compression transformer (VCT) that eliminates the architectural biases and priors of previous approaches (such as motion prediction and warping operations) and instead learns entirely from data, without any hand-crafting. VCT is easy to implement and outperforms existing video compression methods on standard datasets.
The proposed VCT is based on the original language translation transformer (Vaswani et al., 2017) and is tasked with translating the previous two frames of a video input into the current frame. It first uses lossy transform coding to project frames from the image space to quantized representations. A transformer then leverages temporal redundancies to model the representation distributions. These predicted distributions are then used to compress the quantized representations via entropy coding.
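The link between the predicted distributions and the compressed size can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `ideal_bits`, the four-symbol alphabet and the probability tables are hypothetical, and the calculation uses the standard information-theoretic bound that an ideal entropy coder spends about -log2(p) bits on a symbol assigned probability p.

```python
import math


def ideal_bits(symbols, predicted_pmfs):
    """Return the ideal entropy-coded size in bits.

    Each symbol costs -log2(p), where p is the probability that the
    model assigned to that symbol before it was coded.
    """
    total = 0.0
    for symbol, pmf in zip(symbols, predicted_pmfs):
        total += -math.log2(pmf[symbol])
    return total


# Hypothetical toy data: four quantized symbols from the alphabet {0, 1, 2, 3}.
symbols = [2, 2, 1, 2]

# A model with no context assigns equal probability to every symbol value.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 4

# A model that exploits temporal redundancy (assumed values) puts most of
# its probability mass on the symbol that actually occurs.
confident = [
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
]

uniform_bits = ideal_bits(symbols, uniform)
confident_bits = ideal_bits(symbols, confident)

print(f"uniform model:   {uniform_bits:.2f} bits")
print(f"confident model: {confident_bits:.2f} bits")
```

In this toy setting the sharper predictions need far fewer bits for the same symbols, which is why better modelling of the representation distributions translates directly into a lower bitrate.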
In their empirical studies, the team trained VCT on one million Internet video clips and compared it to video compression approaches such as the classical HEVC (High-Efficiency Video Coding) and neural methods such as SSF (Scale-Space Flow, Agustsson et al., 2020) and ELF-VC (Efficient Learned Flexible-Rate Video Coding, Rippel et al., 2021). The evaluations were conducted on the MCL-JCV and UVG benchmark datasets, with PSNR (peak signal-to-noise ratio) and MS-SSIM (multi-scale structural similarity index measure) as quality metrics.
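For readers unfamiliar with PSNR, the sketch below applies the standard definition, 10 * log10(MAX^2 / MSE), to a flat list of pixel values. The function name `psnr`, the 8-bit peak value of 255 and the sample pixels are assumptions for illustration; the article does not describe how the paper aggregates scores across frames or colour channels.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(distorted):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error between the original and the reconstruction.
    mse = sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no distortion at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 8-bit pixel values: every pixel is off by exactly 1.
original = [100, 100, 100, 100]
decoded = [101, 99, 101, 99]

score = psnr(original, decoded)
print(f"PSNR: {score:.2f} dB")
```

A higher PSNR means the decoded frame is closer to the original, so at equal bitrate the codec with the higher score is preferred under this metric.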
Despite its simplicity and its lack of flow prediction, warping or residual compensation, VCT surpassed all of the other methods on both PSNR and MS-SSIM in the evaluations. Moreover, experiments on synthetic data showed that VCT can also learn to handle complex motion patterns such as panning, blurring and fading, purely from data.
The team says VCT can reduce bandwidth requirements for video conferencing and streaming and enable better utilization of storage space, and hopes it can serve as a foundation for a new generation of video codecs.
Author: Hecate He | Editor: Michael Sarazen