
ImageNet-1K Compressed 20x with Exceptional 60.8% Accuracy by MBZUAI & CMU’s Data Condensation Method

In recent years, dataset compression and distillation approaches have garnered widespread attention. By compressing large-scale datasets into compact, representative subsets, these methods enable rapid model training, efficient data storage, and the preservation of vital information from the original dataset. However, due to their significant computational overhead, existing solutions excel mainly at compressing small, low-resolution datasets.

In a new paper Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective, a research team from Mohamed bin Zayed University of AI and Carnegie Mellon University presents Squeeze, Recover and Relabel (SRe^2L), a new dataset condensation framework capable of distilling large-scale, high-resolution datasets. Remarkably, it condenses the 1.2 million training samples of ImageNet-1K into a dataset 20 times smaller while still achieving 60.8% Top-1 accuracy.

The team summarizes their main contributions as follows:

  1. We propose a new framework for large-scale dataset condensation, which involves a three-stage learning procedure of squeezing, recovery, and relabeling.
  2. We conduct a thorough ablation study and analysis, encompassing the impacts of diverse data augmentations for original data compression, various regularization terms for data recovery, and diverse teacher alternatives for relabeling on the condensed dataset.
  3. To the best of our knowledge, this is the first work to condense the full ImageNet-1K dataset at the standard resolution of 224×224 using widely accessible NVIDIA GPUs such as the 3090, 4090, or A100 series.

The main challenge of dataset distillation lies in designing a generation algorithm that can effectively produce the required samples while ensuring the generated samples contain or retain the key information of the original dataset. Existing approaches scale poorly to large datasets because of their considerable computational and memory overhead, which makes it difficult to preserve the needed information.

To address these issues, the proposed SRe^2L decouples the bilevel optimization of the model and the synthetic data during training, making the process of extracting information from the original data independent of the data generation process. This approach not only avoids the need for additional memory but also prevents biases in the original data from affecting the generated data, since the two are no longer processed simultaneously.
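To make the decoupling concrete, the toy PyTorch sketch below (our own illustration, not the authors' released code) first fits a model to the original data and then freezes it, so the synthetic samples are subsequently optimized against the fixed model without the two ever sharing a training loop. The tiny linear network, random data, and step counts are purely illustrative assumptions; the actual method operates on 224×224 ImageNet images with a much richer recovery objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "original" dataset: 256 samples, 10 features, 4 classes (illustrative only).
x_real = torch.randn(256, 10)
y_real = torch.randint(0, 4, (256,))

# Stage 1 (Squeeze): fit a model to the original data once.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 4))
opt_model = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt_model.zero_grad()
    nn.functional.cross_entropy(model(x_real), y_real).backward()
    opt_model.step()

# Stage 2 (Recover): freeze the model and optimize the synthetic data only.
# The original data is no longer needed here, so the memory footprint stays small.
for p in model.parameters():
    p.requires_grad_(False)

x_syn = torch.randn(8, 10, requires_grad=True)  # 2 synthetic samples per class
y_syn = torch.arange(4).repeat(2)
opt_data = torch.optim.Adam([x_syn], lr=0.1)
for _ in range(300):
    opt_data.zero_grad()
    nn.functional.cross_entropy(model(x_syn), y_syn).backward()
    opt_data.step()
```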

Specifically, the framework comprises the Squeeze, Recover, and Relabel stages: 1) the team first trains a model to capture the crucial information of the original dataset; 2) they then perform a recovery process to synthesize the target data; 3) finally, they relabel the synthetic data so that its labels reflect the true content of the data.
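Continuing the toy snippet above (and reusing its `model` and `x_syn`), the sketch below illustrates the relabeling idea: the frozen teacher assigns soft labels to the recovered synthetic samples, and a fresh student is then trained on those (sample, soft label) pairs. The temperature and optimizer settings are assumptions, and the paper's full relabeling pipeline is richer than this, but the core step of replacing hard labels with teacher predictions is the same.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 4.0  # distillation temperature (illustrative choice)

# Stage 3 (Relabel): let the frozen teacher assign soft labels to the synthetic data.
with torch.no_grad():
    soft_labels = F.softmax(model(x_syn.detach()) / T, dim=1)

# Train a fresh student only on the condensed (sample, soft label) pairs.
student = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 4))
opt = torch.optim.Adam(student.parameters(), lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    log_probs = F.log_softmax(student(x_syn.detach()) / T, dim=1)
    loss = F.kl_div(log_probs, soft_labels, reduction="batchmean") * T * T
    loss.backward()
    opt.step()
```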

In their empirical study, the researchers conducted extensive data condensation experiments on the Tiny-ImageNet and ImageNet-1K datasets. SRe^2L achieves the highest accuracies of 42.5% and 60.8% on full Tiny-ImageNet and ImageNet-1K respectively, with reasonable training time and memory cost, outperforming all previous state-of-the-art results by large margins of 14.5% and 32.9%.

The code is available on the project's GitHub. The paper Squeeze, Recover and Relabel: Dataset Condensation at ImageNet Scale From A New Perspective is on arXiv.


Author: Hecate He | Editor: Chain Zhang


We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
