Large training datasets have become integral to the field of machine learning, serving as the bedrock for recent breakthroughs in language modeling and multimodal learning. Despite their pivotal role, these datasets are seldom the focal point of active research, with many large-scale training sets remaining unreleased. This lack of accessibility hinders consistent dataset evaluation and the establishment of reproducible frameworks.
In a new paper Data Filtering Networks, a research team from Apple and the University of Washington introduces the concept of data filtering networks (DFNs): neural networks designed specifically to filter data, which can efficiently construct large, high-quality pre-training datasets. Notably, DFNs can be trained from scratch and improved with the same techniques used for standard machine learning models.

The work focuses on dataset filtering, assuming access to a large uncurated pool of data. The research team highlights three key contributions:
- Characterizing the properties of data filtering networks that lead to high-quality datasets. The team investigates how different properties of DFNs affect the quality of the induced training data and finds that a small contrastive image-text model trained solely on high-quality data can construct state-of-the-art datasets.
- Utilizing these properties to train DFNs and construct datasets that induce Contrastive Language-Image Pre-training (CLIP) models with superior accuracy and a more favorable compute-accuracy tradeoff than existing datasets in the literature.
- Offering insights that serve as a recipe to construct high-quality datasets from scratch using only public data, contributing to the democratization of large, high-quality datasets.

The ultimate goal of this research is to develop efficient functions capable of filtering potentially trillions of examples. The team refers to the dataset constructed by filtering a given pool with a DFN as the induced dataset, and the model trained exclusively on that dataset as the induced model. Employing a CLIP model as the DFN, the team measures its filtering performance by evaluating the induced model on standard benchmarks.
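To make the filtering step concrete, here is a minimal sketch of how a contrastive image-text model can act as a DFN by scoring image-caption pairs and keeping only those above a threshold. It uses the open-source open_clip library; the specific checkpoint, threshold value, and data-loading helpers are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of using a contrastive image-text model as a data filtering
# network (DFN): score each image-caption pair by cosine similarity and keep
# only pairs above a threshold. The checkpoint, threshold, and data-loading
# helpers are illustrative assumptions, not the authors' exact implementation.
import torch
import open_clip

# Any CLIP-style model can serve as the DFN; this particular checkpoint is an assumption.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def dfn_score(images: torch.Tensor, captions: list[str]) -> torch.Tensor:
    """Cosine similarity between image and caption embeddings; higher means keep."""
    image_features = model.encode_image(images)  # images preprocessed with `preprocess`
    text_features = model.encode_text(tokenizer(captions))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return (image_features * text_features).sum(dim=-1)

def filter_pool(batches, threshold=0.3):
    """Yield only the image-text pairs whose DFN score clears the (assumed) threshold."""
    for images, captions in batches:  # `batches` iterates over the uncurated pool
        keep = dfn_score(images, captions) > threshold
        yield [(img, cap) for img, cap, k in zip(images, captions, keep) if k]
```

The pairs that survive this filter form the induced dataset, on which the induced model is then trained and evaluated.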
To strengthen the DFN, the team first trains a CLIP model on a high-quality dataset and then fine-tunes the filtering network on additional data, applying standard machine learning techniques such as data augmentation, different initializations, and extended training with a larger batch size.
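As a rough illustration of that refinement recipe, the sketch below gathers the reported ingredients (additional fine-tuning data, augmentation, initialization, longer training, larger batch size) into a single configuration; every concrete name and value is a placeholder, not the paper's actual hyperparameters.

```python
# Hypothetical fine-tuning configuration for the DFN itself. The keys mirror the
# techniques described above; every concrete value is an assumption.
dfn_finetune_config = {
    "init_checkpoint": "clip_trained_on_high_quality_data.pt",  # start from the base DFN
    "finetune_datasets": ["extra_curated_set_a", "extra_curated_set_b"],  # additional data
    "augmentation": ["random_resized_crop", "horizontal_flip"],  # standard image augmentation
    "batch_size": 16384,      # larger batch than the initial run (placeholder value)
    "training_steps": 20000,  # extended training (placeholder value)
}
```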

In their empirical study, the team demonstrates that the best-performing dataset, DFN-5B, enables training state-of-the-art CLIP models within fixed compute budgets. Among other results, a ViT-H model trained on this dataset achieves an impressive 84.4% zero-shot transfer accuracy on ImageNet.
In summary, this innovative research on data filtering networks opens new avenues for the efficient creation of high-quality datasets from public data, contributing significantly to the democratization of large-scale datasets and advancing the capabilities of machine learning models.
The paper Data Filtering Networks is available on arXiv.
Author: Hecate He | Editor: Chain Zhang
