The English poet Alexander Pope famously wrote, “A little learning is a dangerous thing,” a pithy reminder that our initial understanding of things tends to be incomplete, and that further learning is required before informed decisions are possible. Three hundred years later, the same could be said of deep learning models.
Large-scale training data is crucial for building good deep learning models: the more data they get, the better they perform. Classical deep learning training frameworks assume this training data arrives all at once. In practice, however, data is usually streamed to the learner one batch at a time, a scenario that creates a natural trade-off between a model’s accuracy and the time required to train it. This trade-off is emblematic of anytime learning, a setting in which a learner must provide good predictions at any point in time while also improving its performance as more and more data becomes available.
In the paper On Anytime Learning at Macroscale, a research team from Facebook AI Research and Mila – McGill University explores this accuracy-versus-time trade-off of anytime learning, which they term Anytime Learning at Macroscale (ALMA). The team notes that an eager model can produce non-trivial predictions by training on data batches as soon as they become available, but a model that patiently aggregates batches into a larger dataset will deliver improved accuracy. They conduct empirical evaluations on various models to gain insights into how to strike different trade-offs between accuracy and time and obtain the best learner.
The researchers summarize their study’s key contributions as:
- Formalize the ALMA problem and introduce metrics to evaluate learners.
- Conduct empirical evaluations of various models that strike different trade-offs between accuracy and time to obtain a useful predictor.
In the anytime learning at macroscale (ALMA) setting, data is assumed to be presented to the learner as a stream of consecutive batches of examples. The researchers also assume that the data arrival rate is slower than the model’s processing time, enabling the model to iterate over the data a number of times to improve its performance.
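This streaming protocol can be made concrete with a minimal sketch. The names and the toy task below are illustrative, not from the paper’s code: data arrives as a stream of large batches, and because arrival is slower than processing, the learner may take several passes over each batch before the next one arrives.

```python
# Minimal sketch of the ALMA data protocol (illustrative, not the
# paper's implementation): a stream of consecutive batches, with a
# learner that updates on each batch as soon as it arrives.
import numpy as np

rng = np.random.default_rng(0)

def make_stream(num_batches=5, size=100):
    """Yield consecutive batches of a simple 2-D classification task."""
    for _ in range(num_batches):
        x = rng.normal(size=(size, 2))
        y = (x.sum(axis=1) > 0).astype(int)
        yield x, y

class EagerLearner:
    """Logistic-regression learner that updates on every incoming batch."""
    def __init__(self):
        self.w = np.zeros(2)

    def fit(self, x, y, passes=3, lr=0.1):
        # Multiple passes over the same batch, as the ALMA setting allows,
        # since data arrival is assumed slower than processing.
        for _ in range(passes):
            pred = 1.0 / (1.0 + np.exp(-x @ self.w))
            self.w -= lr * x.T @ (pred - y) / len(y)

    def accuracy(self, x, y):
        return float(((x @ self.w > 0).astype(int) == y).mean())

learner = EagerLearner()
accuracies = []
for x, y in make_stream():
    learner.fit(x, y)                  # update as soon as the batch arrives
    accuracies.append(learner.accuracy(x, y))
```

A “patient” learner would instead buffer several batches and train once on the aggregated data, trading early availability for final accuracy.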
The team evaluated learners in the ALMA setting across three axes: accuracy, memory and computation. By measuring these quantities against time via the area under the curve, they can not only measure the model’s final performance but also the whole training trajectory over the sequence of large data batches.
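The area-under-the-curve idea can be sketched with invented numbers: integrating error against time rewards learners that become accurate early, not only those with the best final error. The two trajectories below are made up for illustration.

```python
# Illustrative sketch of trajectory evaluation via area under the
# error-vs-time curve (lower is better). The trajectories are invented.
import numpy as np

def area_under_curve(times, errors):
    """Trapezoidal area under an error-vs-time curve."""
    dt = np.diff(times)
    mid = (errors[:-1] + errors[1:]) / 2.0
    return float(np.sum(dt * mid))

times = np.array([1.0, 2.0, 3.0, 4.0])        # e.g. hours of training
eager = np.array([0.40, 0.30, 0.25, 0.22])    # trains on each batch at once
patient = np.array([0.90, 0.90, 0.90, 0.18])  # waits, then trains on all data

eager_auc = area_under_curve(times, eager)      # low: accurate early on
patient_auc = area_under_curve(times, patient)  # high, despite best final error
```

Here the patient learner ends with the lowest error, yet its area under the curve is far worse, which is exactly the trade-off the metric is designed to expose.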
The learning algorithms tested in the ALMA setting are Mixture of Experts (MoE) and Growing MoE (gMoE). MoE refers to methods in which multiple experts (learners) divide the problem space into homogeneous regions, while gMoE is a simple extension of MoE with the added ability to grow capacity over time by adding one expert at each layer.
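To make the MoE/gMoE distinction concrete, here is a toy NumPy sketch, not the paper’s architecture: a softmax gate weights the outputs of several small linear experts, and “growing” simply appends a new expert (with a matching gate column).

```python
# Toy mixture-of-experts layer (illustrative only): a softmax gate
# mixes the outputs of linear experts; grow() adds one expert,
# mimicking the gMoE idea of growing capacity over time.
import numpy as np

rng = np.random.default_rng(0)

class MoELayer:
    def __init__(self, in_dim, out_dim, num_experts=2):
        self.in_dim, self.out_dim = in_dim, out_dim
        self.experts = [rng.normal(size=(in_dim, out_dim))
                        for _ in range(num_experts)]
        self.gate = rng.normal(size=(in_dim, num_experts))

    def forward(self, x):
        # Gate scores decide how much each expert contributes per input.
        scores = x @ self.gate                       # (batch, num_experts)
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        outs = np.stack([x @ w for w in self.experts], axis=1)
        return (weights[..., None] * outs).sum(axis=1)

    def grow(self):
        # gMoE-style growth: add one expert and a matching gate column.
        self.experts.append(rng.normal(size=(self.in_dim, self.out_dim)))
        self.gate = np.concatenate(
            [self.gate, rng.normal(size=(self.in_dim, 1))], axis=1)

layer = MoELayer(in_dim=4, out_dim=3)
y = layer.forward(rng.normal(size=(8, 4)))   # output shape (8, 3)
layer.grow()                                 # now three experts
y2 = layer.forward(rng.normal(size=(8, 4)))  # same output shape
```

Because growth only appends parameters, the existing experts keep what they have already learned while new capacity is added as more data streams in.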
The team conducted experiments on the MNIST and CIFAR-10 datasets and a collection of English-language texts comprising books, Wikipedia and Common Crawl to assess the effects of receiving data over time, and to determine which models strike the best trade-offs between time, accuracy, compute and memory usage.
In the MNIST experiment, the backbone architecture of the finetuning baseline is a three-layer fully connected neural network with ReLU units. For the CIFAR-10 experiment, the backbone architecture is a scaled-down version of a VGG19 convolutional neural network. For the language modelling task, the team used a Switch Transformer.
The results show that models that update their parameters at an intermediate rate strike the best trade-off between accuracy and time, that bigger models generalize better, and that models that grow capacity over time can also generalize better, particularly when the initial model is smaller.
The team believes that because ALMA mimics real-life learning scenarios, where the goal is to solve a task efficiently even as more training data continues to arrive, it can make an important contribution to anytime learning research and help researchers obtain better models.
The paper On Anytime Learning at Macroscale is on arXiv.
Author: Hecate He | Editor: Michael Sarazen, Chain Zhang