Graham Taylor from the University of Guelph gave a talk at the University of Toronto, summarizing current techniques for addressing the issue of insufficient labeled data. He brought up two new perspectives on the problem and introduced corresponding models applying those ideas.
Not surprisingly, the greatest advancements in deep learning have emerged from fields with large labeled datasets. We can hardly imagine computer vision without CIFAR, or speech recognition without TIMIT. The behind-the-scenes effort in these datasets (millions of manually labelled samples) is vital to the success of the models. Still, the lack of large amounts of labelled data limits the applicability of many deep learning models in specific applications, because in most cases generic datasets cannot stand in for domain-specific training data. If you asked deep learning researchers what their most desired Christmas gift is, most would shout "a labeled dataset that perfectly suits my needs" without hesitation. In his recent talk at the University of Toronto, Graham Taylor, with his data augmentation models, was like Santa with an enormous bag of gifts.
Graham proposed two ways of overcoming the data obstacle: expanding the dataset, or using the existing one more efficiently. In the talk, he summarized the existing approaches as follows:
- pre-training
- synthesizing
- data augmentation
Pre-training belongs to the "efficiency" category. It uses a large generic dataset to pre-train your model (not necessarily even in the domain you're interested in), then uses the small domain-specific dataset to retrain the top layer or fine-tune the whole network. This method works well if your domain is not too specialized. Take vision for example: if you want to classify microscope or remote-sensing images, don't expect great results from pre-training on ImageNet.
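The "retrain only the top layer" recipe can be sketched in a few lines. This is a toy illustration, not Graham's setup: the "pretrained" feature extractor is just a frozen random projection, and the labels are constructed so they are recoverable from those frozen features; in practice the frozen part would be the lower layers of a network trained on a large generic dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained feature extractor: a frozen random projection.
# In a real setting these would be the lower layers of a network trained
# on a large generic dataset (e.g. ImageNet).
W_frozen = rng.normal(size=(64, 16)) / 8.0

def extract_features(x):
    """Frozen 'pretrained' layers: never updated during fine-tuning."""
    return np.tanh(x @ W_frozen)

# Small domain-specific dataset; toy labels are constructed so that
# they are recoverable from the frozen features.
X = rng.normal(size=(200, 64))
y = (X @ W_frozen[:, 0] > 0).astype(float)

# Retrain only the top layer (logistic regression) on the frozen features.
F = extract_features(X)
w, b = np.zeros(16), 0.0
lr = 0.5
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))   # sigmoid
    w -= lr * F.T @ (p - y) / len(y)         # gradient of the log loss
    b -= lr * np.mean(p - y)

accuracy = np.mean(((F @ w + b) > 0) == (y == 1))
print(f"top-layer accuracy: {accuracy:.2f}")
```

The key point is that only `w` and `b` receive gradient updates; `W_frozen` never changes, which is exactly what makes a small dataset sufficient.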
Synthesizing means using a rendering engine (like the scenery generators in video games) to expand the dataset by simulating labelled data. In theory, we can obtain an infinite amount of synthetic data as long as the simulated system is a good match for the domain of interest.
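A minimal sketch of the idea, with a toy "renderer" standing in for a real engine: because we generate each sample from a known recipe, every image comes with a free, perfect label.

```python
import numpy as np

def render(shape: str, rng) -> np.ndarray:
    """A toy 'rendering engine': draws a labelled 8x8 image.

    Real systems use game-style renderers; this stand-in just draws
    filled squares vs. crosses at a random position, plus sensor noise.
    """
    img = np.zeros((8, 8))
    r, c = rng.integers(1, 5, size=2)          # random top-left corner
    if shape == "square":
        img[r:r + 3, c:c + 3] = 1.0
    else:                                      # "cross"
        img[r + 1, c:c + 3] = 1.0
        img[r:r + 3, c + 1] = 1.0
    return img + rng.normal(scale=0.05, size=img.shape)

rng = np.random.default_rng(0)
# As long as the simulator matches the target domain, we can draw
# as many labelled samples as we like.
dataset = [(render(s, rng), s) for s in ["square", "cross"] for _ in range(100)]
print(len(dataset), "synthetic labelled samples")
```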
Data augmentation has been in common use since the early days of deep learning. As early as 1998, LeCun applied transformations to input images to make LeNet-5 more robust. The idea is quite simple: encode human understanding of images into hand-coded algorithms that crop, scale, rotate and/or add noise to the data at the input level. But while this method is ubiquitous in vision, it is hard to transfer to other domains.
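These hand-coded, label-preserving transforms take only a few lines with numpy; the snippet below is a generic sketch (not tied to the LeNet-5 pipeline) that turns one image into several training views:

```python
import numpy as np

def augment(img: np.ndarray, rng) -> list[np.ndarray]:
    """Classic label-preserving transforms coded by hand:
    flip, rotate, add noise, and crop."""
    out = [
        np.fliplr(img),                               # horizontal flip
        np.rot90(img),                                # 90-degree rotation
        img + rng.normal(scale=0.1, size=img.shape),  # additive noise
    ]
    # random crop: keep a 24x24 sub-window of the 28x28 image
    r, c = rng.integers(0, 4, size=2)
    out.append(img[r:r + 24, c:c + 24])
    return out

rng = np.random.default_rng(0)
img = rng.random((28, 28))        # stand-in for a real input image
augmented = augment(img, rng)
print(len(augmented), "new views of one image")
```

Each transform relies on human knowledge that the operation does not change the label, which is exactly why the trick is easy in vision and hard elsewhere.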
Building on these works, Graham presented two new ideas for tackling the lack of data: one on using existing data more efficiently, the other on data augmentation.
Idea No. 1: Mental Rotation
Idea No. 1 comes from an experiment from the 1970s, in which Shepard and Metzler showed that human response time to a transformation of data (such as an image rotation) is linearly related to the amount of change (such as the number of degrees rotated). When humans perform tasks like classification or clustering, we transform objects toward each other and then compare them to judge whether they are similar. Such transformations should therefore be allowed, and implemented, when we train a machine to do the same job.
But not all transformations are welcome. Today's neural networks are powerful enough to morph one image into another even when the two share no similarity, so we want to restrict the allowed transformations to prevent the model from magically turning any image into any other. Graham introduced the factored gated RBM (fgRBM) to implement this idea.
The model consists of two parts: a transformational relational model that learns the manifold of "mental rotation", and a similarity learning model that takes pairs of inputs and assesses their perceptual similarity in some feature space.
The model is used together with a k-NN classifier to solve classification problems. You can either use it to generate more training samples for the k-NN, or replace the Euclidean distance in k-NN with the learned transformational distance. Two experiments, one on identity recognition and one on rotation-invariant classification, validate the method. Results show significant improvements in accuracy: you can shrink the database by more than 70% and still reach an accuracy similar to the original k-NN's.
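The "replace the distance" variant is easy to see in code. Below is a hedged sketch: the k-NN metric is a pluggable function, and `transformational` is only a crude placeholder (distance after brightness alignment) standing in for the fgRBM's learned cost of morphing one input into another.

```python
import numpy as np
from collections import Counter

def knn_predict(query, X_train, y_train, distance, k=3):
    """k-NN where the metric is pluggable: pass Euclidean distance,
    or a learned 'transformational distance' in its place."""
    d = np.array([distance(query, x) for x in X_train])
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

def euclidean(a, b):
    return np.linalg.norm(a - b)

def transformational(a, b):
    # Placeholder for the learned model's cost of transforming a into b.
    # Here: compare after aligning global brightness, a crude stand-in
    # for 'transform toward each other, then compare'.
    return np.linalg.norm((a - a.mean()) - (b - b.mean()))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = (X[:, 0] > 0).astype(int)
q = X[0] + rng.normal(scale=0.01, size=10)   # a slightly perturbed query
print(knn_predict(q, X, y, euclidean), knn_predict(q, X, y, transformational))
```

The design point is that nothing else in k-NN changes: swapping in a learned metric touches one argument, which is what makes the hybrid practical.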
Idea No. 2: Back to Augmentation
The second idea is to carry traditional techniques like interpolation, extrapolation and noise injection over to high-dimensional data.
The difficulty lies in the fact that most high-dimensional data manifolds are "highly twisted and curved, and only occupy a small volume of the input space" (Ozair, 2014). How can we smooth out and increase the volume of those manifolds? The intuition is to reduce the dimensionality of the data. But how do we reduce dimensionality without losing necessary information? In an autoencoder, the hidden units that connect the encoder to the decoder, called context vectors, offer a brilliant example of capturing everything necessary in a much lower dimension. So why not carry out data augmentation in feature space instead of input space?
Graham used a sequence-to-sequence (seq2seq) autoencoder and generated more context vectors with techniques like interpolation, extrapolation and noise injection. These vectors can be used directly as inputs to training tasks, or decoded back to input space and presented as new samples of the dataset. This is quite interesting, as it can be regarded both as an application of unsupervised learning and as a potential way to evaluate unsupervised learning algorithms.
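The three feature-space operations are just vector arithmetic once you have the context vectors. A minimal sketch, assuming random vectors as stand-ins for the autoencoder bottleneck (training the seq2seq model itself is out of scope here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in context vectors: in the talk these come from a seq2seq
# autoencoder's bottleneck; here they are just random feature vectors.
c_i, c_j = rng.normal(size=32), rng.normal(size=32)
lam = 0.5

interpolated = c_i + lam * (c_j - c_i)               # move toward a neighbour
extrapolated = c_i + lam * (c_i - c_j)               # push past the neighbour
noisy        = c_i + rng.normal(scale=0.1, size=32)  # jitter in feature space

# Each new vector can be fed to the task directly, or decoded
# back to input space and inspected as a 'new sample'.
synthetic = np.stack([interpolated, extrapolated, noisy])
print(synthetic.shape)
```

Because the manifold is smoother in feature space than in input space, these simple linear moves are more likely to land on plausible samples than the same arithmetic applied to raw pixels.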
To summarize the talk, Graham Taylor introduced many methods for dealing with the insufficient-data problem plaguing deep learning researchers and developers. Before saving the world, let's first save ourselves from the anxiety of data shortage!
Analyst: Luna Qiu | Localized by Synced Global Team : Xiang Chen