## 1 Introduction

Deep learning has become an essential toolbox used across a wide variety of applications, research labs, and industries. In this tutorial given at NIPS 2017, the speakers provide a set of guidelines to help newcomers to the field understand the most recent and advanced models and their application to diverse data modalities.

## 2 Practice

It is common to think of deep learning as a toolbox, with a rich supply of papers, source code, and tutorials available to anyone interested. Generally, users decide on the aspects of their neural network architecture at the model level: what the inputs and outputs are, what the task is, and how the network is optimized to perform the task. Users can put these together and obtain a working model. However, zooming out from the model level, there are also important decisions to be made before constructing the model. These include the following issues:

- Platform
  - How will the model be deployed?
  - What hardware will the model be trained on, e.g., GPUs vs. CPUs?
- Framework
  - What are the differences between the available frameworks, and which one should be selected?
  - What are the limitations of the selected framework?
  - Is the selected framework suitable for the platform of interest?
- Dataset
  - There is a vast number of datasets to work with. Which is most appropriate?
  - What does the dataset look like, how large are its dimensions, etc.?

Once these decisions are made, the user can zoom in on the model level, and focus on decisions which impact the neural network architecture. These include the following decisions:

- **Activations**: Which non-linearities should be chosen, e.g., ReLU, sigmoid, tanh, GRU.
- **Algorithms**: Which optimizer should be chosen, e.g., SGD, Momentum, Adam.
- **Connectivity patterns**: Which type of connection does the neural network take on, e.g., fully connected, convolutional, recurrent, recursive.
- **Loss function**: Which type of loss function should the network optimize, e.g., cross entropy, MSE, adversarial.
- **Hyperparameters**: These include learning rate, layer size, batch size, dropout rate, weight initialization, etc.

These are all important decisions that go into the process of building a neural network. Nando de Freitas categorizes these decisions into three main components: Inputs and Outputs, Architectures, and Losses.

### 2.1 Inputs and Outputs (I/O)

The most common I/Os are in the form of vectors. The elements of these vectors are often the attributes of interest in the data. These vectors are often weakly structured, i.e., elements corresponding to different attributes may take on different types of data, or vary by orders of magnitude.

Images are also an important type of I/O. Images generally have much higher dimensions than vectorized inputs, and can be used for a wide range of applications, e.g., classification, segmentation, generative models, and art.

Another type of I/O is sequences. Some common sequences include words/letters, speech, videos, and sequential decision making. Sequences are, in a sense, an extension of images.

### 2.2 Architectures

Here, Nando de Freitas explains three key building blocks that are heavily used in deep learning. All of the architectures discussed have a common characteristic: they have the correct inductive biases. Deep learning generally tries to avoid hand-tuned or engineered features. However, it is beneficial to build the correct inductive biases into the architectures. This concept of inductive biases will be important in the following subsections.

**2.2.1 Convolutional Networks**

Convolutional neural networks (ConvNets) have been around for quite a long time, and are very similar to ordinary neural networks. They are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. So what is the difference between a ConvNet and an ordinary neural network? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the number of parameters in the network.

With the explicit assumption on the inputs, the key inductive bias that convolutional neural networks use is invariance. Two properties are of interest when dealing with images: locality and translation invariance. Locality means that nearby pixels are correlated. Translation invariance means that the appearance of an object is independent of its location. By incorporating locality as an assumption, the architecture can go from fully connected to locally connected, reducing computation without losing much information. The assumption of translation invariance means that, whether an object is in the top-left or the bottom-right, the network uses the same filter to analyze the image (the weight matrix of a convolutional layer is often called a convolution kernel, or filter).
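The two assumptions above can be illustrated with a minimal sketch in plain Python. The image size, the 2×2 filter, and the helper names `conv2d_valid` and `blank` are illustrative assumptions, not taken from the talk. The same filter is slid over every location, so an object produces the same peak response wherever it appears, and the layer needs far fewer weights than a fully connected mapping of the same shape.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over every location of a 2-D image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def blank(size):
    """Return a size x size image filled with zeros."""
    return [[0] * size for _ in range(size)]


# A 2x2 "object" placed at the top-left and at the bottom-right of a 6x6 image.
top_left, bottom_right = blank(6), blank(6)
for di in range(2):
    for dj in range(2):
        top_left[di][dj] = 1
        bottom_right[4 + di][4 + dj] = 1

kernel = [[1, 1], [1, 1]]  # one filter, reused at every location (weight sharing)

resp_a = conv2d_valid(top_left, kernel)
resp_b = conv2d_valid(bottom_right, kernel)

# The peak response is identical; only its position in the output moves.
peak_a = max(max(row) for row in resp_a)
peak_b = max(max(row) for row in resp_b)

# Weight counts for mapping a 6x6 input to a 5x5 output.
fully_connected_weights = (6 * 6) * (5 * 5)  # every output unit sees every pixel
conv_weights = 2 * 2                         # one shared 2x2 filter
```

Comparing `fully_connected_weights` with `conv_weights` shows how locality and weight sharing together shrink the parameter count.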

The ImageNet challenge was a highlight for ConvNets. In 2012, ConvNets provided the first major classification error improvement, from 0.26 to 0.16 (the previous improvement was from 0.28 to 0.26 using ordinary neural nets). By 2016, ConvNets had achieved a classification error rate of 0.03. ConvNets have since become the standard for processing images. Figure 1 shows ConvNets' progress over the past 8 years.

Here, the reviewer will leave readers with several classic ConvNets to look up if interested.

- AlexNet: The first work that popularized Convolutional Networks in Computer Vision was AlexNet, developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton (2012).
- ZF Net: The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus.
- GoogLeNet: The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google.

Training a ConvNet is not an easy task. The depth (number of layers) of a ConvNet is an important design decision. Computational complexity is the main bottleneck introduced by adding layers. In theory, the convolutions within a layer can be parallelized, but the layers themselves must be evaluated one after another. As a result, ConvNets become slower with increasing depth. One way to tackle this is to use smaller convolutions (current state-of-the-art ConvNets almost always use 3×3 convolutions). For example, with a 7×7 pixel image, instead of applying a single 7×7 convolution that covers the whole image, a 3×3 filter can be applied at each of the 25 (5×5) positions where it fits. Stacking three such 3×3 layers covers the same 7×7 region with 27 weights per channel instead of 49.
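The trade-off between one large convolution and several small ones can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions without padding and counts weights for a single input and output channel. The helper names are illustrative.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions: each k x k layer adds k - 1 pixels."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


def weights_per_channel(kernel_sizes):
    """Filter weights across the stack, counting one input and one output channel per layer."""
    return sum(k * k for k in kernel_sizes)


single_7x7 = [7]
three_3x3 = [3, 3, 3]

rf_single = receptive_field(single_7x7)
rf_stacked = receptive_field(three_3x3)
w_single = weights_per_channel(single_7x7)
w_stacked = weights_per_channel(three_3x3)

# Positions visited by one 3x3 filter sliding over a 7x7 image (stride 1, no padding).
positions_3x3_on_7x7 = (7 - 3 + 1) ** 2
```

Both configurations see the same 7×7 region, but the stacked version uses fewer weights and inserts a non-linearity between layers, at the cost of extra sequential depth.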

**2.2.2 Recurrent Networks**

Recurrent neural networks (RecNets) are popular architectures that have shown great promise in many natural language processing (NLP) tasks. There are two key ingredients when processing languages: neural embeddings and recurrent language models. The discovery of these two ingredients is the reason deep learning has been able to develop into a toolbox for efficient NLP.

Embedding vectors give rise to the encoder-decoder paradigm, in which the encoder encodes the input word and the decoder produces a target word. The key insight of neural embedding is that a word, initially represented as a one-hot encoding, can be mapped to a vector. This allows systems to take text, convert it into vectors, and define a vector space representation. Recurrent language models, on the other hand, have been shown empirically to outperform other language modeling approaches. The key insight here is the vectorization of context. Each word in a sequence is one-hot encoded and used as input to the network, which predicts the next word in the sequence. The predicted next word is the one with the highest likelihood, P(w_t | w_1, w_2, …, w_(t-1)), according to some model. The problem is that the system must either keep track of all the previous words or fix in advance the number of words to keep in memory. RecNets solve this issue in a very natural way.
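The path from a one-hot word to a next-word distribution can be sketched as follows. The toy vocabulary, the embedding values, and the output weights are all made-up assumptions for illustration, and the model conditions on a single previous word only. This is exactly the limitation that recurrent models remove.

```python
import math

vocab = ["the", "cat", "sat", "mat"]  # hypothetical toy vocabulary
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vec = [0.0] * len(vocab)
    vec[index[word]] = 1.0
    return vec


def vec_times_matrix(vector, matrix):
    """Row vector times a matrix stored as a list of rows."""
    return [sum(v * row[c] for v, row in zip(vector, matrix))
            for c in range(len(matrix[0]))]


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2-dimensional embeddings, one row per vocabulary word.
embeddings = [[0.1, 0.3], [0.7, -0.2], [-0.4, 0.5], [0.2, 0.9]]
# Hypothetical output weights mapping a 2-d vector to one score per word.
output_weights = [[1.0, -1.0, 0.5, 0.0], [0.0, 2.0, -0.5, 1.0]]

# Multiplying a one-hot vector by the embedding matrix selects one row.
embedded = vec_times_matrix(one_hot("cat"), embeddings)

# Scores over the vocabulary, normalized into P(next word | "cat").
probs = softmax(vec_times_matrix(embedded, output_weights))
predicted = vocab[probs.index(max(probs))]
```

In a trained model, both matrices would be learned from data rather than written by hand.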

RecNets embed the words one at a time and maintain a hidden state that is updated after every word, so the amount of context carried forward is not fixed in advance. At each step, the RecNet takes the word embedding (multiplied by a matrix) and the previous hidden state (multiplied by another matrix), sums these two vectors, and applies a non-linearity to define the next hidden state. From the hidden state, the RecNet can predict the next word. Because the number of previous words summarized in the hidden state is flexible, RecNets have more invariance. RecNets are currently the state of the art for language models in terms of the performance they achieve on test sets.
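The update just described can be written as h_t = tanh(W_x x_t + W_h h_(t-1)). The sketch below is a minimal version of that recurrence. The choice of tanh as the non-linearity, the dimensions, and the weight values are assumptions for illustration, and bias terms are omitted.

```python
import math


def matvec(matrix, vector):
    """Matrix (list of rows) times a column vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x, h_prev, W_x, W_h):
    """One recurrent update: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    from_input = matvec(W_x, x)
    from_state = matvec(W_h, h_prev)
    return [math.tanh(a + b) for a, b in zip(from_input, from_state)]


# Hypothetical sizes: 2-d word embeddings, 3-d hidden state.
W_x = [[0.5, -0.1], [0.3, 0.8], [-0.6, 0.2]]
W_h = [[0.1, 0.0, 0.2], [0.0, 0.1, -0.3], [0.4, 0.0, 0.1]]

# A sequence of embedded words; it can have any length.
sequence = [[0.1, 0.3], [0.7, -0.2], [-0.4, 0.5], [0.2, 0.9], [0.0, 1.0]]

h = [0.0, 0.0, 0.0]  # initial hidden state
for x in sequence:
    h = rnn_step(x, h, W_x, W_h)  # the same weights are reused at every step
```

The final `h` has a fixed size regardless of how many words were read, which is what lets the network condition on an unbounded history.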

**2.2.3 Recurrent Networks with Attention**

A slight extension to RecNets for language models is a network which can read in a sequence of words and output another sequence of words. The main idea is that, instead of generating word-by-word, the network will generate an output from a sequence of words (and hidden states). For instance, the network will take a sequence of French words, read it all in, and then start generating the translation in English. The sequence-to-sequence (Seq2Seq) framework also relies on the encoder-decoder paradigm, where the encoder encodes a sequence and the decoder outputs a sequence.

This simple idea, along with the use of RecNets, has become the cornerstone of machine translation and has proven highly successful. Figure 2 displays the progress of RecNet-based machine translation models’ performance over the past 4 years. Machine translation is measured using BLEU (bilingual evaluation understudy), an algorithm for evaluating the quality of text which has been machine translated from one natural language to another.

A performance increase can be seen in RecNets for machine translation which eventually outperform traditional statistical models (Moses SMT) and state-of-the-art (SOTA) statistical models.
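To give a feel for what BLEU measures, here is a simplified sentence-level sketch with a single reference. It combines clipped n-gram precisions with a brevity penalty. Real evaluations are computed over a whole corpus, often with several references and smoothing, so the function names and the decision to return 0 when any precision is 0 are simplifying assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped precision: an n-gram is credited at most as often as it occurs in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=4):
    """Geometric mean of n-gram precisions times a brevity penalty (no smoothing)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # simplification: unsmoothed score collapses to zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c >= r else math.exp(1 - r / c)
    return brevity_penalty * geo_mean


reference = "the cat is on the mat".split()
perfect = simple_bleu(reference, reference)
repeated = modified_precision("the the the the the the".split(), reference, 1)
```

Clipping is what stops a degenerate output such as repeating "the" from scoring well on unigram precision.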

The reviewer will refer interested readers to papers proposing the idea of Seq2Seq type language processing.

- Auli, M., et al. “Joint Language and Translation Modeling with Recurrent Neural Networks.” EMNLP (2013)
- Cho, K., et al. “Learning Phrase Representations using RNN Encoder-Decoder for Statistical MT.” EMNLP (2014)
- Sutskever, I., et al. “Sequence to Sequence Learning with Neural Networks.” NIPS (2014)
- Bahdanau, D., et al. “Neural Machine Translation by Jointly Learning to Align and Translate.” ICLR (2015)

It has been observed, however, that fixed-size embeddings are easily overwhelmed by long inputs or long outputs, leading to a decrease in performance. Recent work on RecNets for machine translation addresses this by combining RecNets with the concept of attention.

Attention relieves this bottleneck. Attention is a mechanism that lets the model learn to focus on specific parts of the input sequence when decoding, instead of relying solely on a single fixed-size hidden state. The decoder now receives a “context” vector as an additional input, computed as a weighted combination of the encoder's hidden states, with one learned weight per hidden state. Intuitively, this allows the decoder to predict each output based on the most relevant hidden states, effectively shortening the portion of the input sequence it must consider. The Bahdanau paper above goes into more detail on this.
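One common way to realize this mechanism is sketched below. Note the swapped-in scoring function: for brevity it uses a plain dot product between the decoder state and each encoder state, whereas the Bahdanau paper uses a small learned network to compute the scores. The state values and the function name `attend` are illustrative assumptions.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(decoder_state, encoder_states):
    """Return (weights, context): one weight per encoder state and their weighted sum."""
    scores = [dot(decoder_state, h) for h in encoder_states]  # assumed dot-product scoring
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


# Hypothetical encoder hidden states, one per input word.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Hypothetical decoder state that is most similar to the first encoder state.
decoder_state = [4.0, 0.0]

weights, context = attend(decoder_state, encoder_states)
```

The context vector is recomputed at every decoding step, so each output word can draw on a different part of the input.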

Finally, the reviewer will leave the reader with some references regarding attention.

- Luong, M.-T. et al. “Effective Approaches to Attention-based Neural Machine Translation.” EMNLP (2015)
- Xu, K., et al. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” ICML (2015)
- Andrychowicz, et al. “Learning Efficient Algorithms with Hierarchical Attentive Memory.” arXiv preprint arXiv:1602.03218 (2016)

### 2.3 Losses

Loss functions map an event, or the values of one or more variables, to a real number that intuitively represents some “cost” associated with the event. Depending on the I/Os and the architecture of the system, the appropriate loss function may differ. Nando lists the common loss functions for the architectures discussed above. For convolutional architectures that perform classification, the commonly used loss function is the softmax cross-entropy, optionally with an L2 norm regularization term. For recurrent architectures, the common loss functions are the softmax cross-entropy for discrete cases and Gaussian (mixture) likelihood models for continuous cases.
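The losses named above can be sketched in a few lines. The logits, weights, and regularization strength are made-up values. The Gaussian case is shown for a single component only, not a mixture, as a simplifying assumption.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Negative log-probability of the correct class, computed via log-sum-exp."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target_index]


def l2_penalty(weights, strength):
    """L2 regularization term: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)


def gaussian_nll(x, mean, std):
    """Negative log-likelihood of x under a single Gaussian (continuous outputs)."""
    return 0.5 * math.log(2 * math.pi * std ** 2) + (x - mean) ** 2 / (2 * std ** 2)


logits = [2.0, 0.5, -1.0]    # hypothetical class scores from a network
weights = [0.5, -1.0, 2.0]   # hypothetical model weights

# Classification loss with an added L2 term.
loss = softmax_cross_entropy(logits, target_index=0) + l2_penalty(weights, strength=0.01)
```

The cross-entropy term is small when the correct class already has the largest logit and grows as probability mass shifts to other classes.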

## 3 Summary

De Freitas introduced the current practices of deep learning in image processing and natural language processing. He discussed the progress made in image processing, with a focus on ConvNet architectures. Despite their success, there are still challenges regarding computational complexity and optimization techniques in training deep ConvNets. These remain very open research areas. In natural language processing, current practice is to combine RecNets with the attention mechanism to achieve the best performance. For machine translation applications, the performance of the current best models still has a lot of room for improvement. We are excited to see how much progress can be made in the near future.

**Author:** Joshua Chou | **Editor:** Hao Wang, Michael Sarazen
