
Google Brain’s Lukasz Kaiser: How Deep Learning Quietly Revolutionized NLP

Lukasz Kaiser, Senior Research Scientist at Google Brain, gives a presentation about the developments in Natural Language Processing techniques at the 2017 AI Frontiers Conference.


On January 11th, 2017, more than 20 world-class AI experts from both industry and academia came together in Santa Clara to participate in this year’s AI Frontiers Conference.

This year’s conference speakers included Jeff Dean, Head of Google Brain; Li Deng, Chief AI Scientist at Microsoft; Adam Coates, Director of Baidu’s AI Lab; and Alex Smola, Director of Machine Learning at Amazon. The speakers shared cutting-edge developments in Artificial Intelligence with over 1,500 attendees.

Lukasz Kaiser, Senior Research Scientist at Google Brain, gave a presentation on the developments in Natural Language Processing techniques and how well Google Translate works using NLP.

NLP – What are we talking about?


NLP, or Natural Language Processing, has changed tremendously in recent years thanks to developments in deep learning. NLP is a very broad term, encompassing speech, text, and the interaction between the two. For the purpose of his presentation, Lukasz talked specifically about text-to-text tasks such as parsing, translation, and language modeling.


These tasks seem easy (almost trivial) to anyone with a decent amount of education, but can a neural network achieve results similar to those of an average human? Most were skeptical until it actually happened, and when it did, everyone wanted to know how. How can a neural network understand a sentence? How can a neural network handle the complex tasks of language processing? Lukasz gave several explanations in the following slide:

“When neural networks first came out, they were built for image recognition, processing inputs with the same dimension of pixels. Sentences are not the same as images,” Lukasz said.

The number of words a sentence contains can and will vary significantly, meaning that the dimension of the input is completely irregular. To accommodate this, an RNN (recurrent neural network) is a natural choice. The next step is to train the network directly, which is difficult: if the network is built out over too many steps, the computational burden becomes too great. In later developments, the idea of LSTM (long short-term memory) became a viable solution to this problem.
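To make the point about irregular input length concrete, here is a minimal sketch of a plain recurrent network in Python. The hidden size, the weights, and the toy word vectors are all assumptions chosen for illustration, not values from the talk. The sketch only shows that the same step function is applied once per word, so sentences of any length end up as a fixed-size state.

```python
import math

def rnn_step(hidden, x, w_h, w_x):
    """One recurrent step: combine the previous state with the current word vector."""
    return [
        math.tanh(
            sum(w_h[i][j] * hidden[j] for j in range(len(hidden)))
            + sum(w_x[i][k] * x[k] for k in range(len(x)))
        )
        for i in range(len(hidden))
    ]

def encode(sentence_vectors, w_h, w_x, hidden_size):
    """Fold a sentence of any length into one fixed-size hidden state."""
    hidden = [0.0] * hidden_size
    for x in sentence_vectors:  # one step per word, however many words there are
        hidden = rnn_step(hidden, x, w_h, w_x)
    return hidden

# Hypothetical toy weights: hidden size 2, word vectors of size 3.
W_H = [[0.5, -0.1], [0.2, 0.3]]
W_X = [[0.4, 0.0, -0.2], [0.1, 0.6, 0.3]]

short_sentence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
long_sentence = short_sentence + [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]

state_short = encode(short_sentence, W_H, W_X, hidden_size=2)
state_long = encode(long_sentence, W_H, W_X, hidden_size=2)
```

Both `state_short` and `state_long` have two entries even though the inputs contain two and five words. An LSTM replaces `rnn_step` with a gated update, which this sketch does not attempt to show.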

Advanced sequence-to-sequence LSTM


LSTM gave us the ability to train RNNs. But in 1997, the development of LSTM was plagued with issues: the models were too small, and proper hardware was lacking. The lack of supporting technology meant LSTM was only a theoretical breakthrough that people could not apply.

It was only in 2014 that LSTM became viable in practice. Thanks to encoder-decoder architectures, one can now build not just a single layer but many layers of LSTM. With the layers built up, the larger model produced much better results.


Lukasz gave an example from parsing. In school, we learn that in order to read a sentence, one needs to recognize verbs and nouns before looking at the grammar. This process is shown in the parsing tree below, and it is the old standard way to build an NLP model: first enter definitions, grammar, and sentence structure, then let the neural network train in order to understand and generate sentences.


But Lukasz’s research team took a different approach: simply write the tree as a sequence, in the simplest way they could imagine, using brackets and symbols.

Using this approach, the network is trained only on the written sequences, without knowing anything about grammar trees or brackets and without any background knowledge. The problem with this approach is the shortage of data, since all the data (sequences) were written by researchers. Compared to the old training method using grammar or sentence structure, the new method seems weaker in providing background knowledge. In practice, however, the new method works much better, because the network can learn all this knowledge by itself.
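The idea of writing a tree as a flat sequence can be sketched in a few lines of Python. The nested-tuple tree representation, the labels, and the exact bracket tokens here are assumptions for illustration; the article does not give the precise format the team used.

```python
def linearize(tree):
    """Flatten a nested (label, child, ...) tree into a list of bracket tokens."""
    if isinstance(tree, str):
        return [tree]  # a leaf is just the word itself
    label, *children = tree
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)
    return tokens

# Hypothetical parse tree for the sentence "John has a dog".
tree = ("S", ("NP", "John"), ("VP", ("V", "has"), ("NP", "a", "dog")))

sequence = " ".join(linearize(tree))
# sequence == "(S (NP John )NP (VP (V has )V (NP a dog )NP )VP )S"
```

Once the tree is a plain token sequence like this, parsing becomes another text-to-text task: the sentence goes in, and the bracketed sequence is what the network learns to produce.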

LSTM can also be applied to language models. The performance of language models is measured by perplexity: the lower the perplexity, the better the performance. Compared with past models, perplexity is dropping rapidly, representing significant improvements. The best score achieved so far is 28 in 2016, compared to 67.6 in 2013. Results of such quality were once considered impossible. The decisive factor is model size.
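For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability that the model assigns to each token of a test text. The sketch below uses made-up token probabilities, not figures from the talk, to show why lower is better.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)

# A model that is always torn between 4 equally likely words has perplexity 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A model that assigns high probability to the words that actually occur scores lower.
confident = perplexity([0.9, 0.8, 0.95])
```

Read this way, a perplexity of 28 roughly means the model is, on average, as uncertain as if it were choosing among 28 equally likely next words.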

Lukasz also gave some examples of applications of LSTMs in language modelling and sentence compression.


The most impressive improvement brought by LSTM is in the area of translation. Lukasz made a comparison: at school, we learn a foreign language word by word. But what if we just learned by listening to people talk in that language? This is actually how children learn. As it turns out, this is how a neural network learns as well. In this case, the size and amount of training data are key factors.

In the graph below, translation performance is measured in BLEU scores, with a higher score representing better performance. In the last two years, we have seen an improvement from 20.7 to 26.0. According to Lukasz, model size and tuning seem to play the decisive roles here as well.
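As a rough illustration of what a BLEU score measures, the following is a simplified, single-sentence sketch: the geometric mean of clipped n-gram precisions, multiplied by a brevity penalty and scaled to 0-100. It is an assumption-laden toy, not the evaluation used for the figures above; standard BLEU is computed over a whole corpus, usually with n-grams up to length 4 and possibly several references.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU against one reference, on a 0-100 scale."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)

perfect = simple_bleu("the cat sat", "the cat sat")
partial = simple_bleu("the cat sat down", "the cat sat on the mat")
unrelated = simple_bleu("dog", "the cat sat")
```

An exact match scores 100, a translation sharing no words with the reference scores 0, and a partial match falls in between.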


About two years ago, neural networks were still being trained just to match the level of “hand-made systems”: phrase-based systems that translated phrase by phrase. Comparing the results of PBMT (the old standard translation model) and GNMT (the new model involving LSTM) on a German sentence, we can see that the new model’s result is clearly more understandable.

This result showed that the translation process doesn’t require much hand engineering, just a big network and a lot of training. “This theory also holds true for many NLP tasks,” said Lukasz.

But how good is it exactly? Can we quantify it?

With the recently launched neural network version of Google Translate, people were asked to evaluate translation results on a scale from 0 to 6 (0 means nonsense and 6 means a perfect translation). To compare the old and new systems against a human baseline, Lukasz’s team also asked human translators (people who speak the language but are not professional linguists) to translate the same material, and added their results to the evaluation. The slide below shows the resulting scores for all three.


Based on the results, we can see that the new system made huge improvements, and in some cases (English to Spanish) was almost as good as a human translator. Training on larger datasets clearly helped to give better results.


Limitations of LSTM


There are still problems to be solved with sequence-to-sequence LSTM. Lukasz listed two of them:

1) Speed limitation

These are large models, and because they depend on large amounts of data, a significant amount of computation is involved. Processing speed becomes a big problem. To shorten processing time, the TPU is an important piece of hardware that helps researchers serve these translations.

Also, since the translation process is very sequential, the result is still generated word by word even if the calculations are done quickly. Processing can be slow even for a small task. To solve this problem, new and more parallel models (Neural GPU, ByteNet) are expected to help.
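The sequential bottleneck is easy to see in a sketch of greedy decoding. The `toy_model` lookup table below is a hypothetical stand-in for a trained decoder, and the start and end markers are assumed names. What matters is the loop: each word can only be chosen after the previous one is known, so the steps cannot run in parallel.

```python
def greedy_decode(next_word, max_len=20, end="</s>"):
    """Generate one word at a time; each step depends on everything generated so far."""
    output = []
    for _ in range(max_len):
        word = next_word(output)  # must wait for the previous words
        if word == end:
            break
        output.append(word)
    return output

# Hypothetical stand-in for a trained decoder: picks the next word from the last one.
TOY_TABLE = {"<s>": "das", "das": "ist", "ist": "gut", "gut": "</s>"}

def toy_model(prefix):
    last = prefix[-1] if prefix else "<s>"
    return TOY_TABLE.get(last, "</s>")

result = greedy_decode(toy_model)
# result == ["das", "ist", "gut"]
```

In a real system, `next_word` would be a full pass through a large network, so an output of N words costs N passes run one after another.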

2) Requirement of Data

Sequence-to-sequence LSTMs require a lot of data. To address this, attention and other new architectures that increase data efficiency are suggested. Another approach could be the use of regularizers such as dropout, confidence penalty, and layer normalization.
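Of the regularizers mentioned, dropout is the simplest to sketch. The version below is the common "inverted dropout" formulation, shown on a plain list of activations. The function name, the rate, and the seed are assumptions for illustration, and the article does not say which variant was used in the translation models.

```python
import random

def dropout(values, rate, training=True, rng=None):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)  # dropout is disabled at inference time
    rng = rng or random.Random()
    keep = 1.0 - rate
    # Dividing by `keep` leaves the expected value of each activation unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, rate=0.5, rng=random.Random(0))
at_inference = dropout(activations, rate=0.5, training=False)
```

Because a different random subset of units is silenced on each training step, the network cannot rely on any single unit, which tends to help when training data is limited.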




Deep learning has profoundly changed the field of NLP, and the introduction of sequence-to-sequence LSTMs has yielded state-of-the-art results on many NLP tasks. Using LSTMs in production, Google Translate has achieved huge improvements in the quality of machine translation. However, new models are still required to address the problems with LSTMs, especially regarding processing speed and dependence on large amounts of data.


Original Article from Synced China | Analyst: Shaoyou Lu | Localized by Synced Global Team : Xiang Chen, Meghan Han
