Dr. Lei Li is the chief scientist and director of Toutiao Lab. He was previously a scientist at Baidu's Deep Learning Lab in the United States. He received a Bachelor of Science in Computer Science from Shanghai Jiao Tong University and a doctorate in Computer Science from Carnegie Mellon University; his thesis was recognized by ACM SIGKDD as one of the best doctoral dissertations. He has also worked at Microsoft Research, Google, IBM TJ Watson, and the University of California, Berkeley. To date, he has published more than 30 papers on machine learning and natural language understanding at international conferences, and he holds three U.S. technology patents. Recently, Synced conducted a special interview with Dr. Li.
Below is Dr. Li’s talk on how machines understand language:
Good afternoon, everyone. It is my honour to be here and speak with professionals and experts from various fields about what artificial intelligence can do for natural language understanding. I am going to explain how to apply machine learning to natural language understanding, to chatting with human users, and to producing machine-written news articles. To complete these tasks, we need to know which machine learning tools and basic algorithmic models are required. After that, I will talk about how we realized the technology. Finally, we will discuss artificial intelligence in general, including current AI technologies, the challenges we face, and some of my own thoughts.
At the beginning of 2016, Google DeepMind used its program AlphaGo to show that machine learning has already reached or surpassed human performance on certain tasks. How did AlphaGo learn to play Go? Its algorithm consists of two parts: one from deep learning, and one from reinforcement learning and Monte Carlo tree search. I will focus on the deep learning part. Based on the achievements of neural networks and deep learning over the past twenty or thirty years, we know that deep learning is very good at solving problems like playing Go. This kind of problem is called supervised learning. What is supervised learning? You are given data X and asked to predict some output Y; you need machine learning methods to find the mapping function that maps X to Y.
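The idea of learning a mapping from X to Y can be made concrete with a minimal sketch. This toy example, with made-up data, fits a linear function y = 2x + 1 from labeled (X, Y) pairs by least squares; the learned `f` is the mapping function described above.

```python
import numpy as np

# Toy supervised learning: given pairs (X, Y), learn a mapping f(X) ≈ Y.
# Hypothetical data sampled from y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([1.0, 3.0, 5.0, 7.0])

# Add a bias column so the model can learn an intercept.
X_b = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(X_b, Y, rcond=None)

def f(x):
    """The learned mapping from input x to predicted output y."""
    return w[0] * x + w[1]

print(round(f(4.0), 3))  # close to 9.0
```

Real deep learning replaces this linear function with a neural network and fits it by gradient descent, but the framing is the same: find the f that maps X to Y.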
Take an image as the input; the output is then a label that classifies the content of the image. Is it a cat or a dog? This is an image classification problem. If the input is an audio recording of a Chinese sentence, the output could be the audio of the corresponding English sentence; this conversion from Chinese to English is machine translation, also under supervised learning. The third example: the input is an image, and the program produces a passage of text describing it. When we were young, we all learned to describe the pictures we saw; in theory, machines can do the same, and we can model this scenario as a supervised problem. The fourth example: the input is an audio recording, and the output is the corresponding text. This is speech recognition, also a supervised learning problem. We can also reverse the direction: the input is a passage of text, and the output is the corresponding audio. This is speech synthesis, again under supervised learning. As long as there is enough data and an appropriate model, deep learning can do a great job on these kinds of problems.
Yet how does deep learning do this? The way deep learning, or artificial neural networks, works is inspired by the human brain. The brain has a huge number of neurons, and each neuron can only perform a very simple task; connected together, however, they can do complicated things as a whole.
Inspired by the neurons in the human brain, the pioneers of artificial intelligence created artificial neurons. An artificial neuron also has inputs; the inputs are processed by a nonlinear function, which then produces the output. Connected in certain ways, these artificial neurons can accomplish complicated tasks.
For instance, suppose the input is an image and the task is to identify the digit it shows. A neural network with a single hidden layer can do this, and you can increase the number of hidden layers to improve the network's recognition ability.
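A single-hidden-layer network of the kind just described can be sketched as a forward pass: inputs flow through a nonlinear hidden layer, then to a 10-way output for the digits 0-9. The weights here are random placeholders rather than trained values, so this only illustrates the structure, not actual recognition.

```python
import numpy as np

rng = np.random.default_rng(0)

# One hidden layer between a 28x28 image input and 10 digit classes.
# Weights are untrained random placeholders for illustration only.
n_in, n_hidden, n_out = 28 * 28, 128, 10
W1 = rng.normal(0, 0.01, (n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.01, (n_hidden, n_out))
b2 = np.zeros(n_out)

def forward(x):
    h = np.tanh(x @ W1 + b1)           # hidden layer: linear map + nonlinearity
    logits = h @ W2 + b2               # output layer: one score per digit
    e = np.exp(logits - logits.max())  # softmax turns scores into probabilities
    return e / e.sum()

probs = forward(rng.random(n_in))
print(probs.shape)  # (10,), one probability per digit
```

Adding more hidden layers means stacking more `tanh(x @ W + b)` steps between input and output, which is what gives "deep" learning its name.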
How do these deep learning techniques relate to our company, Toutiao? Toutiao is a news platform that provides users with the latest news. The platform has three crucial production procedures: 1) how to produce high-quality content, 2) how to deliver this content to interested users, and 3) how to encourage users to comment after they finish reading an article or watching a video. The core technology behind all three tasks requires artificial intelligence.
Today I will talk about two of the three: content production and content discussion, that is, how we build machines to discuss content with users and how robots produce articles automatically. The major problem we have is language, which is quite different from the image problems I mentioned earlier. An image input has a fixed size, while language does not: a sentence can be long or short. So here is the first problem: how to handle longer inputs? The deep learning models we build can handle longer input. Our initial idea was to add memory units. In this model, some units record historical information, which lets the model remember information over a longer period of time and use it to make predictions.
Figure 1. Recurrent Neural Networks
For instance, here is a very simple recurrent neural network. Its inputs are X1, X2, X3, and X4, each a vector. Its output, h, is what distinguishes a recurrent neural network from a traditional convolutional neural network: each h corresponds not only to the input at the current position but also to that of the previous position. In this way, all historical information is connected in a simple form. There is another, slightly more complicated form called the Gated Recurrent Unit. Much like adding switches to how a brain learns, these switches allow the system to memorize and forget information selectively. One of the switches, called the reset gate, lets the system forget information selectively; another switch controls the output, helping the system decide which information from the previous position to use, and which information to save for the next position.
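The gating behavior just described can be sketched as a single GRU step. The weights below are untrained random placeholders; the point is only to show how the reset gate (selective forgetting) and the update gate (how much old memory to keep) combine the previous state with the new input.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # toy input/hidden size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Untrained placeholder weights for one GRU cell.
Wz, Uz = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # update gate
Wr, Ur = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # reset gate
Wh, Uh = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # candidate state

def gru_step(x, h_prev):
    z = sigmoid(x @ Wz + h_prev @ Uz)              # update gate: keep vs. overwrite memory
    r = sigmoid(x @ Wr + h_prev @ Ur)              # reset gate: selectively forget history
    h_tilde = np.tanh(x @ Wh + (r * h_prev) @ Uh)  # candidate new state
    return (1 - z) * h_tilde + z * h_prev          # blend old memory and new information

# Run the cell over a sequence x1..x5, carrying the hidden state forward.
h = np.zeros(d)
for x in rng.normal(size=(5, d)):
    h = gru_step(x, h)
print(h.shape)  # (4,)
```

The hidden state h is the "memory unit": it is the only channel through which earlier inputs influence later predictions.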
With these memory-equipped neural networks, chatbots are capable of communicating with users automatically. For instance, here is a short excerpt from one of the conversations our chatbot had.
Our chatbot can not only chat with users but also determine the sentiment of the content. Even when the input is very long, the system can still give a reasonably appropriate response. Here is an excerpt of some famous movie lines, to which the machine does give a positive response.
How does the system generate a response in conversation? Here is a simple demonstration. It starts with a recurrent neural network that has an initial state, shown as the yellow rectangle in the picture. From the hidden information in the current state, the network predicts which word should be output, and then uses that output to generate the next state. The predicted word becomes part of the input for the next state. In this way, the system keeps generating the second, third, and fourth words, until it reaches the end of the sentence.
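The generation loop above can be sketched without any neural network at all: each step feeds the previously predicted word back in as the next input, until an end marker appears. The `transitions` table here is a made-up stand-in for the trained model's next-word prediction.

```python
# Hypothetical stand-in for a trained model: maps the current word to the next.
transitions = {
    "<start>": "nice",
    "nice": "to",
    "to": "meet",
    "meet": "you",
    "you": "<end>",
}

def generate(max_len=10):
    """Greedy generation loop: predict, feed the prediction back, repeat."""
    word, output = "<start>", []
    for _ in range(max_len):
        word = transitions[word]  # predict the next word from the current state
        if word == "<end>":       # stop at the end-of-sentence marker
            break
        output.append(word)       # the prediction becomes the next input
    return " ".join(output)

print(generate())  # nice to meet you
```

In the real model, the lookup is replaced by a recurrent network's softmax over the vocabulary, but the feedback loop is identical.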
That scenario ignores context. What if there is context, and the system needs to respond to it? We can use a recurrent neural network to build a model of the previous sentence, the context. Each word in the previous sentence is treated as a vector; by processing all of these vectors together, the entire sentence becomes a single vector, which serves as the input to the initial state of the recurrent neural network that generates the response. The system then uses the same method I mentioned above to generate the words of a complete response.
Our robots can not only generate conversations but also add sentiment to them. How do emotions come into play? We add an extra excitation to the model, which can be an emotion excitation. If we want the conversation to be happy, angry, or sad, we add the corresponding emotion excitation, and the system then generates content with that emotional tone. However, what we have so far is only a chatbot; it cannot answer questions that require deeper background knowledge.
The second thing I want to talk about is how to build a model that enables the robot to answer deeper questions. To present knowledge in a way a computer can understand, we need a structured representation. Take David Beckham as an example: we can organize all the knowledge related to him into a graph. Each node in the graph is an entity, and the edges between entities correspond to relationships between them. For example, that Beckham was born in Leytonstone is a piece of knowledge about Beckham. We can encode this information as a triple: the subject entity, the object entity, and the relationship between them.
How does the machine answer questions automatically? To answer a question like “Where was Beckham born?”, we find the corresponding knowledge triple in the knowledge base: <DavidBeckham, PlaceOfBirth, Leytonstone>. The answer is “Leytonstone.”
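A minimal sketch of this lookup, using a tiny hand-written knowledge base keyed by (subject, relation) pairs; the second triple is a made-up example, not from the source.

```python
# Tiny knowledge base of (subject, relation) -> object triples,
# following the Beckham example. The second entry is illustrative only.
triples = {
    ("DavidBeckham", "PlaceOfBirth"): "Leytonstone",
    ("DavidBeckham", "Profession"): "FootballPlayer",
}

def answer(subject, relation):
    """Return the object entity matching the question's subject and relation."""
    return triples.get((subject, relation), "unknown")

print(answer("DavidBeckham", "PlaceOfBirth"))  # Leytonstone
```

The hard part of question answering, as the next paragraph explains, is not this lookup but mapping a free-form question onto the right subject and relation.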
Why is it difficult for a computer to answer questions like this? The first challenge is that our language is very complicated: we can ask the same question in various ways. For instance, “Where was President Obama born?” is the same as “What is President Obama’s place of birth?” This is the variety and complexity of our language. The second problem is ambiguity. Many entities in our database have the same or similar names. For example, if we ask “Who is Michael Jordan?”, some people will think of the famous basketball player, but for professionals in machine learning there is also a well-known expert named Michael Jordan. This ambiguity of names is the second challenge. The third difficulty comes from data sparsity. We do have a huge amount of data: after filtering for the triples we need, there are still 22 million knowledge triples, drawn from Google’s Freebase. Yet overall we have only about 100,000 labeled question-answer pairs, and it is very difficult to use those 100,000 labeled examples to answer questions over 22 million triples.
Recently we built a system called CFO, a deep learning system that can answer more complicated questions, such as “Where did Harry Potter go to school?” We all know the answer is Hogwarts. However, besides Hogwarts, Harry also went to an elementary school, which many people do not know. Our system is able to produce both answers.
We evaluated the system on Facebook’s open datasets. Our accuracy reached 75.7%, well above the 62.9% achieved by Facebook’s own system. How did we complete the task with such high accuracy, and how did we build this kind of chatbot?
We can start with a question: “Who created the character Harry Potter?” First, we need to determine the key entities in the question; “Harry Potter” is one entity. Second, we need to determine which relationship the question is asking about, in this case “Character_Created_By”. These two pieces of information can then be used to find the answer in the database.
We use sequence labeling, a deep learning technique, to score all potential entities. By this scoring method, “David Beckham” is found to be the most probable entity for a question about Beckham. We also have an alternative model built from bidirectional recurrent neural networks. The input question is passed through layers of a bidirectional recurrent neural network to produce a vector, which the model then uses to predict which entity in the database the question refers to and which relationship it is asking about. In the end, the model finds that “Harry Potter” is the entity the question asks about, and having found the right entity, it also finds the correct answer.
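The ranking idea can be sketched as follows: encode the question into a vector, give each candidate entity an embedding, and pick the candidate with the highest score. All vectors below are made-up illustrations, not learned embeddings, and the candidate names are hypothetical.

```python
import numpy as np

# Made-up embeddings for two candidate entities with similar names.
entity_vecs = {
    "HarryPotter(character)": np.array([0.9, 0.1, 0.0]),
    "HarryPotter(film)":      np.array([0.4, 0.8, 0.1]),
}

# Pretend output of a question encoder for "Who created the character Harry Potter?"
question_vec = np.array([1.0, 0.0, 0.1])

# Score each candidate by dot product with the question vector; highest wins.
scores = {name: float(vec @ question_vec) for name, vec in entity_vecs.items()}
best = max(scores, key=scores.get)
print(best)  # HarryPotter(character)
```

In the real system, both the question encoder and the entity representations are learned jointly, which is what lets the model resolve the name ambiguity discussed earlier.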
My last topic is the robot that produces news automatically, named XiaomingBot. In 2016, before the Rio Olympics, this newsbot was built to write news articles. It wrote about 450 articles on topics such as table tennis, badminton, soccer, and tennis, gaining about one million views in 16 days. Later research showed that the views of a sports article written by a professional reporter are about the same as those of one written by XiaomingBot; sometimes XiaomingBot’s articles even got more views. XiaomingBot generates not only short news briefs but also long articles. For example, its coverage of women’s soccer was longer, with more detailed reporting on the games. XiaomingBot differs from traditional newsbots in two ways. The first is speed: XiaomingBot can generate and publish a news article to readers within two seconds of a game finishing. The entire process is very quick because it is handled entirely by the machine; this is also a feature of Toutiao. The second is that it produces both short and long content. Furthermore, XiaomingBot can add pictures of the game to the article, and the generated content matches the game timeline, especially for soccer games. The news generation process combines syntactic generation techniques with machine learning, which makes our news articles more professional.
Now we have chatbots and news robots. The question is: will machines take over everything? Definitely not. So what are machines’ shortcomings, and what can’t they do? When communicating with a bot, users quickly find that what they say confuses it. Even though our chatbot can answer technical questions with 75.7% accuracy, it still cannot handle more general questions, such as explaining the mechanism behind a topic, describing the procedure for completing a task, or explaining a concept in detail. If you ask the bot “What is the meaning of life?”, it will not know how to respond. We have a good newsbot for generating sports news, but if you want to turn it into a more general text-generation robot, we are not ready for that yet.
Why do machines have all of these limitations? At the very beginning of this talk, I mentioned that deep learning and machine learning are good ways to solve supervised learning problems. However, this effectiveness on supervised problems is also a limitation, because it demands a large amount of labeled data, and the cost of labeling that data can be very high.
The second limitation is versatility, or scalability. Our chatbot can only answer technical questions; that is its current limit. What can we do to increase the scalability of our artificial intelligence, and what problems or challenges do we need to solve? Here are the three technical problems that artificial intelligence and machine learning experts are working on:
The first problem is the interpretability of machine learning models. Deep learning models do a good job of solving various kinds of problems. Nonetheless, sometimes we find a model to be good but do not know what is good about it; conversely, a model may make mistakes without our knowing why. This is the interpretability problem. Machine learning still needs more research into models and methods so that a system can not only make predictions but also explain and analyze its own behavior. When the model makes a mistake, it should know why it made that mistake and be able to analyze it, just as a human would. This is the first point.
The second problem is the ability to reason. Our models should be able to reason and to interact with other objects in the same environment. Current machine learning is still far from real reasoning; it can only do simple recognition, such as identifying the category of a thing. Performing complicated reasoning is hard to achieve, and we need to work on it more.
The third problem may be one we have ignored before: computational resources. So far, our research has focused on models, functionality, and accuracy. We have not paid much attention to the fact that these programs, which have already surpassed human intelligence on certain tasks, consume enormous computing resources; several thousand machines also consume a large amount of energy. We hope that future algorithms can reach an even higher level of intelligence while consuming far less energy.
That’s all I have. Thank you for your time!
Original article from Synced China http://www.jiqizhixin.com/article/1406 | Author: Gale Zhao | Localized by Synced Global Team: Jiaxin Su, Meghan Han, Rita Chen