
Grounded Language Learning: A Look at the Paper ‘Understanding Early Word Learning in Situated Artificial Agents’

This paper carefully engineers artificial-language learning experiments to replicate the sources of information infants learn from and the conditions under which they learn.


Modern neural network-based systems can learn and process language in order to perform associated actions. To achieve this, a neural network must accomplish so-called grounded language learning, which poses challenges that share many similarities with those infants face when learning their first words.

Infants learn the forms of words by listening to the speech they hear. Though little is known about the degree to which these forms are meaningful for infants, the words still play a role in early language development. Similarly, it is notable that while models with no meaningful prior knowledge can also overcome these obstacles, researchers currently lack a clear understanding of how they do so. The 2017 paper Understanding Early Word Learning in Situated Artificial Agents from DeepMind researchers Felix Hill, Stephen Clark, Karl Moritz Hermann and Phil Blunsom addresses this problem.


Just as infants discover and learn words early in life, it is reasonable to assume that the kinds of learning infants demonstrate can also emerge in language-learning agents, and many behaviours of infant learners have counterparts in natural language processing agents. The paper carefully engineers artificial-language learning experiments to replicate the sources of information infants learn from and the conditions under which they learn, and it explores how best to characterize the learning process and what its end result may be.

The paper is organized into the following topics.

  1. Experimental setup for training the agent and terminology used throughout
  2. The architectural aspects of the learning agent
  3. The word learning dynamics of the agent and how they relate to human language learning dynamics
  4. Particular observations in human learning and similar phenomena within artificial learning agents
  5. Analysis of the learning agent that attempts to explain its ability to induce meaningful extensions of words

Creating a 3D World for Language Learning

The experiments conducted in this paper take place in the DeepMind Lab simulated world (Beattie et al., 2016), a first-person 3D game platform designed for research and development of general artificial intelligence and machine learning systems. The DeepMind Lab can be used to study how autonomous artificial agents may learn complex tasks in large, partially observed, and visually diverse worlds.

In each episode, the agent receives a single word instruction and is rewarded for satisfying the instruction. For example, the word “pencil” could result in a positive reward if the agent “finds and bumps into a pencil”. At each time step in the episode, the agent receives a 3 × 84 × 84 (RGB) pixel tensor of real-valued visual input, and a single word representing the instruction, and must execute a movement action from a set of 8 actions. The 8 actions include move-forward, move-back, move-left, move-right, look-left, look-right, strafe-left, and strafe-right. An example of the word classes and instruction meaning is shown below in Table 1.
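The episode structure described above can be sketched as a toy interaction loop. The class below is an illustrative stand-in, not the DeepMind Lab API: the class name, its methods, and the `bumped_into` bump-detection argument are hypothetical, while the 8-action set, the reward values, and the 100-step limit come from the paper.

```python
# Illustrative stand-in for one word-learning episode (not DeepMind Lab's API).
ACTIONS = ["move-forward", "move-back", "move-left", "move-right",
           "look-left", "look-right", "strafe-left", "strafe-right"]

class ToyWordLearningEnv:
    def __init__(self, max_steps=100):
        self.max_steps = max_steps

    def reset(self, instruction, target, confound):
        self.target = target        # object that satisfies the instruction
        self.confound = confound    # distractor object
        self.t = 0
        # Placeholder for the real 3 x 84 x 84 real-valued RGB tensor.
        pixels = [[[0.0] * 84 for _ in range(84)] for _ in range(3)]
        return {"pixels": pixels, "word": instruction}

    def step(self, action, bumped_into=None):
        assert action in ACTIONS
        self.t += 1
        if bumped_into == self.target:
            return +10, True        # correct object: reward, episode ends
        if bumped_into == self.confound:
            return -10, True        # wrong object: penalty, episode ends
        if self.t >= self.max_steps:
            return 0, True          # step limit reached
        return 0, False             # episode continues

env = ToyWordLearningEnv()
obs = env.reset("pencil", target="pencil", confound="chair")
reward, done = env.step("move-forward")                        # 0, False
reward, done = env.step("move-forward", bumped_into="pencil")  # +10, True
```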

Table 1. Word classes learned by the situated agent. For the experiment, the complete set of words is 56, categorized into 5 different classes.

The episode ends after the agent bumps into any object, or when a limit of 100 time-steps is reached. In order to successfully complete the tasks, the agent must first learn to perceive its environment. The agent must actively control its visual surroundings via movement of its head, i.e., the turning actions. In addition, it must navigate its surroundings through meaningful sequences of actions.

A visual example of this experimental setup is shown in Figure 1. The agent observes two 3D rotating objects and a single-word language instruction, and must then select the object that matches the instruction.

Figure 1. Example of the word learning environment. In this example, the word chair is from the “shape” category. The agent must select chair in order to receive a reward.

The experiment has a fixed overall setup of the following factors:

  • layout of the experiment (fixed to this rectangular room)
  • the range of positions in which the agent begins an episode (towards the end of the room)
  • the locations that objects can occupy (two objects in front of the room)
  • the list of objects that can appear and the relative frequencies in which these objects appear
  • the rewards associated with selecting a certain object given an instruction word

The environment may seem incredibly constrained, but there is still a large number of unique configurations in which episodes can unfold.

A Situated Word-Learning Agent

In this section, we discuss the modules of the agent architecture. At each time step, the agent encodes its symbolic input with a language module (an embedding layer) and its visual input with a convolutional network. A feed-forward linear layer (the mixing module) combines the two encodings and passes the result to an LSTM core memory. The hidden state of the core memory (LSTM) is fed into an action predictor (a fully-connected layer plus softmax), which computes the policy, and into a value estimator used to calculate the expected reward. Figure 2 displays the architecture.

Figure 2. Schematic view of the agent architecture.

We now take a more detailed look at the agent architecture. At each time step t, the visual input v_t is encoded by the convolutional visual module and the language module embeds the instruction word l_t. The mixing module operates on the concatenation of v_t and l_t. The hidden state s_t of the LSTM is fed into an action predictor that computes the policy π(a_t | s_t), a probability distribution over possible motor actions. The state-value function estimator Val(s_t) computes a scalar estimate of the agent's state value, i.e., the expected future return. This value estimate serves as a baseline for the return in the asynchronous advantage actor-critic (A3C) policy-gradient algorithm (Mnih et al., 2016), which determines weight updates in the network in conjunction with the RMSProp optimizer (Tieleman and Hinton, 2012).
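A minimal NumPy sketch of one forward step through this pipeline is shown below, assuming small illustrative dimensions. The linear projection standing in for the convolutional visual module, the single-cell LSTM implementation, and all weight shapes are simplifications for exposition, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H, A = 56, 32, 64, 8   # vocab size, embedding dim, LSTM hidden, actions

# Randomly initialised parameters (as in the paper's first experiment).
W_embed = rng.normal(0, 0.1, (V, E))          # language module
W_vis = rng.normal(0, 0.1, (3 * 84 * 84, E))  # stand-in for the conv module
W_mix = rng.normal(0, 0.1, (2 * E, H))        # mixing module
W_lstm = rng.normal(0, 0.1, (H + H, 4 * H))   # LSTM core, gates packed
W_pi = rng.normal(0, 0.1, (H, A))             # action predictor
W_val = rng.normal(0, 0.1, (H, 1))            # value estimator

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def agent_step(pixels, word_id, h, c):
    v_t = pixels.reshape(-1) @ W_vis          # visual encoding (simplified)
    l_t = W_embed[word_id]                    # word embedding l_t
    x = np.tanh(np.concatenate([v_t, l_t]) @ W_mix)   # mixing module
    gates = np.concatenate([x, h]) @ W_lstm   # one LSTM cell step
    i, f, o, g = np.split(gates, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    logits = h @ W_pi
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()                    # pi(a_t | s_t) via softmax
    value = (h @ W_val).item()                # Val(s_t)
    return policy, value, h, c

h, c = np.zeros(H), np.zeros(H)
pixels = rng.random((3, 84, 84))              # one 3 x 84 x 84 observation
policy, value, h, c = agent_step(pixels, word_id=7, h=h, c=c)
```

In training, the policy and value outputs would feed the A3C loss; here we only run the forward pass.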

The Word-Learning Dynamics

In the first experiment, the weights in the agent network are randomly initialized. The agent is trained on episodes with instruction words referring to the shape, colour, pattern, relative shade, or position of objects. The episode settings are as discussed previously: the agent starts at one end of a small room with two objects at the other end. Single instruction words are presented as discrete symbols at each time step, and all words appear with equal frequency during training. The instruction word in each episode unambiguously specifies one of the two target objects. The agent receives a +10 reward if it bumps into the correct object, a −10 reward if it bumps into the wrong object, and 0 if the maximum number of time steps is reached.

It was observed that the agent slowly learned to respond correctly to the words it was presented with. In addition, at some point the rate of word learning accelerated rapidly. This is an interesting observation as this phenomenon is also observed in young infant learners! Separate experiments were also carried out by training the agent with RL algorithms from pixel inputs. In both cases, the agent was able to walk directly up to the two objects and reliably identify the appropriate object by the end of training. The training process was observed to accelerate if the agent had prior knowledge of some words. Experiments were carried out on agents with prior knowledge of 2 words and 20 words. This was done by training the agent on the word-learning task but restricting the vocabulary to the 2 and the 20 words, respectively. The agent pre-trained on 20 words learned new words more quickly. This phenomenon is similar to that of human development, where learning becomes easier the more the learner knows about the language. These procedures/observations are displayed in Figure 3.

Figure 3. Learning trajectories for the agent (left). Vocabulary size trajectory in a human infant (right).

Experiments were also carried out to reduce the number of rewarded training episodes required before the onset of word learning, in the form of a curriculum. This was done by moderating the scope of the learning challenge faced by the agent initially, then expanding its experience once word learning had started. Specifically, the agent is first trained to learn the meaning of 40 shape words under two conditions:

  1. The agent is presented with the 40 words sampled randomly throughout training.
  2. The agent is trained with only a subset of the 40 words (selected at random) until the words are mastered (as indicated by an average reward of 9.8/10 over 1000 consecutive trials).

Under the second condition, once the current subset is mastered, it is expanded to include more words. For example, the agent is initially trained with a 2-word subset; when it learns both words with high confidence, the subset is extended to 5 words, then 10, and so on until the agent is finally exposed to all 40 words. The agent following this curriculum reached all 40 words faster than an agent confronted immediately with the full set of new words. This effect aligns with the idea that early exposure to simple, clear linguistic inputs helps a child's language learning capability (Fernald et al., 2010). It also aligns with curriculum learning effects observed when training neural networks on text-based language data (Elman, 1993; Bengio et al., 2009).
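The curriculum logic can be sketched as follows, using the subset sizes 2 → 5 → 10 → 40 and the 9.8/10-over-1000-trials mastery criterion stated in the text. The class and its method names are hypothetical, not from the paper's code.

```python
from collections import deque

class Curriculum:
    """Expands the training vocabulary once the current subset is mastered
    (average reward of 9.8/10 over the last 1000 episodes, as in the paper).
    An illustrative sketch, not the authors' implementation."""
    def __init__(self, vocab, sizes=(2, 5, 10), mastery=9.8, window=1000):
        self.vocab = list(vocab)
        self.sizes = list(sizes) + [len(self.vocab)]  # final stage: all words
        self.stage = 0
        self.mastery = mastery
        self.rewards = deque(maxlen=window)

    @property
    def active_words(self):
        # words the agent is currently being trained on
        return self.vocab[: self.sizes[self.stage]]

    def record(self, reward):
        self.rewards.append(reward)
        if (len(self.rewards) == self.rewards.maxlen
                and sum(self.rewards) / len(self.rewards) >= self.mastery
                and self.stage < len(self.sizes) - 1):
            self.stage += 1        # subset mastered: expand the vocabulary
            self.rewards.clear()

shape_words = [f"shape_{i}" for i in range(40)]   # placeholder word list
cur = Curriculum(shape_words)
for _ in range(1000):
    cur.record(10)   # pretend the agent succeeds on every episode
```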

Another method to reduce the number of episodes required to achieve word learning is by applying an auxiliary learning objective on stored trajectories of the agent’s experience. In agents with this auxiliary prediction process, the final four observations of each episode are saved in a replay buffer and processed offline by the visual and language modules. The concatenation of the output of these modules is then used to predict whether the episode reward was positive, negative or zero. A cross-entropy loss on this prediction is optimized jointly with the agent’s A3C loss. This application of an auxiliary prediction loss can be seen as a rudimentary model of hippocampal replay biased towards rewarding events, a mechanism that is thought to play an important role in both human and animal learning (Schacter et al., 2012; Gluck and Myers, 1993; Pfeiffer, 2017). Figure 4. illustrates the two methods and their results in reducing the number of episodes required to achieve learning.
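A rough sketch of this auxiliary objective is given below, assuming mean-pooling over the four stored observations and a linear three-way classifier. The paper does not specify these details, so the feature dimension, the pooling choice, and the weight shapes should all be read as illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward_class(r):
    # three-way label: positive (0), zero (1), or negative (2) episode reward
    return 0 if r > 0 else (1 if r == 0 else 2)

def auxiliary_loss(features, episode_reward, W):
    """Cross-entropy on predicting the sign of the episode reward from
    features of the final four observations stored in a replay buffer.
    features: (4, D) array of visual+language encodings; W: (D, 3) weights."""
    pooled = features.mean(axis=0)      # simple pooling over the 4 frames
    probs = softmax(pooled @ W)
    return -np.log(probs[reward_class(episode_reward)] + 1e-12)

rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.1, (16, 3))      # hypothetical feature dimension of 16
feats = rng.random((4, 16))            # stand-in for the 4 stored encodings
loss = auxiliary_loss(feats, episode_reward=10, W=W)
```

In the paper, this loss is optimized jointly with the A3C loss; here it is shown in isolation.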

Figure 4. The effect of reward-prediction auxiliary loss on learning speed for an agent learning the full vocabulary of different word types (left). Word learning trajectories for an agent following a curriculum (right).

Word Learning Biases

It is observed in the language learning behaviour of infants and young children that they exploit certain labelling biases during early learning of simple words, which serve to constrain the possible referents of novel, ambiguous lexical stimuli (Markman, 1990). In particular, shape bias is a phenomenon in which infants tend to presume that novel words refer to the shape of an unfamiliar object rather than, for instance, its colour, size and texture. This phenomenon is also studied with respect to a learning agent in this paper. The DeepMind Lab environment allows for a good replication of experiments that uncover learning biases in infants.

During training, the agent learns word meanings in a room containing two objects, one that matches the instruction word (positive reward) and a confounding object that does not (negative reward). The agent attempts to learn the meaning of a set C of colour terms, a set S of shape terms and a set A of ambiguous terms. The target referent for a shape term s ∈ S can be of any colour c ∈ C and, similarly, the target referent when learning the colours in C can be of any shape. In contrast, the ambiguous terms in A always correspond to objects with a specific colour c_a ∉ C and shape s_a ∉ S. For example, a nonsense term “dax” ∈ A always refers to a black pencil during training. However, neither “black” nor “pencil” is observed in any other context.

During the learning process, periodic measurements of bias are taken, using test episodes that do not contribute to the learning process. Within a test episode, the agent receives an instruction a ∈ A and must decide between two objects:

  • o_1: shape is s_a, colour is c’ ∉ C ∪ {c_a} (e.g., a blue pencil),
  • o_2: shape s’ ∉ S ∪ {s_a}, colour is c_a (e.g., a black fork).

Neither the colour blue nor the shape fork is observed by the agent during training. In agreement with the original human experiments, the degree of shape bias in the agent can be measured by how often it selects o_1 (matching the trained shape) in preference to o_2 (matching the trained colour). Experiments were carried out to induce shape or colour biases in agents by exposing them to different training regimes and observing the resulting biases. Under the first regime, the agent is taught exclusively colour words (|S| = 0, |C| = 8). Unsurprisingly, this leads to the agent developing a strong colour bias. Under the second regime, the agent is taught an equal number of shape and colour terms (|S| = 8, |C| = 8) and still develops a colour bias. Finally, it took a larger set of shapes (|S| = 20, |C| = 0) in training for the agent to develop a (human-like) shape bias. Figure 5 displays how a shape or colour bias develops under each regime.
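Measuring the bias then reduces to counting probe choices. A minimal sketch, with hypothetical probe results chosen purely for illustration:

```python
def shape_bias(test_choices):
    """Fraction of probe episodes in which the agent picks o_1 (shape match)
    over o_2 (colour match); 0.5 is unbiased, above 0.5 is shape-biased."""
    return sum(1 for c in test_choices if c == "o1") / len(test_choices)

# Hypothetical probe results for the ambiguous word "dax": the agent picks
# the blue pencil (o_1) in 8 of 10 trials and the black fork (o_2) twice.
choices = ["o1"] * 8 + ["o2"] * 2
bias = shape_bias(choices)   # 0.8: a shape-biased agent
```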

Figure 5. Degrees of shape bias for different training regimes.

The blue lines represent the biases. The dotted blue line in the centre represents the neutral state with no bias; the solid blue line indicates a shape or colour bias depending on whether it lies above or below the neutral line. Shape bias only emerges in the regime in which the agent is trained exclusively on shapes. This does not align with the findings of Landau et al., but a potential explanation offers insight into why. Unlike shape information, colour is directly accessible to the agent as RGB values in its pixel inputs, so when the environment is balanced, the agent may favour colours over shapes. It is interesting to note that Ritter et al. were able to induce a shape bias in convolutional networks trained on ImageNet; their experiments suggest the effect is driven by the distribution of training data (ImageNet contains many more shape-based than colour-based categories) rather than by the underlying convolutional architecture.

Relating these observations to human learners, it can be concluded that environmental factors play a role in the development of biases: the agent develops a bias towards a category if that category occurs with higher frequency. Indeed, shape terms occur more frequently than colour terms in the linguistic environment of (American) children, as verified by analysis of the child-directed language corpus Wordbank.

Visualizing grounding in both action and perception

Another phenomenon of human learners studied in this paper is the ability of infants to make sense of apparently unstructured raw perceptual stimuli. This requires the learner to induce meaningful extensions of words (when there are limitless potential referents in the environment), and to organize these word meanings in semantic memory.

To give a concrete example, consider a child trying to learn the concept of the word “ball.” The child is repeatedly exposed to a little red ball until eventually they recognize it as a “ball.” Suppose that the little red ball is then substituted with a basketball. The child most likely will not mistake the basketball for their little red ball, but, in a sense, this is precisely what they will later learn — that the word ball may be used to signify any member of an indefinitely large class of objects.

Similar effects are also observed in language-learning agents. After a given period of training, there is a set of percepts the agent has previously encountered, from which it has induced a concept that enables it to recognize an indefinitely large number of future instances of the pattern.

Analyses of trained agents were performed to better understand how they solve the problem of cross-situational word learning. First, we look at a visualization of the word-embedding space of an agent trained on words from different classes. Word classes that align with both semantic (shape vs. colour) and syntactic (adjective vs. noun) categories emerge naturally in the agent's embedding space. Figure 6 displays an example of an agent's word representation space.

Figure 6. t-SNE projection of semantic and syntactic (adjective/noun) classes in the agent’s word representation space.
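The class structure that the t-SNE projection makes visible can also be checked directly with cosine similarities. The toy embeddings below are constructed with built-in class centres (not taken from a trained agent) purely to illustrate the within-class vs. between-class comparison one would run on real embeddings.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(2)
dim, n = 32, 5
shape_centre = rng.normal(size=dim)
colour_centre = rng.normal(size=dim)
# Toy embeddings: each word sits near its class centre.
shapes = [shape_centre + 0.1 * rng.normal(size=dim) for _ in range(n)]
colours = [colour_centre + 0.1 * rng.normal(size=dim) for _ in range(n)]

within = np.mean([cosine(shapes[i], shapes[j])
                  for i in range(n) for j in range(i + 1, n)])
between = np.mean([cosine(s, c) for s in shapes for c in colours])
# With class-structured embeddings, within-class similarity exceeds
# between-class similarity -- the same structure t-SNE renders in 2D.
```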

Conclusion and Final Thoughts

This paper provides an analysis of a language learning agent. The goal is to achieve a better general understanding of grounded language learning, both to inform future research and to improve confidence in model predictions. A longstanding observation is that the challenges which the learning agents face during training are similar to the challenges infants face when first learning languages. This paper studies these similarities and explores the conditions under which human biases in learning typically form. It further provides visualizations and analysis of the semantic representations in grounded language learning agents.

The paper Understanding Early Word Learning in Situated Artificial Agents is on arXiv.

Author: Joshua Chou | Editor: H4O & Michael Sarazen

