
Up Close and Personal With BERT – Google’s Epoch-Making Language Model

A recent Google Brain paper looks into Google’s hugely successful transformer network — BERT — and how it represents linguistic information internally.


Neural networks for Natural Language Processing (NLP) have advanced rapidly in recent years. Transformer architectures in particular perform very well across many different NLP tasks and appear to extract generally useful linguistic features. A recent Google Brain paper examines how Google's hugely successful transformer network, BERT, represents linguistic information internally.

Much work has been done on analyzing language processing models. Such work includes syntactic feature extraction and a geometric representation of parse trees in BERT’s activation space. In this article, Synced will give a brief introduction to the BERT model before exploring the contributions of this paper.

A quick background on BERT

BERT stands for Bidirectional Encoder Representations from Transformers. “It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks.” [1]

Per the above, BERT is a bidirectionally trained NLP model. This bidirectional training was its main innovation when it was introduced: most language models at the time read a sequence of training text either left-to-right or right-to-left, and at best combined the two directional analyses afterward.

BERT goes far beyond simply generating sentences by predicting the next word; it also handles demanding tasks such as “fill in the blank.” BERT does this with a technique called Masked LM, in which it randomly masks words in a sentence and then tries to predict the masked words using the full context of the sentence, both the left and right surroundings. Unlike previous language models such as ELMo or OpenAI GPT, it takes both the previous and next tokens into account simultaneously.
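As an illustrative sketch of the Masked LM idea (a simplification, not BERT's exact recipe — the real procedure masks roughly 15 percent of tokens and occasionally substitutes a random or unchanged token instead of [MASK]), the masking step might look like:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of tokens with a mask symbol, returning
    the masked sequence plus (position, original token) prediction targets."""
    rng = random.Random(seed)
    masked, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked[i] = mask_token
            targets.append((i, tok))
    return masked, targets

sentence = "the chef who ran to the store was out of food".split()
masked, targets = mask_tokens(sentence, mask_rate=0.3)
print(masked)   # some words replaced by [MASK]
print(targets)  # what the model must predict from the surrounding context
```

The training objective is then to recover each entry of `targets` from the masked sequence, using context on both sides of the blank.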

Syntax Geometry

The Google Brain paper, Visualizing and Measuring the Geometry of BERT, examines BERT’s syntax geometry in two ways. First, it looks beyond context embeddings to investigate whether attention matrices encode syntactic features. Then it provides a mathematical analysis of the tree embeddings.

This study is built on work by Hewitt and Manning [2]. Before going further, let’s review the common definition of syntax and the notion of a parse tree.

“When thinking about the English language (and many others as well), the meaning of a sentence is constructed by linking small chunks of words together with each other, obtaining successively larger chunks with more complex meanings until the sentence is formed in its entirety. The order in which these chunks are combined creates a tree-structured hierarchy” [3]. Each sentence’s tree-structured hierarchy is referred to as a parse tree, and the phenomenon broadly referred to as syntax.

Next we’ll discuss how a model-wide attention vector is constructed.

Attention probes and dependency representations

The authors use attention matrices to analyze the relations between pairs of words. They introduce an attention probe: a task defined on a pair of tokens (token_i, token_j) whose input is a model-wide attention vector, which is presumed to encode relations between the two tokens. Figure 1 illustrates the model-wide attention vector.


Here’s how a single model-wide attention vector for tokens (token_i, token_j) is constructed. Consider an input sequence of 10 symbols (as illustrated in Figure 1). We can think of this sentence as a sequence of vectors (in some vector space), and the attention as an encoding operation that maps these vectors into another vector space, aiming to capture diverse syntactic and semantic features simultaneously. Each encoding operation corresponds to a particular attention head, represented by an attention matrix; each layer may have multiple heads, together forming a multi-head attention structure. The model-wide attention vector is formed by concatenating the entries a_{i,j} of every attention matrix, from every attention head, in every layer. This attention vector becomes the input for the attention probe, which tries to identify the existence and type of dependency between the two tokens.
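The concatenation step can be sketched in a few lines of NumPy; here random arrays stand in for real attention weights, and BERT-large's 24 layers and 16 heads are assumed:

```python
import numpy as np

layers, heads, seq_len = 24, 16, 10   # BERT-large sizes; a 10-token example sequence
rng = np.random.default_rng(0)
# Stand-in attention weights: attn[l, h] is the seq_len x seq_len matrix of head h in layer l.
attn = rng.random((layers, heads, seq_len, seq_len))

def model_wide_attention_vector(attn, i, j):
    """Concatenate the (i, j) entry of every attention matrix, across all
    heads and all layers, into one vector describing the token pair."""
    return attn[:, :, i, j].reshape(-1)

v = model_wide_attention_vector(attn, 2, 7)
print(v.shape)   # (384,) — one entry per head per layer (24 * 16)
```

This 384-dimensional vector is what the attention probe classifiers below take as input.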

Experiment and Results

Experiment setup

The researchers’ first experiment involved running sequences through BERT to obtain the model-wide attention vectors for every pair of tokens in the sequence, excluding the “sentence start” and “sentence end” tokens ([CLS] and [SEP] in Figure 1).

The dataset for this experiment was a corpus of parsed sentences from the Penn Treebank [4]. The corpus’s constituency parses were converted to dependency parses using the PyStanfordDependencies library.

Once the model-wide attention vectors were obtained, two L2-regularized linear classifiers were trained using stochastic gradient descent. The first was a binary classifier that predicts whether a dependency relation exists between two tokens. The second was a multiclass classifier that predicts which type of dependency relation exists between the two tokens, given that the first model identifies a dependency between them. The type of dependency relation describes the grammatical relationship between the two tokens. Consider the following example:


There exists a grammatical relationship between the words within the sentence. In this case the words are classified according to the Stanford typed dependencies manual [5], as follows:

  • nsubj (nominal subject): a noun phrase that is the syntactic subject of a clause. The governor of this relation is not always a verb: when the verb is copular, the root of the clause is the complement of the copular verb, which can be an adjective or noun.
  • acomp (adjectival complement): an adjectival phrase that functions as the complement of a verb (like an object of the verb).
  • advmod (adverb modifier): a (non-clausal) adverb or adverb-headed phrase that serves to modify the meaning of a word.

As in the example, it is possible to construct a directed graph representation of these dependencies where the words in the sentence are the nodes and grammatical relations are edge labels.
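Such a graph has a direct in-memory representation. The sentence below is our own illustrative choice (“She looks very beautiful” is a classic example for nsubj/acomp/advmod), not the article's figure:

```python
# A directed graph of typed dependencies: words are nodes, and grammatical
# relations label the edges from governor to dependent.
edges = [
    ("looks", "She", "nsubj"),        # nominal subject of "looks"
    ("looks", "beautiful", "acomp"),  # adjectival complement of "looks"
    ("beautiful", "very", "advmod"),  # adverb modifier of "beautiful"
]

graph = {}
for governor, dependent, relation in edges:
    graph.setdefault(governor, []).append((dependent, relation))

for gov, deps in graph.items():
    for dep, rel in deps:
        print(f"{rel}({gov}, {dep})")   # e.g. nsubj(looks, She)
```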


In the experiment, the binary classifier achieved 85.8 percent accuracy, while the multiclass classifier’s accuracy was 71.9 percent — demonstrating that syntactic information is encoded in the attention vectors.
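The probe setup can be sketched in plain NumPy. This is a simplified stand-in, not the authors' exact training code: the data here is synthetic and linearly separable, and the binary “dependency exists” probe is a hand-rolled L2-regularized logistic classifier trained with SGD (the multiclass “which relation” probe is the same construction with a softmax over relation labels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for model-wide attention vectors (384 dimensions, as for
# BERT-large's 24 layers x 16 heads) with binary "dependency exists" labels.
X = rng.normal(size=(500, 384))
true_w = rng.normal(size=384)
y = (X @ true_w > 0).astype(float)

def train_l2_logistic_sgd(X, y, lr=0.1, l2=1e-3, epochs=20, seed=0):
    """L2-regularized linear classifier (logistic loss) trained with SGD."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            z = np.clip(X[i] @ w, -30.0, 30.0)
            p = 1.0 / (1.0 + np.exp(-z))             # sigmoid probability
            w -= lr * ((p - y[i]) * X[i] + l2 * w)   # logistic-loss + L2 gradient
    return w

w = train_l2_logistic_sgd(X, y)
acc = ((X @ w > 0) == (y > 0.5)).mean()
print(f"dependency-existence training accuracy: {acc:.3f}")
```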

Geometry of parse tree embeddings

When words are embedded in a Euclidean space, it is natural to take the Euclidean metric as the “distance” between two words. In their work, Hewitt and Manning explore possible definitions of “distance” between words in a parse tree. One is the path metric d(w_i, w_j), defined as the number of edges on the path between the two words in the tree [3]. Consider the following parse tree.


The distance between the words “chef” and “was” is 1 ( d(chef, was) = 1 ), while the distance between the words “store” and “was” is 4 ( d(store, was) = 4 ). The figure below illustrates this.
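The path metric is easy to compute with a breadth-first search. The tree below is one plausible set of edges over the Hewitt–Manning example sentence, chosen here so that the quoted distances hold; the exact parse in the paper may attach some words differently:

```python
from collections import deque

# Undirected edges of an assumed parse tree for
# "The chef who ran to the store was out of food".
edges = [("was", "chef"), ("chef", "ran"), ("ran", "to"), ("to", "store"),
         ("chef", "who"), ("to", "the"), ("was", "out"), ("out", "of"),
         ("of", "food"), ("chef", "The")]

adj = {}
for a, b in edges:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

def tree_distance(src, dst):
    """Number of edges on the unique path between two words (BFS)."""
    queue, seen = deque([(src, 0)]), {src}
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

print(tree_distance("chef", "was"))   # 1
print(tree_distance("store", "was"))  # 4
```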


Hewitt and Manning observed that parse tree distance seems to correspond specifically to the square of Euclidean distance. The Google Brain paper examines this definition more closely and questions whether Euclidean distance is a reasonable metric.

The paper first extends the idea to generalized norms, defined as the following:
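Concretely, following the paper's definition of a power-p embedding: an embedding f of a metric space (M, d) into Euclidean space is a power-p embedding if, for all x and y,

```latex
\lVert f(x) - f(y) \rVert^{p} = d(x, y)
```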


That is, the parse-tree distance d(x, y) equals the Euclidean distance between the two embedded words raised to the power p; Hewitt and Manning’s observation corresponds to the case p = 2.

The authors conducted an experiment to visualize the relationship between parse-tree embeddings in BERT and exact power-2 embeddings. The input to each visualization was a sentence from the Penn Treebank with its associated dependency parse tree. The authors extracted the token embeddings produced by BERT-large at layer 16, transformed by Hewitt and Manning’s “structural probe” matrix B.
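The probe distance itself is a squared Euclidean distance after a learned linear map. As a sketch, with random arrays standing in for the layer-16 token embeddings and for a trained probe matrix B (the probe rank of 64 is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

hidden, probe_rank, n_tokens = 1024, 64, 10   # BERT-large hidden size; rank assumed
H = rng.normal(size=(n_tokens, hidden))       # stand-in layer-16 token embeddings
B = rng.normal(size=(probe_rank, hidden))     # stand-in structural probe matrix

def probe_tree_distance(H, B, i, j):
    """Hewitt-Manning probe distance: squared Euclidean distance between
    two token embeddings after the linear transformation B."""
    diff = B @ (H[i] - H[j])
    return float(diff @ diff)

# Pairwise distance matrix; after B is trained, D[i, j] should approximate
# the parse-tree path distance between tokens i and j.
D = np.array([[probe_tree_distance(H, B, i, j) for j in range(n_tokens)]
              for i in range(n_tokens)])
print(D.shape)  # (10, 10)
```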


To visualize the tree structure, words with a dependency relation are connected with an edge. The color of each edge indicates its deviation from the true tree distance. The parse tree in Figure 2 (left) has the word “part” as its root, and beside it are the embeddings according to the Hewitt-Manning probe. In addition, pairs of words without a dependency relation but whose positions (before principal component analysis) were far closer than expected are connected with a dotted line. The resulting image illustrates both the overall shape of the tree embedding and fine-grained information on deviation from a true power-2 embedding.

It is interesting to observe that the probe identifies important dependencies between words that were not immediately obvious from the parse tree.


The researchers present a series of experiments to gain insight into BERT’s internal representation of linguistic information, providing empirical evidence that syntactic information is encoded in attention matrices. They also provide mathematical justification for the squared-distance tree embedding observed by Hewitt and Manning.

The paper Visualizing and Measuring the Geometry of BERT is on arXiv.

Useful References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.

[2] John Hewitt and Christopher D. Manning. A Structural Probe for Finding Syntax in Word Representations. Association for Computational Linguistics, 2019.

[3] John Hewitt and Christopher D. Manning. Finding Syntax with Structural Probes.

[4] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, June 1993.

[5] Stanford Typed Dependencies Manual.

Author: Joshua Chou | Editor: H4O & Michael Sarazen
