“Things, not strings” is one of the most quoted lines in discussions of knowledge bases. But what is the main difference between a thing and a string? A precise definition is hard to pin down, but examples help. The subject-relation-object triples in a traditional knowledge base, with its fixed, handcrafted relation types, are strings. Distributed representations of entities and relations that support knowledge reasoning are things. As a rough definition: things are strings with semantics.
Deep neural networks have achieved remarkable results in natural language processing. Word embeddings have enabled state-of-the-art results in semantic interpretation, and in 2016 Google Translate showed how powerful neural networks can be in machine translation. What about knowledge bases? Researchers are already working on it. In a talk at the University of Toronto, Andrew McCallum gave an introductory answer: with a method called universal schema, knowledge representation and reasoning from natural language can be done better.
What is Universal Schema?
The universal schema is “the union of all schemas”, a mixture of “structured” and “natural” schemas. It means using several knowledge bases together without trying to map them into a single schema. No single schema dominates. You can even use raw text as another kind of schema: a raw-text schema can be a large schema with many relation types inside it.
Comparison Between Different Styles of Schemas
Now let’s use the example of relation extraction to compare universal schema with other styles.
Common relation extraction styles are:
- Supervised: schemas are designed by hand, and the data is labeled by humans.
  - Downside: data labeling is painful.
- Distantly supervised: starting from an existing knowledge base with schemas we like, align unstructured text with records in the database, learn the patterns, build extractors, and run them on new text. This style is the best performer in many contexts.
  - Downside: vulnerable to errors.
- Unsupervised with no schema at all: run a dependency parser on unstructured text, then pull out the verbs as relation types and their arguments as the related entities.
  - Downside: sparse, with no ability to interpret semantics.
- Unsupervised with schema discovery: cluster relations so that if I observe one relation in a cluster, I can answer queries about the others in the same cluster.
  - Downside: the relations in one cluster are close but not the same, and asymmetric cases are not dealt with.
The usual downsides of a schema, such as being arbitrary, incomplete, hard to evaluate, or having too many boundaries, can be overcome by using universal schema.
We can use a matrix to better illustrate the method.
First, we put entity pairs in the rows and relation types in the columns. As mentioned above, those entity pairs and relation types come from many different sources: Freebase, Wikipedia, Knowledge Base Population (KBP), and so on. We fill the observations into the matrix, then run generalized principal component analysis to fill in the rest of the matrix. In the table, pairs with higher probability are marked in green and those less likely to hold are marked in pink.
From the observation of “Clinton criticized Bush”, “Clinton denounced Bush” and “Forbes denounced Bush”, we can predict “Forbes criticized Bush” with high probability.
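The matrix completion step can be sketched in a few lines of Python. This is a minimal illustration, not the method from the talk: it uses a plain truncated SVD as a simple stand-in for generalized principal component analysis, so the filled-in values are real-valued scores rather than probabilities. The `("Freeman", "Harvard")` row, the `professor at` column, and the helper names `low_rank_fill` and `score` are assumptions added for the example.

```python
import numpy as np

# Hypothetical observations: (entity pair, relation type).
# The first three come from the example above; the fourth is added
# only to give the matrix an unrelated column for contrast.
observations = [
    (("Clinton", "Bush"), "criticized"),
    (("Clinton", "Bush"), "denounced"),
    (("Forbes", "Bush"), "denounced"),
    (("Freeman", "Harvard"), "professor at"),
]

# Rows are entity pairs, columns are relation types from any source.
rows = sorted({pair for pair, _ in observations})
cols = sorted({rel for _, rel in observations})
row_index = {pair: i for i, pair in enumerate(rows)}
col_index = {rel: j for j, rel in enumerate(cols)}

# 1.0 marks an observed cell; 0.0 marks an unobserved one.
matrix = np.zeros((len(rows), len(cols)))
for pair, rel in observations:
    matrix[row_index[pair], col_index[rel]] = 1.0


def low_rank_fill(m, rank):
    """Rank-`rank` reconstruction of m via truncated SVD (a PCA-like stand-in)."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


filled = low_rank_fill(matrix, rank=2)


def score(pair, rel):
    """Reconstructed score for one cell of the matrix."""
    return filled[row_index[pair], col_index[rel]]


forbes = ("Forbes", "Bush")
print("criticized:  ", round(score(forbes, "criticized"), 3))
print("professor at:", round(score(forbes, "professor at"), 3))
```

With these four observations, the unobserved cell “Forbes criticized Bush” receives a clearly higher score than the unrelated cell pairing the same row with “professor at”, because the Forbes row shares the “denounced” column with the Clinton row.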
Universal schema is also clever in its ability to learn asymmetric entailment. If a person is a historian at a certain university, then that person is probably also a professor at that university, but not vice versa. The fact that Freeman is a professor at Harvard does not suggest that he is a historian at Harvard. Universal schema can capture that relationship clearly.
Mechanics and Learning of Universal Schema
Embedding is the key technique used here.
Take relation extraction as an example again. Entity type extraction is a unary relation extraction problem. We may have over 20 thousand entity types and over 300 thousand entities. If we build a model in the space of raw observed features directly, we may need millions of parameters.
Embed both entities and entity types into low-dimensional (50-100) latent spaces, and learn the dimensions so that semantically similar entity types or entities end up near each other. Make predictions by taking the inner product of the learned embeddings of entities (x_e) and entity types (y_t).
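As a rough sketch of the prediction step, the snippet below scores an entity against an entity type with the inner product of x_e and y_t. The toy sizes, the random (untrained) embeddings, the logistic squashing of the score into (0, 1), and the names `predict` and `top_types` are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Toy sizes; the talk mentions far larger vocabularies.
n_entities, n_types, dim = 1000, 200, 50

rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(n_entities, dim))  # one x_e per entity
Y = rng.normal(scale=0.1, size=(n_types, dim))     # one y_t per entity type


def predict(e, t):
    """Score that entity e has type t: squashed inner product of x_e and y_t."""
    return sigmoid(X[e] @ Y[t])


def top_types(e, k=5):
    """Indices of the k highest-scoring entity types for entity e."""
    scores = sigmoid(Y @ X[e])
    return np.argsort(-scores)[:k]


print(predict(3, 7))
print(top_types(3))
```

Because every prediction is a single dot product, scoring one entity against all types is one matrix-vector multiplication.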
Train by picking an observed and an unobserved instance in the same row, calculating the dot-product score of each, and comparing them so that the observed instance ranks higher.
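One way to realize this comparison is a pairwise ranking update, sketched below. It assumes a logistic ranking loss on the score difference, plain stochastic gradient ascent, and a tiny made-up `observed` matrix; the function name `ranking_step` and the learning rate are hypothetical, and the talk may use a different loss or sampler.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ranking_step(X, Y, row, pos, neg, lr=0.1):
    """One update pushing the observed cell (row, pos) above the unobserved (row, neg).

    Assumed loss: -log sigmoid(x . y_pos - x . y_neg). Updates X and Y in place
    and returns the loss measured before the update.
    """
    x, y_pos, y_neg = X[row].copy(), Y[pos].copy(), Y[neg].copy()
    margin = x @ (y_pos - y_neg)
    g = 1.0 - sigmoid(margin)  # derivative of log sigmoid(margin)
    X[row] += lr * g * (y_pos - y_neg)
    Y[pos] += lr * g * x
    Y[neg] -= lr * g * x
    return -np.log(sigmoid(margin))


# Made-up observations: True marks an observed cell.
observed = np.array([[1, 1, 0],
                     [0, 1, 0],
                     [0, 0, 1]], dtype=bool)

rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(3, 4))  # row embeddings
Y = rng.normal(scale=0.1, size=(3, 4))  # column embeddings

for _ in range(200):
    row = rng.integers(observed.shape[0])
    pos = rng.choice(np.flatnonzero(observed[row]))   # observed column in this row
    neg = rng.choice(np.flatnonzero(~observed[row]))  # unobserved column, same row
    ranking_step(X, Y, row, pos, neg)
```

Sampling the negative from the same row means the model only has to rank observed cells above unobserved ones; it never has to treat an unobserved cell as definitely false.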
Generalizing and Scaling the Model
The characteristics of universal schema make it easier to deal with unseen entries.
We can get rid of row embeddings (embeddings of entities) and estimate them on the fly by applying aggregation functions to the column embeddings of the row's observations.
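A minimal sketch of this row-less idea follows. Mean and max pooling are used here as two plausible aggregation functions; the source does not say which ones the talk used, and the column embeddings and the name `row_embedding` are hypothetical.

```python
import numpy as np


def row_embedding(Y, observed_cols, how="mean"):
    """Build a row embedding on the fly from the columns observed for that row."""
    vecs = Y[observed_cols]
    if how == "mean":
        return vecs.mean(axis=0)
    if how == "max":
        return vecs.max(axis=0)
    raise ValueError(f"unknown aggregation: {how}")


# Hypothetical learned column embeddings (3 columns, 2 dimensions).
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# A new row that was never seen in training, observed with columns 0 and 1.
x_new = row_embedding(Y, [0, 1])

# Score the new row against column 2 with the usual dot product.
score = x_new @ Y[2]
print(x_new, score)
```

Since no parameters are stored per row, a new entity needs only a few observed columns before it can be scored against every other column.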
We can also have column-less embeddings: instead of having parameters for each type, we have parameters for each single word. We can then train a pattern encoder (an LSTM) so that whenever we encounter a new type, no matter how complicated it is, the encoder can stitch the semantics of its words together to produce a column embedding on the fly.
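A column-less encoder along these lines might look like the following sketch. It is an untrained, randomly initialized LSTM written in NumPy purely to show the data flow; the class name `PatternEncoder`, the `<unk>` token, the gate ordering, and the dimension of 8 are assumptions for the example, and the weights would still need to be learned during training.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class PatternEncoder:
    """Tiny LSTM that maps a textual relation pattern to a column embedding."""

    def __init__(self, vocab, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        # Index 0 is reserved for words outside the vocabulary.
        self.index = {w: i for i, w in enumerate(["<unk>"] + sorted(vocab))}
        self.dim = dim
        self.E = rng.normal(scale=0.1, size=(len(self.index), dim))  # word vectors
        self.W = rng.normal(scale=0.1, size=(4 * dim, 2 * dim))      # gate weights
        self.b = np.zeros(4 * dim)                                   # gate biases

    def encode(self, pattern):
        h = np.zeros(self.dim)  # hidden state
        c = np.zeros(self.dim)  # cell state
        for word in pattern.lower().split():
            x = self.E[self.index.get(word, 0)]
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, o, g = np.split(z, 4)  # input, forget, output gates; candidate
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
        return h  # final hidden state serves as the column embedding


encoder = PatternEncoder({"was", "a", "historian", "professor", "at"})
y_hist = encoder.encode("was a historian at")
y_prof = encoder.encode("was a professor at")
print(y_hist.shape, y_prof.shape)
```

Because the parameters belong to words rather than to whole patterns, a pattern never seen in training still gets an embedding as long as its words are known.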
Author: Luna Qiu|Editor: Hao Wang | Localized by Synced Global Team: Xiang Chen