An IBM research team has introduced two novel prediction systems designed to identify type information for entity mentions in text documents without the need for manual annotations.
Knowledge graphs (KGs) are graphs used to accumulate and convey real-world knowledge. KG nodes capture information about entities of interest (such as people, places or events) in a given domain or task, while the edges represent the connections between them. To provide vital information for related tasks such as Knowledge Base Question Answering (KBQA), various semantic web technologies have been employed to represent KGs with explicit semantics, defining a type for each node. A “Taylor Swift” node, for example, could be classified as a “popular singer” type. Although predicting KG type information is crucial for solving KG-related and downstream tasks, most existing work in this area relies on supervised solutions that operate on relatively small-to-medium-sized type systems.
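The idea of explicitly typed KG nodes can be pictured with a minimal sketch (not taken from the paper): entities carry a declared type, and edges carry relations between entities.

```python
# A minimal illustration of a knowledge graph with explicitly typed nodes.
# The entity names, types and relations here are illustrative examples only.
kg_nodes = {
    "Taylor Swift": "popular singer",  # node -> declared type
    "Nashville": "city",
    "1989": "studio album",
}
kg_edges = [
    ("Taylor Swift", "based_in", "Nashville"),
    ("Taylor Swift", "released", "1989"),
]

def type_of(entity):
    """Look up the declared type of an entity node."""
    return kg_nodes.get(entity, "unknown")

print(type_of("Taylor Swift"))  # -> popular singer
```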
In the paper Type Prediction Systems, the IBM researchers introduce two systems for predicting type information at any granularity and without annotations. Their TypeSuggest module is an unsupervised system designed to generate types for a set of seed query terms input by the user, while the Answer Type prediction module predicts the correct answer type for user-provided questions.
TypeSuggest uses a predefined type system (TS) such as DBpedia or Wikidata as a source of potential types. Given a set of seed terms Q as input, TypeSuggest generates a ranked list of relevant types as output. The method uses the following steps:
- Entity Linking: The first step links the terms in Q to a taxonomy in TS. This is done by matching the terms against entity labels in TS, yielding a list of linked seed terms (LS), each tied to its corresponding entity in TS.
- Seed Expansion: The second step uses a pretrained Word2Vec model to expand the seed terms if the number of linked seed terms in LS falls below a minimum K. To do this, the team identifies the most similar term y that links to a valid entity in TS, and continues adding such terms y to LS at each iteration until the size of LS reaches K.
- Type Identification: The final step identifies types based on the linked seed terms. The team ranks the types using a tf-idf like function (term frequency – inverse document frequency, a metric that reflects how important a word is to a document in a collection or corpus), and returns the ranked list as an output of the TypeSuggest module.
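The three steps above can be sketched end to end. Everything here is a toy stand-in: the type system, the neighbour table (which substitutes for Word2Vec nearest-neighbour lookups) and the tf-idf-style scoring function are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter

# Hypothetical toy type system: entity label -> set of types it belongs to.
TYPE_SYSTEM = {
    "paris":   {"city", "capital", "place"},
    "london":  {"city", "capital", "place"},
    "tokyo":   {"city", "capital", "place"},
    "everest": {"mountain", "place"},
}

# Stand-in for pretrained Word2Vec nearest neighbours (assumed, not real).
NEIGHBOURS = {"paris": ["london", "tokyo"]}

def type_suggest(seeds, k=3):
    # Step 1 - Entity Linking: keep seed terms whose labels match an entity in TS.
    linked = [s.lower() for s in seeds if s.lower() in TYPE_SYSTEM]
    # Step 2 - Seed Expansion: grow LS with similar linkable terms until |LS| >= k.
    frontier = list(linked)
    while len(linked) < k and frontier:
        term = frontier.pop(0)
        for cand in NEIGHBOURS.get(term, []):
            if cand in TYPE_SYSTEM and cand not in linked:
                linked.append(cand)
                frontier.append(cand)
                if len(linked) >= k:
                    break
    # Step 3 - Type Identification: score each type tf-idf style --
    # frequency among linked seeds, discounted by how common the type is in TS.
    n_entities = len(TYPE_SYSTEM)
    df = Counter(t for types in TYPE_SYSTEM.values() for t in types)
    tf = Counter(t for s in linked for t in TYPE_SYSTEM[s])
    scores = {t: tf[t] * math.log(1 + n_entities / df[t]) for t in tf}
    return sorted(scores, key=scores.get, reverse=True)

print(type_suggest(["Paris"], k=3))
```

Starting from the single seed "Paris", the expansion step pulls in "london" and "tokyo"; the generic type "place" is then ranked below "city" and "capital" because it is common across the whole type system.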
The researchers built their Answer Type Prediction model on top of these TypeSuggest outputs. Answer Type Prediction comprises three steps: preparing type embeddings from the type vocabulary T, encoding each input question qi into its corresponding question embedding q̃i, and building a simple learning framework that takes q̃i and T as inputs and produces a ranked list of types Ti as output.
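The ranking step can be sketched as a similarity search between a question embedding and the type embeddings. The hand-made vectors and cosine ranking below are a minimal illustration under assumed embeddings; the paper trains a neural learner rather than using fixed vectors.

```python
import math

# Toy type embeddings (illustrative only; the paper learns these).
TYPE_EMB = {
    "person": [1.0, 0.0, 0.2],
    "city":   [0.0, 1.0, 0.1],
    "date":   [0.1, 0.1, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_answer_types(question_emb, type_emb=TYPE_EMB):
    """Rank candidate answer types by similarity to the question embedding."""
    scored = {t: cosine(question_emb, v) for t, v in type_emb.items()}
    return sorted(scored, key=scored.get, reverse=True)

# A question embedding that (by construction) sits near the "person" type,
# e.g. for a question like "Who wrote Hamlet?".
q = [0.9, 0.1, 0.1]
print(rank_answer_types(q))  # "person" ranks first
```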
The team describes the two proposed systems in detail, starting with a data ingestion phase, followed by a pre-processing phase, and finishing with a neural network-based learner for the Answer Type prediction module. They also demonstrate how both systems function without manual annotations, which they regard as the most appealing aspect of the research, as it makes the systems applicable “as-is” across a wide variety of domains.
The paper Type Prediction Systems is on arXiv.
Author: Hecate He | Editor: Michael Sarazen