A team of researchers from Carnegie Mellon University and Facebook AI recently introduced the tabular data model TaBERT. Built on top of the popular BERT NLP model, TaBERT is the first model pretrained to learn representations for both natural language sentences and tabular data, and can be plugged into a neural semantic parser as a general-purpose encoder. In experiments, TaBERT-powered neural semantic parsers showed performance improvements on the challenging benchmark WikiTableQuestions and demonstrated competitive performance on the text-to-SQL dataset Spider.
Since Google introduced BERT (Bidirectional Encoder Representations from Transformers) in 2018, the large-scale pretrained language model has achieved SOTA results across a wide range of NLP tasks. BERT and similar pretrained language models, however, typically train on free-form natural language text and are not equipped to tackle tasks like semantic parsing over the structured data found in typical database tables.
For example, how would a pretrained and fine-tuned language model answer the question, “In which city did [race car driver] Piotr’s last 1st place finish occur?” when given a relevant data table with columns for year, venue, position, and event? The model would need to understand the table’s set of columns (its schema) and accurately match the input text against that schema to reason its way to the correct answer. TaBERT was pretrained on a parallel corpus of 26 million tables and their surrounding context to capture associations between tabular data and related natural language text.
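To make the reasoning challenge concrete, here is a toy sketch of what a semantic parser must ultimately ground the question to. The table values and city names are invented for illustration; only the column names come from the example above.

```python
# Hypothetical toy table mirroring the example question's schema
# (year, venue, position, event). All values are made up.
table = [
    {"year": 2004, "venue": "City A", "position": 1, "event": "Grand Prix"},
    {"year": 2005, "venue": "City B", "position": 2, "event": "Grand Prix"},
    {"year": 2006, "venue": "City C", "position": 1, "event": "Grand Prix"},
]

# "Last 1st place finish" must be grounded against the schema as:
# filter rows where position == 1, take the most recent year,
# then project the venue column.
first_places = [row for row in table if row["position"] == 1]
answer = max(first_places, key=lambda row: row["year"])["venue"]
print(answer)  # -> City C
```

A model trained only on free-form text has no built-in notion of columns, rows, or this filter-then-project structure, which is the gap TaBERT's joint pretraining aims to close.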
Facebook says that unlike systems that rely on task-specific representations of input utterances and table schemas, TaBERT can be plugged into a neural semantic parser as a general-purpose encoder that computes representations for both utterances and tables. A content snapshot of the paired table is first created based on the input utterance, and a transformer then encodes each row in the snapshot into vector encodings of the utterance and cell tokens. Because these row-level vectors are computed independently, the researchers implemented a vertical self-attention mechanism that operates over vertically aligned vectors from different rows, allowing information to flow across their cell representations.
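The vertical self-attention step can be sketched in a few lines of NumPy. This is a simplified single-head illustration of the general idea, not TaBERT's actual implementation: cell vectors are assumed to already be encoded per row, and attention is then run down each column, i.e. over the same cell position across different rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, d = 3, 4, 8  # hypothetical snapshot sizes

# Row-level cell encodings, computed independently per row: (rows, cols, d).
cells = rng.normal(size=(n_rows, n_cols, d))

def vertical_self_attention(x):
    """Scaled dot-product attention over vertically aligned cell
    vectors (same column, different rows) so information flows
    across a column's cell representations."""
    # Rearrange so attention runs over the row axis per column: (cols, rows, d).
    x = x.transpose(1, 0, 2)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])  # (cols, rows, rows)
    weights = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
    weights /= weights.sum(-1, keepdims=True)
    out = weights @ x                                         # (cols, rows, d)
    return out.transpose(1, 0, 2)                             # back to (rows, cols, d)

mixed = vertical_self_attention(cells)
assert mixed.shape == cells.shape
```

After this step, each cell vector is a weighted mixture of the vectors for the same column position in every row of the snapshot, which is what lets independently encoded rows exchange information.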
In experiments, TaBERT was applied to two different semantic parsing paradigms: the classical supervised learning setting on the Spider text-to-SQL dataset, and the challenging weakly supervised learning benchmark WikiTableQuestions. The team observed that systems augmented with TaBERT outperformed counterparts using BERT and achieved state-of-the-art performance on WikiTableQuestions. On Spider, the performance ranked close to the top leaderboard submissions.
The introduction of TaBERT is part of Facebook’s ongoing efforts to develop AI assistants that deliver better human-machine interactions. A Facebook blog post suggests the approach can enable digital assistants in devices like its Portal smart speakers to improve Q&A accuracy when answers are hidden in databases or tables.
The paper TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data is available via Facebook’s content delivery network.
Journalist: Fangyu Cai | Editor: Michael Sarazen