McGill University, Facebook & Mila Release 14M Article NLP Pretraining Dataset for Medical Abbreviation Disambiguation

At the EMNLP 2020 Clinical NLP workshop last week, a Montreal-based research team introduced a large medical text dataset designed to boost abbreviation disambiguation in the medical domain.

Nowhere is correct terminology more critical than in medicine and health care, where text mining and natural language processing are used to build deep learning models for diagnosis prediction and other tasks. Unfortunately, research and clinical applications in this area have suffered from a lack of publicly available pretraining data, owing to privacy restrictions, and from a glut of non-standard abbreviations in the data that is available. The patient-safety organization Institute for Safe Medication Practices earlier this year listed no fewer than 55,000 medical abbreviations which could “fail to communicate with any certainty their intended meaning and present possible dangers to the health of patients.”

The researchers from McGill University, Facebook CIFAR AI Chair and Mila – Quebec Artificial Intelligence Institute introduced the Medical Dataset for Abbreviation Disambiguation for Natural Language Understanding (MeDAL) to sort out all those contradictory, ambiguous and potentially dangerous abbreviations.


Created from PubMed abstracts released in the 2019 annual baseline, MeDAL is a large dataset of medical texts curated for medical abbreviation disambiguation that can be used to pretrain natural language understanding models. The dataset comprises 14,393,619 articles, with an average of three abbreviations per article. The researchers say pretraining on MeDAL leads to improved model performance and faster convergence when fine-tuning on downstream medical tasks.


Unlike existing work, which treats medical abbreviation disambiguation as the end goal, the proposed approach uses abbreviation disambiguation as a pretraining task for transfer learning on other clinical tasks. Because existing medical abbreviation disambiguation datasets are very small compared to those used for general language model pretraining, the team built a dataset large enough for effective pretraining.
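To make the task concrete, here is a minimal sketch of abbreviation disambiguation framed as classification: given a text and the position of an abbreviation, pick the intended expansion. The toy samples, the `train` and `disambiguate` helpers and the word-overlap scoring are all illustrative assumptions. They are not the MeDAL data format or the models used in the paper, which are neural networks trained on millions of abstracts.

```python
from collections import Counter, defaultdict

# Hypothetical toy samples: (tokens, index of the abbreviation, correct expansion).
SAMPLES = [
    ("the patient received RA treatment for joint pain".split(), 3, "rheumatoid arthritis"),
    ("RA pressure was measured during cardiac catheterization".split(), 0, "right atrium"),
    ("joint swelling suggested RA in the patient".split(), 3, "rheumatoid arthritis"),
    ("cardiac imaging showed an enlarged RA".split(), 5, "right atrium"),
]


def context_words(tokens, loc):
    """Return the lowercased words around the abbreviation at position loc."""
    return [t.lower() for i, t in enumerate(tokens) if i != loc]


def train(samples):
    """Count the context words seen with each expansion of each abbreviation."""
    model = defaultdict(lambda: defaultdict(Counter))
    for tokens, loc, label in samples:
        model[tokens[loc]][label].update(context_words(tokens, loc))
    return model


def disambiguate(model, tokens, loc):
    """Pick the expansion whose training contexts overlap most with this context."""
    context = context_words(tokens, loc)
    scores = {
        label: sum(counts[word] for word in context)
        for label, counts in model[tokens[loc]].items()
    }
    return max(scores, key=scores.get)


model = train(SAMPLES)
print(disambiguate(model, "cardiac surgery revealed RA dilation".split(), 3))
print(disambiguate(model, "RA caused joint pain".split(), 0))
```

The two calls print `right atrium` and `rheumatoid arthritis` respectively. In a pretraining setup, the counting model would be replaced by a neural encoder whose learned weights are then reused for the downstream clinical task.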


The team evaluated the approach on downstream tasks such as mortality prediction and diagnosis prediction using LSTM, LSTM + self-attention and transformer models. On the mortality prediction task, all three pretrained models performed better than their from-scratch counterparts. On the diagnosis prediction task, the performance of both the LSTM and the LSTM + self-attention models increased by more than 70 percent.

The results suggest that pretraining on the MeDAL dataset can generally improve models’ language understanding capabilities in the medical domain.

The paper MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining is available on the ACL Anthology. The code is on GitHub, and the MeDAL dataset is on Kaggle and Zenodo. The EMNLP 2020 (Empirical Methods in Natural Language Processing) website is here.

