Content provided by Alexander Wong, the co-author of the paper UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus.
In recent years, the volume of data being collected in healthcare has become enormous. Consequently, advanced Natural Language Processing (NLP) models are needed to make use of this vast amount of data. This has led to the creation of high-performing NLP models optimized for the biomedical domain. Contextual word embedding models based on deep Transformer networks, such as BioBERT and BioClinicalBERT, have achieved state-of-the-art results in biomedical natural language processing tasks by focusing their pre-training process on domain-specific corpora. However, such models do not take into consideration expert domain knowledge.
In this work, we introduced UmlsBERT, the first Transformer network to integrate domain knowledge. We accomplished this through a novel knowledge augmentation strategy that imbues the wealth of clinical expert domain knowledge from the Unified Medical Language System (UMLS) Metathesaurus into a modified deep Transformer architecture. As a result, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models on common named-entity recognition (NER) and clinical natural language inference tasks.
What’s New: We are the first, to the best of our knowledge, to propose a deep Transformer network architecture imbued with clinical domain knowledge from a clinical Metathesaurus in order to build ‘semantically enriched’ contextual representations that benefit from both contextual learning and domain knowledge.
We proposed a new multi-label loss function for the pretraining of a deep Transformer network that takes into account the connections between clinical words sharing the same concept unique identifier (CUI) in UMLS.
We introduced a semantic group embedding that enriches the input embeddings by forcing the model to take into consideration the association of words that are part of the same semantic group.
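The multi-label idea above can be sketched in a few lines of PyTorch: instead of a one-hot masked-language-modeling target, every vocabulary word that shares the masked word's CUI is marked as a correct prediction, and a binary cross-entropy loss is used. This is a minimal illustrative sketch, not the paper's exact implementation; the `CUI_GROUPS` mapping, function name, and toy sizes are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical mapping from a token id to the vocabulary ids that share
# its UMLS concept unique identifier (CUI). Real mappings would be built
# from the UMLS Metathesaurus; these entries are purely illustrative.
CUI_GROUPS = {
    3: [3, 7],   # e.g. two surface forms of the same clinical concept
    7: [3, 7],
}

def multilabel_mlm_loss(logits, masked_token_ids, vocab_size,
                        cui_groups=CUI_GROUPS):
    """Multi-label masked-LM loss: build a multi-hot target row per
    masked position, marking every word under the same CUI, then score
    the logits with binary cross-entropy."""
    targets = torch.zeros(len(masked_token_ids), vocab_size)
    for row, tid in enumerate(masked_token_ids):
        for vid in cui_groups.get(tid, [tid]):  # fall back to one-hot
            targets[row, vid] = 1.0
    return nn.BCEWithLogitsLoss()(logits, targets)

# Toy usage: 2 masked positions over a 10-word vocabulary.
logits = torch.randn(2, 10)
loss = multilabel_mlm_loss(logits, [3, 5], vocab_size=10)
print(loss.item())
```

Compared to standard softmax cross-entropy, the binary formulation lets several vocabulary entries be "correct" at once, which is what connects words under a shared concept.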
How It Works: UmlsBERT is a deep Transformer network architecture that incorporates clinical domain knowledge from a clinical Metathesaurus in order to build ‘semantically enriched’ contextual representations that benefit from both contextual learning and domain knowledge. More specifically, we imbue UmlsBERT with clinical expert domain knowledge from the Unified Medical Language System (UMLS) Metathesaurus via a novel knowledge augmentation strategy comprising two aspects: i) connecting words that have the same underlying ‘concept’ in UMLS via a new multi-label loss function, and ii) introducing semantic group embeddings that enrich the input embeddings by forcing the model to take into consideration the association of words that are part of the same semantic group.
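The second aspect, the semantic group embedding, amounts to adding one more learned embedding table to the usual BERT input sum of token, position, and segment embeddings. The sketch below shows that modification under assumed sizes; the class name, dimensions, and group count are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SemanticGroupInputEmbeddings(nn.Module):
    """BERT-style input embeddings extended with a semantic-group term.
    All hyperparameters here are illustrative defaults, not the
    configuration used in UmlsBERT itself."""
    def __init__(self, vocab_size=30522, hidden=768, max_pos=512,
                 num_segments=2, num_semantic_groups=15):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_pos, hidden)
        self.seg = nn.Embedding(num_segments, hidden)
        # Extra table: one vector per semantic group, with index 0
        # reserved for tokens that have no associated group.
        self.sem = nn.Embedding(num_semantic_groups + 1, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids, semantic_group_ids):
        # token_ids, segment_ids, semantic_group_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1),
                                 device=token_ids.device)
        x = (self.tok(token_ids) + self.pos(positions)
             + self.seg(segment_ids) + self.sem(semantic_group_ids))
        return self.norm(x)

# Toy usage: batch of 2 sequences of length 8.
emb = SemanticGroupInputEmbeddings(vocab_size=100, hidden=32,
                                   max_pos=16)
ids = torch.zeros(2, 8, dtype=torch.long)
out = emb(ids, ids, ids)
print(out.shape)
```

Because the semantic-group vectors are summed into every token's input representation, words tagged with the same group are nudged toward related regions of the embedding space before any Transformer layer runs.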
By applying this knowledge augmentation, UmlsBERT can encode clinical domain knowledge into word embeddings for significantly improved natural language understanding performance on clinical NLP tasks. We demonstrated that UmlsBERT outperforms two popular clinical BERT models (BioBERT and BioClinicalBERT) and a general-domain BERT model on different clinical named-entity recognition (NER) tasks and on one clinical natural language inference task.
Key Insights: The main takeaway is that incorporating clinical domain expert knowledge within a deep Transformer network architecture can significantly improve clinical natural language understanding performance when compared to purely leveraging contextual learning. This can have a huge impact on clinical natural language understanding applications, particularly by enabling customization for optimal performance across different clinical specializations, where the relevant domain expert knowledge differs but remains of great importance.
As for future work, we plan to extend our work by examining the effect of augmenting contextual embeddings with medical knowledge when more complicated layers (for downstream tasks) are used atop the output embeddings, and to further explore extending Metathesaurus associations beyond the semantic groups we have leveraged so far. We hypothesize this will be particularly beneficial for downstream tasks such as automatic clinical coding.
The paper UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus is on arXiv.
Meet the author Alexander Wong, co-founding member of the Waterloo Artificial Intelligence Institute, and Chief Scientist of DarwinAI.
Share Your Research With Synced Review
Share My Research is Synced’s new column that welcomes scholars to share their own research breakthroughs with over 1.5M global AI enthusiasts. Beyond technological advances, Share My Research also calls for interesting stories behind the research and exciting research ideas. Share your research with us by clicking here.