Although modern machine learning models have made tremendous advances in natural language processing (NLP), their focus has been strictly text-based. Real-world documents, however, often also contain important visual formatting information and features such as tables, figures and charts that have their own local semantic properties. This information, along with the hierarchical structure and context-dependent nature of such documents, is not fully exploited by NLP models' plain-text sequence-to-sequence learning.
Document intelligence is a research area that aims to automate richer information extraction and understanding. While pretraining methods for advanced document understanding have shown promise, challenges remain in building general pretrained models that can handle, and benefit, a wide range of document types.
An Adobe Research and Adobe Document Cloud team addresses this in their new paper Unified Pretraining Framework for Document Understanding, presenting UDoc, a unified pretraining framework for document understanding that establishes cross-modal connections and highlights relevant information in both the visual and textual modalities. UDoc achieves impressive performance on various downstream tasks.
The team summarizes their main contributions as:
- We introduce UDoc, a powerful pretraining framework for document understanding. UDoc is capable of learning contextual textual and visual information and cross-modal correlations within a single framework, which leads to better performance.
- We present Masked Sentence Modelling for language modelling, Visual Contrastive Learning for vision modelling, and Vision-Language Alignment for pretraining.
- We present extensive experiments and analyses to validate the effectiveness of the proposed UDoc.
UDoc takes a document’s image regions and words as inputs, extracts their respective embeddings via a visual feature extractor and a sentence encoder, then feeds these embeddings into a transformer-based encoder to obtain cross-modal contextualized embeddings that merge both visual and textual features.
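The cross-modal fusion described above can be sketched roughly as follows. This is a minimal numpy illustration of gated cross-attention between sentence embeddings and image-region embeddings; the function name, shapes, and gate parameterization are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(text, image, gate_w, gate_b=0.0):
    # text:  (T, d) sentence embeddings
    # image: (R, d) image-region embeddings
    scores = text @ image.T / np.sqrt(text.shape[1])    # (T, R) similarities
    attended = softmax(scores, axis=-1) @ image         # visual context per sentence
    gate = 1.0 / (1.0 + np.exp(-(text @ gate_w + gate_b)))  # per-sentence gate in (0, 1)
    return text + gate * attended                       # gated residual fusion

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))       # 4 sentences, 8-dim embeddings (toy sizes)
image = rng.normal(size=(6, 8))      # 6 image regions
fused = gated_cross_attention(text, image, rng.normal(size=(8, 1)))
```

The gate lets the model decide, per sentence, how much visual context to mix in, while the residual connection preserves the original textual signal.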
UDoc comprises four components: feature extraction, feature embedding, a multi-layer gated cross-attention encoder, and pretraining tasks. In the feature extraction step, UDoc adopts an off-the-shelf OCR tool to extract text from document images. In the feature embedding step, UDoc combines the textual and visual embeddings with position encodings to generate multimodal embeddings. The multi-layer gated cross-attention encoder then takes a set of masked multimodal embeddings as input and is pretrained with three tasks: masked sentence modelling, vision-language alignment and visual contrastive learning.
The team’s empirical study compared UDoc with state-of-the-art methods BERT, LayoutLM and TILT on form and receipt understanding, document classification and document object detection tasks. The results show that the UDoc pretraining procedure enables it to take advantage of multimodal inputs, and that it can effectively aggregate and align documents’ visual and textual information with the proxy tasks. The team also notes that finetuning the pretrained UDoc on task-specific data can reduce data annotation costs while producing better results in document processing systems.
The paper Unified Pretraining Framework for Document Understanding is on arXiv.
Author: Hecate He | Editor: Michael Sarazen