Large Language Models (LLMs) and their multi-modal variants offer significant benefits in automating complex processes, with Document Understanding (DU) being a particularly promising application. In DU, the challenge often lies in integrating text, layout, and graphical elements to accurately extract necessary information.
In a new paper, Arctic-TILT. Business Document Understanding at Sub-Billion Scale, a research team from Adam Mickiewicz University, Jagiellonian University, and Warsaw University of Technology presents Arctic-TILT, a model specifically engineered for large-scale, cost-effective deployment while remaining adaptable to various domains. It achieves state-of-the-art performance on benchmarks for both business and long documents.
Arctic-TILT builds upon the TILT encoder-decoder model, which is itself an extension of T5. The model enhances the traditional sequential positional bias by incorporating attention biases based on the relative horizontal and vertical distances between token pairs. Additionally, it integrates contextualized image embeddings that represent the semantics of token image regions within their broader visual context.
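The 2D attention biases described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the bucketing scheme, table sizes, and function names are assumptions, chosen to show how pairwise horizontal, vertical, and sequential distances can each contribute a learned bias to the attention logits.

```python
import numpy as np

def bucketize(dist, num_buckets=8, max_dist=128):
    """Map signed distances to a small set of buckets: the sign selects one
    half of the table, and the magnitude is log-scaled into the other axis.
    (Illustrative scheme, not the exact bucketing used by TILT.)"""
    sign = (dist < 0).astype(int) * (num_buckets // 2)
    mag = np.minimum(np.abs(dist), max_dist - 1)
    b = np.floor(np.log1p(mag) / np.log1p(max_dist) * (num_buckets // 2 - 1)).astype(int)
    return sign + b

def attention_with_2d_bias(q, k, x, y, bias_h, bias_v, bias_seq):
    """q, k: (n, d) query/key matrices for one head; x, y: (n,) token box
    centers on the page; bias_*: learned (num_buckets,) bias tables for
    horizontal, vertical, and sequential relative distances."""
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)
    dx = x[None, :] - x[:, None]      # pairwise horizontal distances
    dy = y[None, :] - y[:, None]      # pairwise vertical distances
    pos = np.arange(n)
    ds = pos[None, :] - pos[:, None]  # sequential (1D) distances
    logits += bias_h[bucketize(dx)] + bias_v[bucketize(dy)] + bias_seq[bucketize(ds)]
    # softmax over keys
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

In a full model each head would own its bias tables and the biases would be added before every softmax; here a single head suffices to show the mechanism.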
To further advance TILT’s capabilities, the research team introduces several innovations: a novel approach to modality fusion, the incorporation of attention sparsity, improvements to the training process, and optimizations for both training and inference. The enhanced model is known as Arctic-TILT.
A key feature of the TILT model is its method of merging visual and textual semantics. This is achieved by summing word embeddings with RoI-pooled representations of each word's bounding box, using a variant of the U-Net network as the image encoder. The Arctic-TILT encoder block then fuses contextualized visual data from the U-Net with textual semantics from the input embeddings. The model also augments Multi-Head Attention with 1D and 2D positional biases to capture spatial and sequential relationships. This process is repeated across layers, allowing for deeper integration of information.
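The summation-based fusion step can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the RoI pooling here is plain average pooling over the box region of a single U-Net feature map, and `roi_pool`, `fuse`, and the projection matrix `proj` are hypothetical names, not the paper's API.

```python
import numpy as np

def roi_pool(feature_map, box):
    """Average-pool the image-encoder feature map under a token's bounding
    box. feature_map: (H, W, C) array from a U-Net-style encoder;
    box: (x0, y0, x1, y1) in feature-map coordinates."""
    x0, y0, x1, y1 = box
    region = feature_map[y0:y1, x0:x1]                        # crop the box
    return region.reshape(-1, region.shape[-1]).mean(axis=0)  # (C,)

def fuse(word_emb, feature_map, box, proj):
    """Sum a token's word embedding with a linear projection of its
    RoI-pooled visual features. word_emb: (d,); proj: (d, C)."""
    vis = roi_pool(feature_map, box)
    return word_emb + proj @ vis
```

In the actual model this fusion happens inside the encoder blocks and is repeated layer by layer, so the visual signal is integrated with progressively more contextualized text representations rather than only at the input.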
Empirical results show that Arctic-TILT matches the accuracy of models 1,000 times its size, making it an efficient choice for processing Visually Rich Documents with up to 400,000 tokens. Remarkably, it can be fine-tuned and deployed on a single 24GB GPU, significantly reducing operational costs. Arctic-TILT establishes new state-of-the-art results on seven diverse Document Understanding benchmarks, while offering reliable confidence scores and fast inference, both crucial for large-scale or time-sensitive enterprise operations.
The paper Arctic-TILT. Business Document Understanding at Sub-Billion Scale is on arXiv.
Author: Hecate He | Editor: Chain Zhang

