Many machine learning studies have introduced systems for deciphering ancient texts and translating them into modern language, and these have proven useful to scholars in history, archaeology and the digital humanities. Now, researchers from the University of Sheffield, Beihang University, and Open University’s Knowledge Media Institute have proposed a transfer learning approach that automatically processes historical texts at a semantic level to generate modern language summaries. In the team's evaluations, the method matches or outperforms standard cross-lingual summarization baselines on the task.
Historical text summarization can be regarded as a unique form of cross-lingual summarization. Progress with traditional cross-lingual summarization has, however, been hindered by limited historical and modern language corpora and by the evolution of vocabulary, spelling, meanings and grammar over time. To address these challenges, the researchers developed a transfer-learning-based approach.
The model was built for German and Chinese, each of which has a rich textual heritage and accessible (monolingual) training resources. German and Chinese also represent alphabetic and ideographic writing systems, respectively, which should facilitate future applications of the method to other languages.
The researchers explain that their proposed historical text summarization model is based on a cross-lingual transfer learning framework introduced in the 2019 paper A Survey of Cross-lingual Word Embedding Models, and that it can be bootstrapped even without cross-lingual (historical to modern) supervision or data.
As this is the first study of its kind on historical text summarization, there were no similar methods available for model performance comparisons. The researchers note that such summaries are mostly required for narrative texts such as news, chronicles, diaries and memoirs, and so they constructed a summarization corpus for historical news in German and Chinese, dubbed “HISTSUMM,” with help from human experts in the field.
The team employed two state-of-the-art baselines for standard cross-lingual summarization and conducted extensive evaluations: automatic evaluation using the standard ROUGE metric, and human evaluation based on informativeness, conciseness, fluency and currentness. The results show the proposed models are comparable to or slightly better than the baseline approaches for German and are superior by large margins for Chinese. The researchers say the new model also provides a benchmark for future studies in this field.
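For readers unfamiliar with ROUGE, the following is a minimal sketch of how an n-gram overlap score of this kind can be computed between a generated summary and a reference summary. It is not the evaluation code used in the paper: the function name `rouge_n`, the whitespace tokenization and the example sentences are all assumptions made for illustration, and published results typically rely on an established ROUGE implementation with language-appropriate tokenization (for Chinese, often character-level).

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) for n-gram overlap between two token lists.

    Hypothetical helper for illustration only; not the paper's evaluation code.
    """
    def ngrams(tokens):
        # Count every contiguous run of n tokens.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example sentences, tokenized on whitespace (an assumption).
reference = "the king signed the treaty".split()
candidate = "the king signed a treaty".split()

precision, recall, f1 = rouge_n(candidate, reference, n=1)
print(f"ROUGE-1 precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy example, four of the five candidate unigrams also appear in the reference, so unigram precision and recall are both 0.8. Passing `n=2` gives the bigram variant of the same overlap measure.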
The paper Summarising Historical Text in Modern Languages is on arXiv, and the associated code and data are on the project GitHub.
Analyst: Reina Qi Wan | Editor: Michael Sarazen