Arabic is the fourth most-used language on the Internet, and its growing presence on social media is providing ample resources for studying Arabic-language online communities at scale. There are, however, few tools currently available that can derive valuable insights from this data for decision-making, guiding policies, aiding responses, and so on. Is that about to change?
The performance of natural language processing (NLP) systems has improved dramatically on tasks such as reading comprehension and natural language inference, and these advances have opened many new application scenarios for the technology. Unsurprisingly, most NLP R&D has focused on English. Now, a team of researchers from the Natural Language Processing Lab at the University of British Columbia in Canada has proposed AraNet, a deep learning toolkit designed for Arabic social media processing.
AraNet includes identification tools that can predict age, dialect, gender, emotion, irony, and sentiment from social media texts. AraNet is built on the framework of Google's new BERT-Base Multilingual Cased model, which was trained on 104 languages, including Arabic, and was recommended for the job by the BERT team.
The neural network-based pretraining technique for NLP can be easily fine-tuned on a wide range of sentence-level and token-level tasks. This trait suits researchers' need to exploit the extensive collection of accessible social media datasets, mostly sourced from Twitter, to train their models. The datasets used for sentiment analysis were the only exception.
To train the models to predict age and gender, for example, the researchers adopted two datasets. The large-scale, multi-dialectal corpus Arap-Tweet includes tweets from 11 regions and 16 countries in the Arab world, representing a wide range of Arabic dialects. The researchers also created their own Twitter dataset for gender, collecting 69,509 tweets from 528 male users and 67,511 tweets from 528 female users across 21 Arabic-speaking countries.
To perform sentiment analysis, the researchers used 15 datasets containing MSA (Modern Standard Arabic) and various regional dialects. Although the datasets involve different types of sentiment analysis tasks, such as binary classification, three-way classification, and subjective language detection, the researchers combined them all for binary sentiment classification.
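Combining datasets with different label schemes for a single binary task generally means mapping each dataset's native labels onto a shared positive/negative scheme and discarding examples that carry no polarity. The following is a minimal illustrative sketch of that idea; the label names, mapping choices, and `harmonize` helper are assumptions for illustration, not the AraNet authors' actual preprocessing code.

```python
# Hypothetical sketch: mapping heterogeneous sentiment labels onto one
# binary (pos/neg) scheme, in the spirit of combining 15 datasets for
# binary sentiment classification. Label names here are illustrative.

# Per-dataset mappings to {"pos", "neg"}; labels mapped to None
# (e.g. neutral or objective) are dropped from the combined set.
LABEL_MAPS = {
    "three_way": {"positive": "pos", "negative": "neg", "neutral": None},
    "binary": {"pos": "pos", "neg": "neg"},
    "subjectivity": {"subj_pos": "pos", "subj_neg": "neg", "objective": None},
}

def harmonize(examples, dataset_name):
    """Map (text, native_label) pairs to the unified binary scheme."""
    mapping = LABEL_MAPS[dataset_name]
    out = []
    for text, label in examples:
        unified = mapping.get(label)
        if unified is not None:  # discard non-polar examples
            out.append((text, unified))
    return out

# Merge examples from differently labeled datasets into one training set.
combined = []
combined += harmonize([("tweet A", "positive"), ("tweet B", "neutral")], "three_way")
combined += harmonize([("tweet C", "neg")], "binary")
print(combined)  # the neutral example is dropped
```

The key design choice is dropping rather than relabeling non-polar classes, which keeps the combined task a clean two-way classification problem.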
The researchers did not explicitly compare their baseline models for some of the tasks to previous research, explaining “most existing works either exploit smaller data (and so it will not be a fair comparison), or use methods pre-dating BERT (and so will likely be outperformed by our models).”
It’s believed AraNet’s unified framework based on the BERT model will enable future studies to more easily implement various NLP tasks targeting Arabic social media and generate insightful observations. More importantly, the researchers hope the toolkit can provide a gateway to improved understanding of contemporary Arabic online communities.
Although the complexity of the language and other challenges remain for Arabic NLP, the project is expected to bring additional academic attention and advancements to this research field.
The paper AraNet: A Deep Learning Toolkit for Arabic Social Media is on arXiv.
Journalist: Fangyu Cai | Editor: Michael Sarazen