Chinese is a logographic language that has evolved over thousands of years. Many Chinese characters have pictographic origins, making it possible for readers who don’t know a character to still guess its approximate meaning. The English alphabet, by contrast, encodes pronunciation rather than semantics, making it difficult or impossible to guess meaning from the shape of an English letter. However, most NLP methods applied to Chinese follow approaches developed for English, in which each word or character is simply mapped to a corresponding vector. This approach does not consider the rich information contained in Chinese glyphs.
To bridge the gap between Chinese NLP tasks and glyph information, Shannon.AI researchers developed Glyce: they surveyed historical Chinese characters in a variety of writing styles and designed CNN structures tailored to Chinese character image processing.
Although it seems intuitive that Chinese NLP tasks would benefit from the glyph information in Chinese characters, researchers first had to tackle two challenges: the lack of rich pictographic evidence in modern glyphs and the relatively weak ability of standard computer vision models to generalize on character images.
In modern Simplified Chinese script, pictographic information is lacking. Researchers therefore collected characters in historical scripts such as Bronzeware script (金文), Clerical script (隶书), Seal script (篆书) and Tablet script (魏碑), along with Traditional Chinese (繁体中文) and Simplified Chinese, all in a variety of writing styles. They then integrated this pictographic evidence into the model to enhance its generalization ability.
Deep CNNs are widely used in computer vision tasks, but researchers found that applying them directly to Chinese character images resulted in poor performance. One reason was scale: while ImageNet images are usually around 800*600 pixels, most Chinese character images in the database are far smaller, at an impractical 12*12. Another reason was the relative lack of training examples.
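To make the scale problem concrete, the minimal sketch below counts how many times an image can be halved before its shorter side collapses to a single pixel. The function name and the choice of floor-division halving (roughly what a stride-2 pooling layer does) are illustrative assumptions, not details taken from the paper.

```python
def halvings_until_one(height, width):
    """Count stride-2 downsampling steps until the shorter side reaches 1 pixel."""
    side = min(height, width)
    steps = 0
    while side > 1:
        side //= 2  # assumption: each step halves the side, rounding down
        steps += 1
    return steps


if __name__ == "__main__":
    print("12*12 glyph:", halvings_until_one(12, 12))
    print("800*600 photo:", halvings_until_one(600, 800))
```

Under these assumptions a 12*12 glyph survives about three halvings while an 800*600 image survives about nine, which is one way to see why deep stacks designed for large photos leave little to work with on small character images.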
Researchers turned to a childhood learning tool to help teach Glyce. While young English learners use the A-B-C Song to memorize the alphabet, Chinese learners use a common four-square formatting tool, “Tianzige” (田字格), to learn how to write Chinese characters correctly. Researchers designed a Tianzige-CNN structure tailored to Chinese character modeling.
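The article does not spell out the layers of the Tianzige-CNN, so the following is only an illustrative guess at how the four-square idea could appear as a pooling step: a square feature map is reduced to a 2x2 grid, one value per quadrant. The function name `tianzige_pool` and the toy input are hypothetical.

```python
def tianzige_pool(feature_map):
    """Max-pool a square 2D grid into a 2x2 grid, one value per quadrant."""
    size = len(feature_map)
    if size == 0 or size % 2 != 0 or any(len(row) != size for row in feature_map):
        raise ValueError("expected a non-empty square grid with an even side length")
    half = size // 2
    pooled = []
    for top in (0, half):          # upper half, then lower half
        pooled_row = []
        for left in (0, half):     # left half, then right half
            quadrant = [
                feature_map[r][c]
                for r in range(top, top + half)
                for c in range(left, left + half)
            ]
            pooled_row.append(max(quadrant))
        pooled.append(pooled_row)
    return pooled


if __name__ == "__main__":
    # A toy 4x4 "glyph" with one bright stroke pixel in each quadrant.
    glyph = [
        [0, 1, 0, 0],
        [0, 0, 2, 0],
        [0, 0, 0, 0],
        [3, 0, 0, 4],
    ]
    print(tianzige_pool(glyph))
```

In a real network a step like this would typically follow one or more convolution layers; the sketch isolates only the four-square reduction.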
Last but not least, the team used image classification as an auxiliary task to prevent overfitting. This multi-task learning setup improved the model’s ability to generalize. Shannon.AI researchers report that Glyce’s glyph-based models can outperform word/char ID-based models in the following Chinese NLP tasks:
1. Character-level language modeling
2. Word-level language modeling
3. Chinese word segmentation
4. Named entity recognition
5. Part-of-speech tagging
6. Dependency parsing
7. Semantic role labeling
8. Sentence semantic similarity
9. Sentence intention identification
10. Chinese-English machine translation
11. Sentiment analysis
12. Document classification
13. Discourse parsing
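The multi-task setup mentioned before the task list, in which image classification acts as an auxiliary task, could be sketched as a weighted blend of two losses. The article says only that an auxiliary task was used; the linear blending formula, the names `multitask_loss` and `aux_weight`, and the use of softmax cross-entropy for both terms are assumptions made for illustration.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under a softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_index]


def multitask_loss(task_logits, task_label, glyph_logits, char_id, aux_weight):
    """Blend the main NLP task loss with an auxiliary image-classification loss.

    aux_weight is a hypothetical mixing coefficient in [0, 1]: 0 trains on the
    main task only, 1 trains only on predicting the character from its image.
    """
    if not 0.0 <= aux_weight <= 1.0:
        raise ValueError("aux_weight must be in [0, 1]")
    task_loss = cross_entropy(task_logits, task_label)
    aux_loss = cross_entropy(glyph_logits, char_id)
    return (1.0 - aux_weight) * task_loss + aux_weight * aux_loss


if __name__ == "__main__":
    loss = multitask_loss(
        task_logits=[2.0, 0.5, -1.0], task_label=0,
        glyph_logits=[0.1, 0.2, 3.0, 0.0], char_id=2,
        aux_weight=0.1,
    )
    print(f"combined loss: {loss:.4f}")
```

The intuition is that asking the glyph encoder to also identify which character an image shows gives it an extra training signal, which is how an auxiliary task can discourage overfitting when examples are scarce.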
The success of Glyce suggests that glyph information is a promising direction for future research on NLP tasks for Chinese and other logographic languages.
Founded in 2017, Shannon.AI is a Beijing-based artificial intelligence company focused on the financial sector. The paper Glyce: Glyph-vectors for Chinese Character Representations is on arXiv.
Journalist: Fangyu Cai | Editor: Michael Sarazen