Publication

Cross-lingual natural language processing with linguistic typology knowledge

Choudhary, Chinmay
Abstract
State-of-the-art approaches to most Natural Language Processing (NLP) tasks have achieved near-human performance. This recent progress has positively impacted millions of lives and businesses around the world. However, these are neural-network-based supervised approaches that must be trained on large, manually annotated datasets. Such datasets are available for only a handful of high-resource languages (less than 1% of the world’s languages). Hence, most of the world’s population is still excluded from the benefits of NLP. The most promising class of approaches proposed by researchers to address this issue of data sparsity in low-resource languages is Cross-lingual Model Transfer. These approaches typically involve training a neural-network model on a high-resource language, called the Source language, and adapting it to a low-resource language, called the Target language, using cross-lingual/multilingual word representations. Although Cross-lingual Model Transfer approaches outperform all other types of approaches to various NLP tasks for low-resource languages (such as Cross-lingual Data Transfer approaches, Unsupervised approaches etc.), they still significantly underperform fully supervised approaches trained on abundant data. In this work we utilised the linguistic typology knowledge available in various open-source typology databases to improve the performance of state-of-the-art Cross-lingual Model Transfer approaches to four key intermediate NLP tasks, namely Constituency Parsing, Dependency Parsing, Enhanced Dependency Parsing and Semantic Role Labelling. Linguistic typology is the field of linguistics that aims to study and classify all the world’s languages based on their syntactic, semantic and phonological properties. There are numerous publicly available typology databases, such as WALS, URIEL, ValPal etc., that provide a taxonomy of typological features and their possible values, as well as the specific feature values for each language.
These databases have been built from the contributions of numerous linguists over decades, primarily to study the similarities and distinctions among the world’s languages. However, in this work we argue that this typology knowledge can also be utilised by cross-lingual transfer (CLT) models to improve their performance. Thus, we propose and evaluate novel cross-lingual approaches to numerous NLP tasks that utilise typology knowledge. We also propose and evaluate various frameworks for injecting the typology knowledge available in open-source databases into modern neural-network architectures.
Publisher
NUI Galway
Rights
Attribution-NonCommercial-NoDerivs 3.0 Ireland
CC BY-NC-ND 3.0 IE