Unsupervised deep representation learning for low-resourced languages and applications

Goswami, Koustava
Artificial Intelligence and Natural Language Processing (NLP) are becoming integral parts of modern technologies and amenities, from Amazon Alexa to automatic chatbots across different industries. Though earlier NLP algorithms were developed mainly for tasks in the English language, researchers have since adapted them to many languages around the globe. Despite many advancements in the NLP domain, these algorithms perform poorly in low-resource scenarios. State-of-the-art multilingual language models suffer from a lack of high-quality annotated training datasets for low-resource languages and applications. There is therefore a need to design efficient automatic NLP systems for low-resource languages and applications that achieve state-of-the-art performance on downstream NLP tasks. In this thesis, we introduce novel state-of-the-art deep models that capture global and contextual semantic representations of sentences in a document. We focus on building unsupervised deep models that efficiently exploit existing unlabelled datasets for feature extraction. Our contribution also includes designing state-of-the-art unsupervised sentence embedding models capable of performing a wide range of cross-lingual tasks in low-resource scenarios. We raise several research questions at the start of the thesis and provide answers supported by state-of-the-art experimental results. While training deep representation models for low-resource languages, one key challenge was recognizing closely related languages in a code-mixed corpus of short texts in the NLP pipeline. Existing language identification models support only a limited set of languages; we therefore propose a novel unsupervised language identification system that captures global sentence representations of the languages.
Our experimental results highlight the efficiency of the framework across different language families, including the challenging task of dialect identification. We then turn to exploring whether a cross-lingual contextual sentence embedding framework can be trained on non-parallel sentences. To this end, we designed an unsupervised sentence embedding framework trained with a dual-encoder architecture in a multitask setup. During training, we also inject word-level semantic knowledge to preserve the relative semantic distance between input sentences. We propose to train this model using a new machine-learning framework called the anchor-learner framework. In evaluations on several benchmarks, the unsupervised model outperformed some state-of-the-art supervised sentence embedding models. While we can now capture better cross-lingual contextual sentence representations, a large performance gap remains on natural language understanding tasks in comparison with large language models. Through our novel unsupervised embedding-space projection methodology, sentence embedding models learn the semantic understanding of a large language model while using far fewer structural parameters. This yields rich semantic sentence embeddings that produce better results on NLP classification tasks in low-resource domains and applications. Finally, we worked on improving sentence embeddings for closely related languages by injecting knowledge of cognates. Identifying cognates is a hard task, and for low-resource languages annotated cognate pairs are very difficult to obtain; we therefore propose a novel unsupervised cognate detection framework that transfers the grammatical and structural knowledge of the words of one language to closely related languages of the same language family.
Introducing these cognates into sentence embedding frameworks during training enhanced the efficacy of the models on cross-lingual tasks. Our contribution also includes several unsupervised loss functions that can be applied across different NLP domains and tasks. The methods and frameworks developed in this thesis can be applied to a wide range of low-resource languages and applications with minimal training data.
NUI Galway
Attribution-NonCommercial-NoDerivs 3.0 Ireland