Publication

A sentiment analysis dataset for code-mixed Malayalam-English

Chakravarthi, Bharathi Raja
Jose, Navya
Suryawanshi, Shardul
Sherly, Elizabeth
McCrae, John P.
Citation
Chakravarthi, Bharathi Raja, Jose, Navya, Suryawanshi, Shardul, Sherly, Elizabeth, & McCrae, John P. (2020). A sentiment analysis dataset for code-mixed Malayalam-English. Paper presented at the Language Resources and Evaluation Conference (LREC 2020), 1st Joint Workshop of SLTU (Spoken Language Technologies for Under-resourced languages) and CCURL (Collaboration and Computing for Under-Resourced Languages), Marseille, France, 11-16 May.
Abstract
There is an increasing demand for sentiment analysis of social media text, which is mostly code-mixed. Systems trained on monolingual data fail on code-mixed data because of the complexity of mixing at different levels of the text. However, very few resources are available for building models specific to code-mixed data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets are available, for popular language pairs such as English-Spanish, English-Hindi, and English-Chinese. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed Malayalam-English text, annotated by voluntary annotators. The corpus obtained a Krippendorff’s alpha above 0.8. We use this new corpus to provide a benchmark for sentiment analysis of Malayalam-English code-mixed text.
Publisher
European Language Resources Association (ELRA)
Rights
Attribution-NonCommercial-NoDerivs 3.0 Ireland