
A survey of current datasets for code-switching research

Jose, Navya
Chakravarthi, Bharathi Raja
Suryawanshi, Shardul
Sherly, Elizabeth
McCrae, John P.
Jose, Navya, Chakravarthi, Bharathi Raja, Suryawanshi, Shardul, Sherly, Elizabeth, & McCrae, John P. (2020). A survey of current datasets for code-switching research. Paper presented at the 6th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 06-07 March, doi: 10.1109/ICACCS48705.2020.9074205.
Code-switching is a prevalent phenomenon in multilingual communities and in social media interaction. Over the past ten years, we have witnessed an explosion of code-switched data on social media that brings together languages ranging from low-resourced to high-resourced in the same text, sometimes written in a non-native script. This increases the demand for processing code-switched data to assist users in various natural language processing tasks such as part-of-speech tagging, named entity recognition, sentiment analysis, conversational systems, and machine translation. The available corpora for code-switching research have played a major role in advancing this area of research. In this paper, we propose a set of quality metrics to evaluate these datasets and categorize them accordingly.
Attribution-NonCommercial-NoDerivs 3.0 Ireland