Harnessing Code Switching to Transcend the Linguistic Barrier

Ashiqur R. KhudaBukhsh, Shriphani Palakodety, Jaime G. Carbonell

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
Special track on AI for CompSust and Human well-being. Pages 4366-4374. https://doi.org/10.24963/ijcai.2020/602

Code mixing (or code switching) is a common phenomenon observed in social-media content generated by a linguistically diverse user base. Studies show that in the Indian subcontinent, a substantial fraction of social media posts exhibit code switching. While the difficulties that code-mixed documents pose for downstream analyses are well understood, lending visibility to code-mixed documents under certain scenarios may have utility that has been previously overlooked. For instance, a document written in a mixture of multiple languages can be partially accessible to a wider audience; this could be particularly useful if a considerable fraction of the audience lacks fluency in one of the component languages. In this paper, we provide a systematic approach to sampling code-mixed documents, leveraging a polyglot-embedding-based method that requires minimal supervision. In the context of the 2019 India-Pakistan conflict triggered by the Pulwama terror attack, we demonstrate an untapped potential of harnessing code mixing for human well-being: starting from an existing hostility-diffusing hope speech classifier trained solely on English documents, code-mixed documents are utilized to perform cross-lingual sampling and retrieve hope speech content written in a low-resource but widely used language, Romanized Hindi. Our proposed pipeline requires minimal supervision and holds promise for substantially reducing web moderation efforts. A further exploratory study on a new COVID-19 data set introduced in this paper demonstrates the generalizability of our cross-lingual sampling technique.
Keywords:
Natural Language Processing: Natural Language Processing
Natural Language Processing: Information Retrieval
Natural Language Processing: Embeddings
Data Mining: Mining Text, Web, Social Media