Improving Sentiment Analysis of Tamil-English Code-Mixed Sentences

Publication Date : Nov-18-2025

DOI: 10.70251/HYJR2348.36574580

Author(s) :

Mohnish Sivakumar

Volume/Issue :

Volume 3

Issue 6

(Nov - 2025)

Abstract :

This paper investigates sentiment analysis for Tamil-English code-mixed text, a common feature of social media communication in multilingual regions. Code-mixing in Romanized Tamil introduces challenges such as inconsistent spelling, variable transliteration, and noisy syntax that traditional models are not designed to handle. Five families of approaches were evaluated on sentiment classification using the FIRE-DravidianCodeMix 2020 (hereafter FIRE2020) dataset: lexicon-based methods, classical machine learning models, deep learning (LSTM), the multilingual transformer RemBERT, and hybrid approaches that combine lexicon-based features with machine learning models. Classical models such as Logistic Regression, Naive Bayes, and SVM achieved the most stable performance, reaching around 69% accuracy with weighted F1-scores near 0.60. Deep learning and transformer models offered no clear advantage: both LSTM and RemBERT performed slightly below the classical models, plateauing near 67% accuracy with weighted F1-scores around 0.54. These results indicate that lightweight statistical models remain the most reliable choice in noisy, resource-constrained code-mixed settings, while deep learning and transformer architectures require further adaptation to succeed.
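The abstract reports accuracy and weighted F1 side by side, and the two differ noticeably (for example, about 69% versus about 0.60). The following is a minimal sketch of how the two metrics are computed and how a gap of this kind can arise when predictions lean toward a majority class. The labels below are toy values invented for illustration, not data or results from FIRE2020, and the function names `accuracy` and `weighted_f1` are hypothetical helpers, not the author's code. The abstract does not state that class imbalance explains the reported gap; this only shows one way it can happen.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is never hit
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy, made-up labels (assumption): an imbalanced set where the classifier
# always predicts the majority class.
y_true = ["positive"] * 7 + ["negative"] * 2 + ["mixed"] * 1
y_pred = ["positive"] * 10

print(f"accuracy    = {accuracy(y_true, y_pred):.2f}")
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.2f}")
```

In this toy case accuracy looks reasonable because the majority class dominates, while weighted F1 is pulled down by the minority classes that are never predicted correctly. Reporting both metrics, as the abstract does, makes that kind of behavior visible.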

American Journal of Student Research®