Text-based Language Identifier using Multinomial Naïve Bayes Algorithm
Language identification is a crucial step in many NLP applications. Text-based documents and webpages are rapidly increasing on the modern Internet, and documents written in many different languages from all over the world are available with a single click. A language identifier is therefore necessary to help users interpret this content. Language identification research has so far concentrated on European languages and remains rather limited for traditional Indian languages. At the same time, researchers have become increasingly interested in language identification for closely related languages. In this paper, the Multinomial Naïve Bayes algorithm is used to identify three languages written in the Devanagari script (Marathi, Sanskrit, and Hindi) and three European languages (French, Italian, and English). An experiment on datasets for each language produced satisfactorily accurate results after training and testing the model.
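The following is a minimal sketch of how a Multinomial Naïve Bayes language identifier of this kind might work. It is not the authors' implementation: the abstract does not specify the features or datasets, so the character-bigram features, the Laplace smoothing value `alpha=1.0`, the helper `char_ngrams`, the class name `MultinomialNB`, and the handful of toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class MultinomialNB:
    """Multinomial Naive Bayes over character n-gram counts (illustrative only)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n          # n-gram length (assumed, not from the paper)
        self.alpha = alpha  # Laplace smoothing constant (assumed)

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)            # documents per language
        self.feature_counts = defaultdict(Counter)   # n-gram counts per language
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text, self.n))
        self.vocab = set()
        for counts in self.feature_counts.values():
            self.vocab.update(counts)
        self.totals = {c: sum(k.values()) for c, k in self.feature_counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_docs.items():
            # log P(class) + sum of log P(n-gram | class), with smoothing
            score = math.log(doc_count / n_docs)
            denom = self.totals[label] + self.alpha * vocab_size
            for gram in char_ngrams(text, self.n):
                if gram not in self.vocab:
                    continue  # ignore n-grams never seen in training
                count = self.feature_counts[label][gram]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: three short sentences per language (illustrative only).
train = {
    "en": ["this is a book", "where is the house", "the water is cold"],
    "fr": ["ceci est un livre", "où est la maison", "l'eau est froide"],
    "it": ["questo è un libro", "dove è la casa", "l'acqua è fredda"],
    "hi": ["यह एक किताब है", "घर कहाँ है", "पानी ठंडा है"],
    "mr": ["हे एक पुस्तक आहे", "घर कुठे आहे", "पाणी थंड आहे"],
    "sa": ["एतत् पुस्तकम् अस्ति", "गृहं कुत्र अस्ति", "जलं शीतलम् अस्ति"],
}
texts = [s for sents in train.values() for s in sents]
labels = [lang for lang, sents in train.items() for _ in sents]
model = MultinomialNB().fit(texts, labels)

for sample in ["the house is cold", "la maison est froide", "घर कुठे आहे"]:
    print(sample, "->", model.predict(sample))
```

Character n-grams are used here because they need no tokenizer and behave the same way for Latin and Devanagari text. A real system would train on far larger corpora and would likely use an established library implementation rather than this hand-written class.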
This work is licensed under a Creative Commons Attribution 4.0 International License.