Text-based Language Identifier using Multinomial Naïve Bayes Algorithm

Sunita Rawat; Lakshita Werulkar; Sagarika Jaywant

doi:10.47164/ijngc.v14i1.1024

Published Feb 15, 2023

https://doi.org/10.47164/ijngc.v14i1.1024

Volume 14, Special Issue 1, February 2023

Sunita Rawat

Shri Ramdeobaba College of Engineering and Management, Nagpur

Lakshita Werulkar

Shri Ramdeobaba College of Engineering and Management, Nagpur

Sagarika Jaywant

Shri Ramdeobaba College of Engineering and Management, Nagpur

Abstract

Language identification is a crucial step in any NLP-based application. Text-based documents and webpages are multiplying rapidly on the modern Internet, and documents written in languages from all over the world are available with a single click. A language identifier is therefore necessary to help the user interpret the content. Research on language identification has so far concentrated on European languages and remains limited for traditional Indian languages. Many researchers have become more interested in identifying similar languages as well as popular ones. In this paper, the Multinomial Naïve Bayes algorithm is used to detect three languages written in Devanagari (Marathi, Sanskrit and Hindi) and three European languages (French, Italian and English). An experiment on datasets for each language produced satisfactorily accurate results after training and testing the model.
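
As a rough illustration of the technique named in the abstract, the following is a minimal sketch of a Multinomial Naïve Bayes language identifier in plain Python. The abstract does not state the paper's features, smoothing, or datasets, so everything specific here is an assumption: character bigrams as features, add-one (Laplace) smoothing, equal treatment of all six languages, and a handful of made-up toy sentences in place of real corpora. The names `char_ngrams`, `MultinomialNB`, `TRAIN`, and `model` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class MultinomialNB:
    """Multinomial Naive Bayes over character n-gram counts (assumed features)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # add-alpha smoothing; alpha=1 is Laplace smoothing

    def fit(self, texts, labels):
        self.feature_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter(labels)           # label -> number of documents
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(char_ngrams(text))
        self.vocab = set()
        for counts in self.feature_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(n-gram | label) for each label."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = [g for g in char_ngrams(text) if g in self.vocab]  # skip unseen n-grams
        scores = {}
        for label, counts in self.feature_counts.items():
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training sentences (illustrative only, not the paper's datasets).
TRAIN = [
    ("the quick brown fox jumps over the lazy dog", "English"),
    ("where is the nearest railway station", "English"),
    ("le renard brun saute par-dessus le chien paresseux", "French"),
    ("où est la gare la plus proche", "French"),
    ("la volpe marrone salta sopra il cane pigro", "Italian"),
    ("dove si trova la stazione più vicina", "Italian"),
    ("निकटतम रेलवे स्टेशन कहाँ है", "Hindi"),
    ("मुझे हिंदी भाषा बहुत पसंद है", "Hindi"),
    ("जवळचे रेल्वे स्टेशन कुठे आहे", "Marathi"),
    ("मला मराठी भाषा खूप आवडते", "Marathi"),
    ("समीपस्थं रेलस्थानकं कुत्र अस्ति", "Sanskrit"),
    ("मह्यं संस्कृतभाषा बहु रोचते", "Sanskrit"),
]

model = MultinomialNB().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])

for query in ["where is the dog", "où est le chien", "मुझे रेलवे स्टेशन पसंद है"]:
    print(query, "->", model.predict(query))
```

With so little training text the sketch only separates queries that reuse training vocabulary. Telling Hindi, Marathi and Sanskrit apart reliably, since they share a script and many character sequences, would need far larger corpora and probably longer n-grams or word-level features.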

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Rawat, S., Werulkar, L., & Jaywant, S. (2023). Text-based Language Identifier using Multinomial Naïve Bayes Algorithm. International Journal of Next-Generation Computing, 14(1). https://doi.org/10.47164/ijngc.v14i1.1024

