An Application of MDL Principle for Indian Resource Poor Language


Miral Patel
Apurva Shah

Abstract

A stemmer is an important and necessary module of any morphological system. Stemming is language dependent: it separates a given word into a stem and a suffix. Despite notable growth in the field, morphological-level work for resource-poor Indian languages such as Sanskrit, Assamese, Bengali, Bishnupriya, Manipuri, and Bodo has received little attention. Standard resources (corpora, data sets) for experiments are very scarce for such languages. Many well-known unsupervised approaches have been tested on European languages only, so it is necessary to see how well such approaches work for other inflectional, resource-poor languages. In this study, the Minimum Description Length (MDL) principle is applied to Sanskrit, a resource-poor and inflectional language. Initially, every word in the corpus lexicon is split into substrings, and the frequency and length of each substring are calculated. The split with the highest probability is taken as the best split into stem and suffix. The process is then iterated for as long as the result improves. With a result of 72%, MDL works well for an Indian language. The MDL principle is then extended with a rule-based approach to improve the performance of the Sanskrit stemmer; this MDL-based hybrid approach improves the result by 17%. As no Sanskrit stemmer or evaluation is available for direct comparison, we compare our work with the Lovins, Porter, and Paice stemmers. Our word stemmed factor is higher than that of all three stemmers. Our results are also comparable to those of Gujarati and Punjabi stemmers. Stemmer strength is higher because the approach reduces under-stemming errors.
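The splitting step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The scoring function (substring length times log frequency, summed over stem and suffix), the function names, and the toy romanized word list are all assumptions made for illustration. The iterative refinement and the rule-based extension are omitted.

```python
from collections import Counter
import math


def candidate_splits(word):
    """Yield every (stem, suffix) split of a word; the suffix may be empty."""
    for i in range(1, len(word) + 1):
        yield word[:i], word[i:]


def count_substrings(words):
    """Count how many distinct words contain each candidate stem and suffix."""
    stems, suffixes = Counter(), Counter()
    for word in set(words):
        for stem, suffix in candidate_splits(word):
            stems[stem] += 1
            suffixes[suffix] += 1
    return stems, suffixes


def best_split(word, stems, suffixes):
    """Pick the split whose stem and suffix are both long and frequent.

    Assumed heuristic score: len(stem) * log(freq(stem))
                           + len(suffix) * log(freq(suffix)).
    The word must be one of the words passed to count_substrings.
    """
    def score(split):
        stem, suffix = split
        return (len(stem) * math.log(stems[stem])
                + len(suffix) * math.log(suffixes[suffix]))

    return max(candidate_splits(word), key=score)


# Hypothetical toy lexicon (simplified romanization, illustration only).
lexicon = ["ramah", "ramam", "ramena", "devah", "devam", "devena"]
stem_counts, suffix_counts = count_substrings(lexicon)
for w in lexicon:
    print(w, best_split(w, stem_counts, suffix_counts))
```

Weighting log frequency by length favours splits in which both parts recur across the lexicon, which is the intuition behind preferring a shorter overall description. A full MDL treatment would instead compare the total encoded size of the stem list, the suffix list, and the corpus.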


How to Cite
Patel, M., & Shah, A. (2017). An Application of MDL Principle for Indian Resource Poor Language. International Journal of Next-Generation Computing, 8(3), 186–197. https://doi.org/10.47164/ijngc.v8i3.132

References

  1. Amaresh Kumar Pandey, T. J. S. 2008. No Title. In AND '08: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data. 99-105.
  2. Ameta, J., Joshi, N., and Mathur, I. 2011. A Lightweight Stemmer for Gujarati. In Proceedings of 46th Annual National Convention of Computer Society of India.
  3. Bhadra, M., Singh, S. K., Kumar, S., Subash, Agrawal, M., Chandrasekhar, R., Mishra, S. K., and Jha, G. N. 2009. Sanskrit Analysis System (SAS). In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 1-20.
  4. Bhamidipati, N. L. and Pal, S. K. 2007. Stemming via distribution-based word segregation for classification and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 37, 2, 350-360.
  5. Briggs, R. 1985. Knowledge Representation in Sanskrit and Artificial Intelligence. AI Magazine 6, 1, 32-39.
  6. Caumanns, J. 1999. A Fast and Simple Stemming Algorithm for German Words. Technical Reports B 99/16, 10.
  7. Dolamic, L. and Savoy, J. 2009. Indexing and stemming approaches for the Czech language. Information Processing & Management 45, 6 (nov), 714-720.
  8. Goldsmith, J. 2001. Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics 27, 2 (jun), 153-198.
  9. Goyal, P., Huet, G., Kulkarni, A., Scharf, P., and Bunker, R. 2012. A Distributed Platform for Sanskrit Processing. In Proceedings of COLING 2012: Technical Papers. Mumbai, 1011-1028.
  10. Hammarstrom, H. and Borin, L. 2011. Unsupervised Learning of Morphology. Computational Linguistics 37, 2 (jun), 309-350.
  11. Huet, G. 2009. Formal structure of Sanskrit text: Requirements analysis for a mechanical Sanskrit processor. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).
  12. Jha, G. N., Agrawal, M., Subash, Mishra, S. K., Mani, D., Mishra, D., Bhadra, M., and Singh, S. K. 2009. Inflectional morphology analyzer for Sanskrit. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics).
  13. Kumar, D. and Rana, P. 2010. Design and Development of a Stemmer for Punjabi. International Journal of Computer Applications 11, 12, 18-23.
  14. Lovins, J. B. 1968. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics 11, 22-31.
  15. Majumder, P., Mitra, M., and Pal, D. 2008. Bulgarian, Hungarian and Czech Stemming Using YASS. In Advances in Multilingual and Multimodal Information Retrieval. Vol. 5152. Springer Berlin Heidelberg, Berlin, Heidelberg, 49-56.
  16. Mayfield, J. and McNamee, P. 2003. Single n-gram stemming. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '03. Vol. 1. ACM Press, New York, New York, USA, 415-416.
  17. McNamee, P. and Mayfield, J. 2007. N-Gram Morphemes for Retrieval. Working Notes for the CLEF 2007 Workshop, 19-21 September, Budapest, Hungary.
  18. Nehar, A., Ziadi, D., Cherroun, H., and Guellouma, Y. 2012. An efficient stemming for Arabic Text Classification. In 2012 International Conference on Innovations in Information Technology, IIT 2012. Abu Dhabi, 328-332.
  19. Paice, C. 2006. Stemming. In Encyclopedia of Language & Linguistics, K. Brown, Ed. Elsevier, 149-150.
  20. Paik, J. H., Mitra, M., Parui, S. K., and Jarvelin, K. 2011. Gras. ACM Transactions on Information Systems 29, 4 (nov), 1-24.
  21. Porter, M. 2001. Snowball: A language for stemming algorithms.
  22. Porter, M. F. 1980. The Porter Stemmer Algorithm. 14, 3, 130-137.
  23. Ramanathan, A. and Rao, D. D. 2003. A Lightweight Stemmer for Hindi. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL), on Computational Linguistics for South Asian Languages. BU.
  24. Saharia, N. 2010. A Suffix-based Noun and Verb Classifier for an Inflectional Language. 19-22.
  25. Saharia, N., Konwar, K. M., Sharma, U., and Kalita, J. K. 2013. An Improved Stemming Approach Using HMM for a Highly Inflectional Language.
  26. Sheth, J. R. and Patel, B. C. 2012. Stemming Techniques and Naive Approach for Gujarati Stemmer. In International Conference in Recent Trends in Information Technology and Computer Science (ICRTITCS - 2012) Proceedings published in International Journal of Computer Applications. IJCA, Chennai, 975-8887.
  27. Smirnov, I. 2008. Overview of stemming algorithms. Mechanical Translation, 1-8.
  28. Suba, K., Jiandani, D., and Bhattacharyya, P. 2011. Hybrid Inflectional Stemmer and Rule-based Derivational Stemmer for Gujarati. In Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), IJCNLP. Chiang Mai, 1-8.