InfoSift: A Novel, Mining-Based Framework for Document Classification


Sharma Chakravarthy
Aravind Venkatachalam
Aditya Telang
Manu Aery

Abstract

A number of approaches, including machine learning, probabilistic, and information retrieval, have been proposed for classifying/retrieving documents, where mainly words from the documents are used without considering any potential structural properties of the document. These techniques do not specifically exploit structural information that may be present in these documents, or the importance of groups of terms that co-occur in different parts of the documents and their relationships. However, many documents, such as emails, web pages, and text documents, have a basic structure that can be beneficially leveraged for the purposes of classification/retrieval. This paper proposes a novel, graph-based mining framework for document classification that takes into account the structure of a document. Our approach is based on the intuition that representative - common and recurring - structures or patterns (not just words) can be extracted from a pre-classified document class, and that similarity with these extracted patterns can be effectively used for classifying incoming/new documents. To the best of our knowledge, this approach has not been tried for the classification of text, email, or web pages (in general, documents). First, we establish the applicability of this approach by identifying a suitable graph mining technique. Next, we establish the relevant parameters and their derivation from corpus characteristics. The notion of inexact graph match is critical to our approach, both for extracting substructures and for identifying similar substructures. Second, we extend our approach to multi-class classification, which is essential for real-world applications. Ranking substructures globally (across classes) is needed for this purpose. A TF-IDF-like formula is proposed for the global ranking of substructures. Approaches for computing the components of the ranking formula are discussed, along with their computational challenges. Finally, extensive experimental analysis is carried out for three diverse document types (emails, text, and web pages) to demonstrate the generality and effectiveness of the proposed framework. We believe that we have established the efficacy of this framework for the classification of any input that exhibits some structure, and that it can be extended to other forms of input.
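The exact components of the TF-IDF-like ranking formula are developed in the paper body rather than in this abstract. Purely as an illustration of the shape such a formula takes, with symbols assumed here rather than taken from the paper, a global rank for a substructure s in class c could be written as

\[
\mathrm{rank}(s, c) \;=\; \frac{f_{s,c}}{\sum_{s' \in \mathcal{S}_c} f_{s',c}} \;\times\; \log\frac{N}{n_s},
\]

where \(f_{s,c}\) is the frequency of substructure \(s\) among the graphs of class \(c\), \(\mathcal{S}_c\) is the set of substructures mined from class \(c\), \(N\) is the number of classes, and \(n_s\) is the number of classes in which \(s\) occurs. A substructure that is frequent in one class but rare across classes thus ranks highly for that class.

The similarity-based classification step described above can likewise be sketched: a new document's graph is assigned to the class whose mined, globally ranked substructures it best matches under inexact graph match. The following minimal Python sketch illustrates that decision rule only; the graph encoding, the match threshold, and all names are illustrative assumptions, not the paper's implementation.

# A minimal sketch (not the paper's implementation) of the classification step
# described above: each class keeps the substructures mined from its training
# documents together with a global, TF-IDF-like rank, and a new document graph
# is assigned to the class whose substructures it matches best. The graph
# encoding, the match threshold, and all names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class RankedSubstructure:
    pattern: frozenset      # stand-in for a graph substructure (a set of labelled edges)
    global_rank: float      # TF-IDF-like rank of this substructure across classes

def inexact_match(pattern: frozenset, doc_graph: frozenset, threshold: float = 0.8) -> bool:
    # Toy inexact match: fraction of the pattern's edges found in the document graph.
    return bool(pattern) and len(pattern & doc_graph) / len(pattern) >= threshold

def classify(doc_graph: frozenset, model: dict) -> str:
    # Score each class by the ranks of its substructures that the document matches.
    scores = {label: sum(s.global_rank for s in subs if inexact_match(s.pattern, doc_graph))
              for label, subs in model.items()}
    return max(scores, key=scores.get)

# Example with two email folders, one mined substructure each.
model = {
    "work":     [RankedSubstructure(frozenset({("subject", "meeting"), ("from", "manager")}), 1.4)],
    "personal": [RankedSubstructure(frozenset({("subject", "party"), ("from", "friend")}), 1.1)],
}
new_email = frozenset({("subject", "meeting"), ("from", "manager"), ("body", "agenda")})
print(classify(new_email, model))   # prints "work"

In this toy run, the new email contains both edges of the "work" substructure, so the inexact match succeeds for that class and the email is filed under "work".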


How to Cite
Sharma Chakravarthy, Aravind Venkatachalam, Aditya Telang, & Manu Aery. (2014). InfoSift: A Novel, Mining-Based Framework for Document Classification. International Journal of Next-Generation Computing, 5(2), 84–122. https://doi.org/10.47164/ijngc.v5i2.62
