GWO Optimized K-Means Cluster based Oversampling Algorithm


Santha Subbulaxmi S
Arumugam G

Abstract

Skewed data distributions prevail in many real-world applications. The skewness arises from imbalance in the class distribution, and it deteriorates the performance of traditional classification algorithms. In this paper, we propose a Grey Wolf Optimized K-Means cluster-based oversampling algorithm to handle the skewness and solve the imbalanced-data classification problem. Experiments are conducted on the proposed algorithm, and its results are compared with those of popular benchmark algorithms. The results reveal that the proposed algorithm outperforms the benchmark algorithms.
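The abstract does not spell out the algorithm's steps, but the general family it builds on can be illustrated: cluster the minority class with K-Means, then generate synthetic points by interpolating between samples inside each cluster. The sketch below is a minimal generic illustration, not the paper's GWO-tuned method; the function name, parameters, and the plain (untuned) K-Means loop are all assumptions for demonstration.

```python
import numpy as np

def cluster_oversample(X_min, k=3, n_new=30, seed=0):
    """Generic cluster-based oversampling sketch (hypothetical helper,
    NOT the paper's GWO-optimized algorithm): run plain k-means on the
    minority class, then interpolate between random pairs per cluster."""
    rng = np.random.default_rng(seed)
    # -- plain k-means on the minority samples --
    centers = X_min[rng.choice(len(X_min), k, replace=False)]
    for _ in range(20):
        labels = np.argmin(((X_min[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            pts = X_min[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    # -- SMOTE-style interpolation restricted to a single cluster --
    synthetic = []
    for _ in range(n_new):
        j = rng.integers(k)
        pts = X_min[labels == j]
        if len(pts) < 2:          # skip degenerate clusters
            continue
        a, b = pts[rng.choice(len(pts), 2, replace=False)]
        synthetic.append(a + rng.random() * (b - a))
    return np.array(synthetic)

# toy 2-D minority class drawn from three well-separated blobs
blob_rng = np.random.default_rng(1)
X_min = np.vstack([blob_rng.normal(loc, 0.1, (10, 2))
                   for loc in ([0, 0], [2, 2], [4, 0])])
X_syn = cluster_oversample(X_min, k=3, n_new=30)
print(X_syn.shape)  # up to 30 synthetic 2-D points
```

Clustering first keeps the synthetic points inside minority sub-regions instead of interpolating across unrelated modes; in the paper, the GWO metaheuristic is used to optimize this clustering step.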


How to Cite
S, S. S., & G, A. (2021). GWO Optimized K-Means Cluster based Oversampling Algorithm. International Journal of Next-Generation Computing, 12(3), 343–355. https://doi.org/10.47164/ijngc.v12i3.694

References

  1. Alcalá-Fdez, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., and Herrera, F. 2011. KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing 17.
  2. Barella, V. H., Costa, E. P., and Carvalho, A. C. P. L. F. 2014. ClusterOSS: a new undersampling method for imbalanced learning. In Proc. of the 3rd Brazilian Conference on Intelligent Systems. Academic Press.
  3. Cano, A., Zafra, A., and Ventura, S. 2013. Weighted data gravitation classification for standard and imbalanced data. IEEE Transactions on Cybernetics 43, 6, 1672–1687.
  4. Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
  5. Chawla, N. V., Lazarevic, A., Hall, L. O., and Bowyer, K. W. 2003. SMOTEBoost: improving prediction of the minority class in boosting. In European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 107–119.
  6. Cho, B. H., Yu, H., Kim, K.-W., Kim, T. H., Kim, I. Y., and Kim, S. I. 2008. Application of irregular and unbalanced data to predict diabetic nephropathy using visualization and feature selection methods. Artificial Intelligence in Medicine 42, 1, 37–53.
  7. Cieslak, D. A., Hoens, T. R., Chawla, N. V., and Kegelmeyer, W. P. 2012. Hellinger distance decision trees are robust and skew-insensitive. Data Mining and Knowledge Discovery 24, 1, 136–158.
  8. Dai, D. and Hua, S. 2016. Random under-sampling ensemble methods for highly imbalanced rare disease classification. In Proceedings of the International Conference on Data Science (ICDATA), 54.
  9. Datta, S. and Das, S. 2015. Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs. Neural Networks 70, 39–52.
  10. Galar, M., Fernández, A., Barrenechea, E., and Herrera, F. 2013. EUSBoost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognition 46, 12, 3460–3471.
  11. Gamal, D., Alfonse, M., El-Horbaty, E.-S. M., and Salem, A.-B. M. 2019. Twitter benchmark dataset for Arabic sentiment analysis. Int. J. Mod. Educ. Comput. Sci. 11, 1, 33.
  12. Guo, H. and Viktor, H. L. 2004. Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach. ACM SIGKDD Explorations Newsletter 6, 1, 30–39.
  13. Hanifah, F. S., Wijayanto, H., and Kurnia, A. 2015. SMOTEBagging algorithm for imbalanced dataset in logistic regression analysis (case: credit of Bank X). Applied Mathematical Sciences 9, 138, 6857–6865.
  14. Hassib, E. M., El-Desouky, A. I., El-Kenawy, E.-S. M., and El-Ghamrawy, S. M. 2019. An imbalanced big data mining framework for improving optimization algorithms performance. IEEE Access 7, 170774–170795.
  15. Kim, H.-J., Jo, N.-O., and Shin, K.-S. 2016. Optimization of cluster-based evolutionary undersampling for the artificial neural networks in corporate bankruptcy prediction. Expert Systems with Applications 59, 226–234.
  16. Laurikkala, J. 2001. Improving identification of difficult small classes by balancing class distribution. In Conference on Artificial Intelligence in Medicine in Europe. Springer, 63–66.
  17. Li, C. 2007. Classifying imbalanced data using a bagging ensemble variation (BEV). In Proceedings of the 45th Annual Southeast Regional Conference, 203–208.
  18. Liu, X.-Y., Wu, J., and Zhou, Z.-H. 2008. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 2, 539–550.
  19. Mahalingam, B., Kannan, S., and Gurusamy, V. 2017. Enrichment of ensemble learning using k-modes random sampling. ICTACT Journal on Soft Computing 8, 1.
  20. Martín-Díaz, I., Morínigo-Sotelo, D., Duque-Pérez, O., and Romero-Troncoso, R. de J. 2016. Early fault detection in induction motors using AdaBoost with imbalanced small data and optimized sampling. IEEE Transactions on Industry Applications 53, 3, 3066–3075.
  21. Menardi, G. and Torelli, N. 2014. Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery 28, 1, 92–122.
  22. Moepya, S. O., Akhoury, S. S., and Nelwamondo, F. V. 2014. Applying cost-sensitive classification for financial fraud detection under high class-imbalance. In 2014 IEEE International Conference on Data Mining Workshop. IEEE, 183–192.
  23. Ng, W. W., Hu, J., Yeung, D. S., Yin, S., and Roli, F. 2014. Diversified sensitivity-based undersampling for imbalance classification problems. IEEE Transactions on Cybernetics 45, 11, 2402–2412.
  24. Priya, S. and Manavalan, R. 2018. Optimum parameters selection using bacterial foraging optimization for weighted extreme learning machine. ICTACT Journal on Soft Computing 8, 4.
  25. Ramentol, E., Caballero, Y., Bello, R., and Herrera, F. 2012. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowledge and Information Systems 33, 2, 245–265.
  26. Schubach, M., Re, M., Robinson, P. N., and Valentini, G. 2017. Imbalance-aware machine learning for predicting rare and common disease-associated non-coding variants. Scientific Reports 7, 1, 1–12.
  27. Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., and Napolitano, A. 2009. RUSBoost: a hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 40, 1, 185–197.
  28. Tao, D., Tang, X., Li, X., and Wu, X. 2006. Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 7, 1088–1099.
  29. Thammasakorn, C., Chiewchanwattana, S., and Sunat, K. 2018. Optimizing weighted ELM based on grey wolf optimizer for imbalanced data classification. In 2018 10th International Conference on Information Technology and Electrical Engineering (ICITEE). IEEE, 512–517.
  30. Ting, K. M. 2002. An instance-weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering 14, 3, 659–665.
  31. Tomek, I. 1976a. An experiment with the edited nearest-neighbor rule.
  32. Tomek, I. 1976b. Two modifications of CNN.
  33. Wang, S. and Yao, X. 2009. Diversity analysis on imbalanced data sets by using ensemble models. In 2009 IEEE Symposium on Computational Intelligence and Data Mining. IEEE, 324–331.
  34. Wei, H., Sun, B., and Jing, M. 2014. BalancedBoost: a hybrid approach for real-time network traffic classification. In 2014 23rd International Conference on Computer Communication and Networks (ICCCN). IEEE, 1–6.
  35. Wei, W., Li, J., Cao, L., Ou, Y., and Chen, J. 2013. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web 16, 4, 449–475.
  36. Xu, Y., Yang, Z., Zhang, Y., Pan, X., and Wang, L. 2016. A maximum margin and minimum volume hyperspheres machine with pinball loss for imbalanced data classification. Knowledge-Based Systems 95, 75–85.
  37. Yen, S.-J. and Lee, Y.-S. 2006. Under-sampling approaches for improving prediction of the minority class in an imbalanced dataset. In Intelligent Control and Automation. Springer, 731–740.
  38. Yen, S.-J. and Lee, Y.-S. 2009. Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 36, 3, 5718–5727.
  39. Yu, H., Ni, J., and Zhao, J. 2013. ACOSampling: an ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data. Neurocomputing 101, 309–318.