Comparative Study on Image Captioning


Hardik K Patel
Jagdish M Rathod

Abstract

Image captioning is the task of generating a correct, well-structured description of an image. Nowadays, machines can describe images and express those descriptions in grammatically correct language. In this paper, we review existing image captioning techniques in detail, organizing them into categories and surveying their datasets and results. We focus on deep learning (neural network) methods that produce grammatically and syntactically correct captions. We also compare and review the commonly used datasets, such as MS COCO and Flickr8k, and the results reported by different image captioning techniques.
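To make the encoder-decoder paradigm surveyed here concrete, the sketch below pairs a CNN image encoder with an LSTM caption decoder, in the spirit of "Show and Tell" (Vinyals et al., 2015). It is a minimal illustration only: the ResNet-18 backbone, embedding and hidden sizes, and vocabulary size are assumptions for demonstration, not the configuration of any specific reviewed method.

```python
# Minimal CNN-encoder / LSTM-decoder captioning sketch (PyTorch).
# All sizes and the backbone choice are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """CNN encoder: maps an image to a fixed-size feature vector."""
    def __init__(self, embed_size):
        super().__init__()
        resnet = models.resnet18(weights=None)  # pretrained weights would be used in practice
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)   # (B, 512)
        return self.fc(feats)                      # (B, embed_size)

class DecoderRNN(nn.Module):
    """LSTM decoder: emits word scores conditioned on the image feature."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Feed the image feature as the first time step, then the caption tokens.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                     # (B, T+1, vocab_size)

# Toy usage: random images and integer-coded captions (hypothetical 1000-word vocabulary).
encoder = EncoderCNN(embed_size=256)
decoder = DecoderRNN(embed_size=256, hidden_size=512, vocab_size=1000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 1000, (2, 10))
scores = decoder(encoder(images), captions)
print(scores.shape)  # torch.Size([2, 11, 1000])
```

In training, such a model would compare the word scores against the shifted ground-truth caption with cross-entropy loss; at inference, captions are generated one word at a time by greedy or beam search.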


How to Cite
Hardik K Patel, & Jagdish M Rathod. (2022). Comparative Study on Image Captioning. International Journal of Next-Generation Computing, 13(4). https://doi.org/10.47164/ijngc.v13i4.769
