Comparative Study on Image Captioning
Abstract
Image captioning is the task of generating a correct, well-structured description of an image. Nowadays, machines can describe images in natural language that is grammatically correct. In this paper, we review existing image captioning techniques in detail, organized into different categories together with their datasets and results. We focus on deep learning (neural network) methods that produce grammatically and syntactically correct captions. We also compare and review the different datasets, such as MS COCO and Flickr8K, and the results of the different image captioning techniques.
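Most of the deep learning methods surveyed here share an encoder-decoder design: a convolutional network encodes the image into a feature vector, and a recurrent network decodes that vector into a word sequence. The PyTorch sketch below is a minimal illustration of that pattern only; the toy convolutional encoder, class name, and layer sizes are illustrative assumptions, standing in for the pretrained CNNs (e.g., VGG-style networks) that published systems actually use.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder captioner (a sketch of the
    classic encoder-decoder pattern; all sizes are illustrative)."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Toy convolutional encoder; real systems would typically use a
        # pretrained network such as VGG-16 instead.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature acts as the first "word" fed to the LSTM.
        feats = self.encoder(images).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                # (B, T, E)
        inputs = torch.cat([feats, words], dim=1)   # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                      # (B, T+1, V)

# Training-step sketch: predict each next token with cross-entropy.
model = CaptionModel(vocab_size=1000)
images = torch.randn(4, 3, 224, 224)                # dummy image batch
captions = torch.randint(0, 1000, (4, 12))          # dummy token ids
logits = model(images, captions)
loss = nn.CrossEntropyLoss()(
    logits[:, :-1].reshape(-1, 1000),               # step t predicts word t
    captions.reshape(-1),                           # next-token targets
)
print(loss.item())
```

At test time, the teacher-forced caption input would be replaced by greedy or beam search over the decoder's own predictions, one word at a time.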
This work is licensed under a Creative Commons Attribution 4.0 International License.