Comparative Study on Image Captioning
Abstract
Image captioning is the task of generating a correct, well-structured description of an image. Nowadays, machines can describe images in natural language that is grammatically correct. In this paper, we review existing image captioning techniques in detail, organized into different categories together with their datasets and results. We focus on deep learning (neural network) methods that produce grammatically and syntactically correct captions. We also compare and review the different datasets, such as MS COCO and Flickr8K, and the results of the different image captioning techniques.
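Most of the deep learning methods surveyed here share an encoder-decoder design: a convolutional network encodes the image into a feature vector, and a recurrent network decodes that vector into a word sequence. The PyTorch sketch below is a minimal illustration of that pattern only; the toy convolutional encoder, class name, and layer sizes are illustrative assumptions, standing in for the pretrained CNNs (e.g., VGG-style networks) that published systems actually use.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder captioner (a sketch of the
    classic encoder-decoder pattern; all sizes are illustrative)."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Toy convolutional encoder; real systems would typically use a
        # pretrained network such as VGG-16 instead.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image feature acts as the first "word" fed to the LSTM.
        feats = self.encoder(images).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                # (B, T, E)
        inputs = torch.cat([feats, words], dim=1)   # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                      # (B, T+1, V)

# Training-step sketch: predict each next token with cross-entropy.
model = CaptionModel(vocab_size=1000)
images = torch.randn(4, 3, 224, 224)                # dummy image batch
captions = torch.randint(0, 1000, (4, 12))          # dummy token ids
logits = model(images, captions)
loss = nn.CrossEntropyLoss()(
    logits[:, :-1].reshape(-1, 1000),               # step t predicts word t
    captions.reshape(-1),                           # next-token targets
)
print(loss.item())
```

At test time, the teacher-forced caption input would be replaced by greedy or beam search over the decoder's own predictions, one word at a time.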
This work is licensed under a Creative Commons Attribution 4.0 International License.