A Review Of Video Captioning Methods

Dewarthi Mahajan; Sakshi Bhosale; Yash Nighot; Madhuri Tayal

doi:10.47164/ijngc.v12i5.458

Published Nov 26, 2021

Volume 12, Special Issue 5, November 2021

Abstract

Video captioning is the task of automatically generating a natural language sentence that summarises a video's content. It requires both modelling the temporal structure of the video and integrating that information into a natural language description. Its applications include assisting the visually impaired, video subtitling, and video surveillance. Research in this area has surged in recent years, driven by advances in deep learning for computer vision and natural language processing, the two fields that video captioning combines. In this study, we examine and analyse various approaches to the problem; compare benchmark datasets in terms of domain, repository size, and number of classes; and identify the benefits and drawbacks of evaluation metrics such as BLEU, METEOR, CIDEr, SPICE, and ROUGE.

This work is licensed under a Creative Commons Attribution 4.0 International License.

How to Cite

Mahajan, D., Bhosale, S., Nighot, Y., & Tayal, M. (2021). A Review Of Video Captioning Methods. International Journal of Next-Generation Computing, 12(5). https://doi.org/10.47164/ijngc.v12i5.458

