A Review Of Video Captioning Methods


Dewarthi Mahajan
Sakshi Bhosale
Yash Nighot
Madhuri Tayal


Video captioning is the task of automatically generating a natural language sentence that summarises the content of a video. It requires both modelling the video's temporal structure effectively and integrating that information into a fluent natural language description. It has a variety of applications, including assisting the visually impaired, video subtitling, and video surveillance. Owing to advances in deep learning for computer vision and natural language processing, research in this area has surged in recent years; video captioning sits at the intersection of these two fields. In this study, we examine and analyse various strategies for addressing the task, compare benchmark datasets in terms of domain, repository size, and number of classes, and identify the benefits and drawbacks of evaluation metrics such as BLEU, METEOR, CIDEr, SPICE, and ROUGE.


How to Cite
Mahajan, D., Bhosale, S., Nighot, Y., & Tayal, M. (2021). A Review Of Video Captioning Methods. International Journal of Next-Generation Computing, 12(5). https://doi.org/10.47164/ijngc.v12i5.458


References
  1. Anderson, P., Fernando, B., Johnson, M., and Gould, S. 2016. SPICE: Semantic propositional image caption evaluation.
  2. Banerjee, S. and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.
  3. Chen, S., Yao, T., and Jiang, Y.-G. 2019. Deep learning for video captioning: A review. IJCAI.
  4. Chen, D. and Dolan, W. 2011. Collecting highly parallel data for paraphrase evaluation. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics.
  5. Iashin, V. and Rahtu, E. 2020. Multi-modal dense video captioning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops.
  6. Kojima, A., Tamura, T., and Fukunaga, K. 2002. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision.
  7. Lin, C.-Y. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out (ACL Workshop).
  8. Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. 2002. BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
  9. Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., and Pinkal, M. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics.
  10. Rohrbach, A., Rohrbach, M., Tandon, N., and Schiele, B. 2015. A dataset for movie description. IEEE Conference on Computer Vision and Pattern Recognition, pages 3202–3212.
  11. Sigurdsson, G., Varol, G., Wang, X., Farhadi, A., Laptev, I., and Gupta, A. 2016. Hollywood in homes: crowdsourcing data collection for activity understanding. Proceedings of the European Conference on Computer Vision.
  12. Vedantam, R., Zitnick, C. L., and Parikh, D. 2015. CIDEr: Consensus-based image description evaluation. IEEE CVPR.
  13. Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., and Saenko, K. 2014. Translating videos to natural language using deep recurrent neural networks.
  14. Xu, J., Mei, T., Yao, T., and Rui, Y. 2016. MSR-VTT: A large video description dataset for bridging video and language. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 5288–5296. IEEE.
  15. Xu, R., Xiong, C., Chen, W., and Corso, J. J. 2015. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework.
  16. Yu, H., Wang, J., Huang, Z., Yang, Y., and Xu, W. 2016. Video paragraph captioning using hierarchical recurrent neural networks. IEEE conference on computer vision and pattern recognition.