Image Captioning Generator Text-to-Speech


Sharma Tripti
Neetu Anand
Kumar Gaurav
Rohit Kapoor

Abstract

A model is created for blind people that can guide and support them while traveling on roads, using only a smartphone application. This is accomplished by first converting the scene in front of the user into text and then converting that text into voice output. We present a method for generating image captions based on deep neural networks: given an image as input, the method produces an English sentence describing the contents of the image. The user first issues a voice command, after which a snapshot is captured by the camera or webcam. This image is then fed as input to the image caption generator model, which generates a caption for the image. Finally, the caption text is converted to speech, producing a voice message describing the image.
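A minimal sketch of this voice-command, snapshot, caption, speech pipeline follows; it is an illustration under stated assumptions, not the authors' implementation. It assumes OpenCV for the webcam capture, the speech_recognition package (Google Web Speech backend) for the voice command, a pretrained BLIP captioning model from Hugging Face transformers standing in for the paper's own caption generator, and pyttsx3 for offline text-to-speech.

```python
# Sketch of the pipeline: voice command -> snapshot -> caption -> speech.
# The paper does not name specific libraries; OpenCV, speech_recognition,
# a pretrained BLIP model, and pyttsx3 are illustrative stand-ins.
import cv2
import pyttsx3
import speech_recognition as sr
from transformers import pipeline

def wait_for_command(keyword: str = "capture") -> None:
    """Block until the user speaks the trigger keyword."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        while True:
            audio = recognizer.listen(source)
            try:
                if keyword in recognizer.recognize_google(audio).lower():
                    return
            except sr.UnknownValueError:
                continue  # unintelligible audio; keep listening

def capture_snapshot(path: str = "snapshot.jpg") -> str:
    """Grab a single frame from the default camera and save it."""
    camera = cv2.VideoCapture(0)
    ok, frame = camera.read()
    camera.release()
    if not ok:
        raise RuntimeError("Could not read a frame from the camera")
    cv2.imwrite(path, frame)
    return path

def main() -> None:
    wait_for_command()
    image_path = capture_snapshot()
    # Image -> English caption via an encoder-decoder captioning model.
    captioner = pipeline("image-to-text",
                         model="Salesforce/blip-image-captioning-base")
    caption = captioner(image_path)[0]["generated_text"]
    # Caption text -> spoken audio for the user.
    engine = pyttsx3.init()
    engine.say(caption)
    engine.runAndWait()

if __name__ == "__main__":
    main()
```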


How to Cite
Tripti, S., Anand, N., Gaurav, K., & Kapoor, R. (2022). Image Captioning Generator Text-to-Speech. International Journal of Next-Generation Computing, 13(3). https://doi.org/10.47164/ijngc.v13i3.669
