Image Captioning Generator Text-to-Speech
Abstract
A system is proposed that can guide and support blind people while they travel on the road, using only a smartphone application. It works by first converting the scene in front of the user into text and then converting that text into voice output. Captions are generated by a deep-neural-network-based method: given an image as input, it produces an English sentence describing the image's contents. The user first issues a voice command, after which the camera or webcam captures a snapshot. This image is fed to the image-caption-generator model, which produces a caption for it. The caption text is then converted to speech, yielding a spoken description of the image.
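The pipeline described above (voice trigger, webcam snapshot, caption generation, text-to-speech) can be sketched as follows. The library choices here are illustrative assumptions, not the authors' actual stack: OpenCV stands in for the camera capture, a pretrained BLIP model from Hugging Face stands in for the caption generator, and pyttsx3 stands in for the text-to-speech stage.

```python
def capture_frame():
    # Take one webcam snapshot (assumes a camera at device index 0).
    import cv2
    cam = cv2.VideoCapture(0)
    ok, frame = cam.read()
    cam.release()
    if not ok:
        raise RuntimeError("camera capture failed")
    return frame


def caption_frame(frame):
    # Caption the frame with a pretrained BLIP model -- an illustrative
    # stand-in for the caption generator described in the abstract.
    import cv2
    from PIL import Image
    from transformers import BlipForConditionalGeneration, BlipProcessor

    processor = BlipProcessor.from_pretrained(
        "Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained(
        "Salesforce/blip-image-captioning-base")
    # OpenCV frames are BGR; convert to the RGB order PIL expects.
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    inputs = processor(images=image, return_tensors="pt")
    ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(ids[0], skip_special_tokens=True)


def speak(text):
    # Read the caption aloud with offline text-to-speech.
    import pyttsx3
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()


def describe_scene(capture_fn=capture_frame,
                   caption_fn=caption_frame,
                   speak_fn=speak):
    # One cycle of the pipeline: capture -> caption -> speak.
    # The stages are injectable so each can be swapped or tested in isolation.
    frame = capture_fn()
    caption = caption_fn(frame)
    speak_fn(caption)
    return caption
```

Because the three stages are passed in as functions, the same `describe_scene` wrapper works with a smartphone camera backend or a different captioning model without changing the control flow.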
This work is licensed under a Creative Commons Attribution 4.0 International License.