Transformer based image caption generation for news articles

Ashtavinayak Pande; Atul Pandey; Ayush Solanki; Chinmay Shanbhag; Manish Motghare

doi:10.47164/ijngc.v14i1.1033

Published Feb 15, 2023

Volume 14, Special Issue 1, February 2023

Ashtavinayak Pande

Shri Ramdeobaba College Of Engineering and Management, Nagpur

Atul Pandey

Shri Ramdeobaba College Of Engineering and Management, Nagpur

Ayush Solanki

Shri Ramdeobaba College Of Engineering and Management, Nagpur

Chinmay Shanbhag

Shri Ramdeobaba College Of Engineering and Management, Nagpur

Manish Motghare

Shri Ramdeobaba College Of Engineering and Management, Nagpur

Abstract

We address the task of news-image captioning, which generates a description of an image given the image and the body of its article as input. The aim is to automatically generate captions for news images that can, if needed, serve as reference captions when news-image captions are written manually. This task is more challenging than conventional image captioning because it requires a joint understanding of image and text. We present an N-Gram model that integrates text and image modalities and attends to textual features from visual features in generating a caption. Experiments based on automatic evaluation metrics and human evaluation show that the article text provides the primary information needed to reproduce news-image captions written by journalists. The results also demonstrate that the proposed model outperforms the state-of-the-art model. We also confirm that visual features contribute to improving the quality of news-image captions. Finally, we present a website that takes an image and its associated article as input and generates a one-line caption for the image.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Author Biographies

Atul Pandey, Shri Ramdeobaba College Of Engineering and Management, Nagpur

Atul Pandey is a student at Shri Ramdeobaba College Of Engineering and Management, Nagpur, and is pursuing his B.E. degree in Computer Science and Engineering.

Ayush Solanki, Shri Ramdeobaba College Of Engineering and Management, Nagpur

Ayush Solanki is a student at Shri Ramdeobaba College Of Engineering and Management, Nagpur, and is pursuing his B.E. degree in Computer Science and Engineering.

Chinmay Shanbhag, Shri Ramdeobaba College Of Engineering and Management, Nagpur

Chinmay Shanbhag is a student at Shri Ramdeobaba College Of Engineering and Management, Nagpur, and is pursuing his B.E. degree in Computer Science and Engineering.

Manish Motghare, Shri Ramdeobaba College Of Engineering and Management, Nagpur

Prof. Manish Motghare is working as an Assistant Professor in the Computer Science and Engineering Department at Shri Ramdeobaba College of Engineering and Management.


How to Cite

Pande, A., Pandey, A., Solanki, A., Shanbhag, C., & Motghare, M. (2023). Transformer based image caption generation for news articles. International Journal of Next-Generation Computing, 14(1). https://doi.org/10.47164/ijngc.v14i1.1033
