Transformer-based image caption generation for news articles
We address the task of news-image captioning, which generates a description of an image given the image and its article body as input. The motivation is to generate captions for news images automatically; where needed, these can serve as reference captions for people writing news-image captions manually. This task is more challenging than conventional image captioning because it requires a joint understanding of image and text. We present an N-Gram model that integrates text and image modalities and attends to textual features from visual features in generating a caption. Experiments using automatic evaluation metrics and human evaluation show that the article text provides the primary information needed to reproduce news-image captions written by journalists. The results also demonstrate that the proposed model outperforms the state-of-the-art model. In addition, we confirm that visual features contribute to improving the quality of news-image captions. Finally, we present a website that takes an image and its associated article as input and generates a one-line caption for the image.
This work is licensed under a Creative Commons Attribution 4.0 International License.