Investigation of Available Datasets and Techniques for Visual Question Answering

Lata Bhavnani
Dr. Narendra Patel

Abstract

Visual Question Answering (VQA) is an emerging AI research problem that combines computer vision, natural language processing, and knowledge representation and reasoning (KR). Given an image and a question about that image as input, it requires analysis of the visual components of the image, the type of question, and common sense or general knowledge to predict the right answer. VQA is useful in many real-world applications, such as assistance for blind people, autonomous driving, and trivial tasks like spotting empty tables in hotels, parks, or picnic places. Since its introduction in 2014, many researchers have applied different techniques to visual question answering, and several datasets have been introduced. This paper presents an overview of the datasets and evaluation metrics used in the VQA area, and then surveys the techniques applied in the VQA domain, categorized by the mechanism they use. Based on this detailed discussion and a performance comparison, we identify various challenges in the VQA domain and provide directions for future work.
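As a concrete illustration of the evaluation metrics surveyed here, the sketch below implements the consensus-based accuracy metric popularized by the VQA benchmark (Antol et al., 2015), in its commonly used simplified form min(#matching annotators / 3, 1). This is a minimal sketch for illustration only: the official evaluation additionally normalizes answers (lowercasing, removing punctuation and articles) and averages over annotator subsets, both of which are omitted here.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Simplified VQA consensus accuracy: an answer gets full credit
    if at least 3 of the (typically 10) human annotators gave it,
    and proportional credit otherwise."""
    matches = sum(1 for ans in human_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Example: 10 annotators, only 2 of whom answered "red",
# so the prediction "red" receives partial credit of 2/3.
humans = ["red"] * 2 + ["maroon"] * 8
print(vqa_accuracy("red", humans))  # ~0.667
```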

How to Cite
Bhavnani, L., & Patel, D. N. (2023). Investigation of Available Datasets and Techniques for Visual Question Answering. International Journal of Next-Generation Computing, 14(3). https://doi.org/10.47164/ijngc.v14i3.767
