OCR-VQA: Visual Question Answering by Reading Text in Images (Research Paper Summary)

#ai #vqa #nlp
The problem of answering questions about an image is popularly known as visual question answering (VQA for short). It is a well-established problem in computer vision. However, none of the VQA methods currently utilize the text often present in the image. These "texts in images" provide additional useful cues and facilitate better understanding of the visual content. In this paper, we introduce a novel task of visual question answering by reading text in images, i.e., by optical character recognition or OCR. We refer to this problem as OCR-VQA. To facilitate a systematic way of studying this new problem, we introduce a large-scale dataset, namely OCR-VQA-200K. This dataset comprises 207,572 images of book covers and contains more than 1 million question-answer pairs about these images. We judiciously combine well-established techniques from the OCR and VQA domains to present a novel baseline for OCR-VQA-200K. The experimental results and rigorous analysis demonstrate various challenges present in this dataset, leaving ample scope for future research. We are optimistic that this new task, along with the compiled dataset, will open up many exciting research avenues for both the document image analysis and the VQA communities.

⏩ Paper Title: OCR-VQA: Visual Question Answering by Reading Text in Images
⏩ Paper: anandmishra22.github.io/files/mishra-OCR-VQA.pdf
⏩ Authors: Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, Anirban Chakraborty
⏩ Organisation: IISc Bangalore, TCS Research

⏩ OUTLINE:
00:00-01:30 : Abstract
01:30-02:25 : OCR-VQA example
02:25-04:16 : Data collection and annotation
04:16-05:08 : Challenges in OCR-VQA-200K
05:18-07:15 : Examples of OCR-VQA
07:15-10:46 : OCR-VQA Model architecture

Enjoy reading articles? Then consider subscribing to a Medium membership; it's just $5 a month for unlimited access to all free/paid content.
Subscribe now - prakhar-mishra.medium.com/membership

*********************************************
If you want to support me financially, which is totally optional and voluntary :) ❤️
You can consider buying me chai (because I don't drink coffee :) ) at www.buymeacoffee.com/TechvizCoffee
*********************************************

⏩ IMPORTANT LINKS
Research Paper Summaries:    • Simple Unsupervised Keyphrase Extract...

*********************************************
⏩ YouTube - @techvizthedatascienceguy
⏩ LinkedIn - linkedin.com/in/prakhar21
⏩ Medium - medium.com/@prakhar.mishra
⏩ GitHub - github.com/prakhar21
*********************************************

⏩ Please feel free to share the content and subscribe to my channel - @techvizthedatascienceguy

Tools I use for making videos :)
⏩ iPad - tinyurl.com/y39p6pwc
⏩ Apple Pencil - tinyurl.com/y5rk8txn
⏩ GoodNotes - tinyurl.com/y627cfsa

#techviz #datascienceguy #deeplearning #ai #transformers #documentparsing

About Me:
I am Prakhar Mishra and this channel is my passion project. I am currently pursuing my MS (by research) in Data Science. I have 3+ years of industry work experience in Data Science and Machine Learning, with a particular focus on Natural Language Processing (NLP).