Title
Visual Question Answering Using Deep Learning Techniques
Author
Radwan, Ahmed Mostafa Soliman
Committee
Researcher / Ahmed Mostafa Soliman Radwan
Supervisor / Hazem Mahmoud Abbas
Examiner / Omar Hassan Karam
Examiner / Mohamed Watheq Ali Kamel El-Kharashi
Publication Date
2021
Number of Pages
92 p.
Language
English
Degree
Master's
Specialization
Electrical and Electronic Engineering
Approval Date
1/1/2021
Place of Approval
Ain Shams University - Faculty of Engineering - Computer Engineering (Electrical)
Contents
Only 14 pages (out of 129) are available for public view.

Abstract

Visual Question Answering (VQA) is a recent task that challenges algorithms to reason about the visual content of an image in order to answer a natural language question. In this work, the performance of state-of-the-art VQA algorithms is evaluated on different VQA benchmarks. Each benchmark tests VQA algorithms at a different level: some datasets challenge the algorithms to perform complex reasoning steps to arrive at an answer, while others require the algorithms to retrieve external world knowledge to answer the posed questions. The algorithms reviewed and used in our experiments are categorized by their main contributions into four categories. First, joint embedding approaches focus on how to map the visual and textual data into a common embedding space. Second, attention-based methods focus on the relevant parts of the image or the question. Third, compositional models compose a model from smaller modules. Finally, external-knowledge-based algorithms query external sources to retrieve the facts necessary to answer a question. Other algorithms that do not belong to the aforementioned categories, but offer state-of-the-art performance, are also included.
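To make the joint embedding category concrete, the following is a minimal PyTorch sketch of such a baseline: image features and question tokens are projected into a common space and fused element-wise before classification over a fixed answer vocabulary. The class name, layer sizes, and fusion choice are illustrative assumptions, not the architecture studied in the thesis.

```python
import torch
import torch.nn as nn

class JointEmbeddingVQA(nn.Module):
    """Illustrative joint-embedding VQA baseline: project image and
    question features into a shared space, fuse, and classify."""

    def __init__(self, img_dim=2048, q_vocab=10000, embed_dim=300,
                 hidden_dim=1024, num_answers=1000):
        super().__init__()
        # Question encoder: word embeddings followed by an LSTM.
        self.embed = nn.Embedding(q_vocab, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project pre-extracted CNN image features into the same space.
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        # Classifier over a fixed answer vocabulary.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats, question_tokens):
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                 # (batch, hidden_dim)
        v = torch.relu(self.img_proj(img_feats))  # (batch, hidden_dim)
        fused = q * v                             # element-wise fusion
        return self.classifier(fused)             # answer logits
```

Element-wise multiplication is only one common fusion operator; concatenation or bilinear pooling would slot into the same skeleton.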
Our work also introduces AR-CLEVR, the first Arabic VQA dataset, which tests algorithms' ability to perform complex visual reasoning. The Arabic questions are generated for synthetic scenes by algorithms that auto-generate questions from the ground-truth information in each scene's graph, as sketched below. The results of the experiments conducted on the state-of-the-art algorithms help us choose the best algorithm for our newly introduced dataset. The new dataset is integrated into the openvqa framework, enabling future researchers interested in the VQA problem to easily reproduce our results and evaluate new algorithms on our new benchmark.
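The sketch below illustrates the general idea of template-based generation from ground-truth scene annotations, in the style of CLEVR. The scene structure, the Arabic counting template, and the attribute lexicon are all hypothetical stand-ins; the actual AR-CLEVR templates and scene-graph format may differ.

```python
import random

# Hypothetical CLEVR-style scene graph: ground-truth objects with
# attributes (the real AR-CLEVR annotations may be structured differently).
scene = [
    {"shape": "cube",     "color": "red",  "size": "large"},
    {"shape": "sphere",   "color": "blue", "size": "small"},
    {"shape": "cylinder", "color": "red",  "size": "small"},
]

# Illustrative Arabic attribute lexicon for filling template slots.
COLOR_AR = {"red": "الحمراء", "blue": "الزرقاء"}

def count_question(color):
    """Fill a counting template ('How many <color> objects are there?')
    and compute the answer directly from the ground-truth scene."""
    question = f"كم عدد الأشياء {COLOR_AR[color]}؟"
    answer = sum(1 for obj in scene if obj["color"] == color)
    return question, str(answer)

# Sample one question/answer pair for this scene.
color = random.choice(list(COLOR_AR))
q, a = count_question(color)
print(q, "->", a)
```

Because both the question and its answer are derived from the same ground-truth graph, every generated pair is correct by construction, which is what makes such benchmarks suitable for testing visual reasoning.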