Author: Elnahas, Ayat Elnabawey Abdel Hay. / Title: Evaluation and Enhancement of Arabic Text Retrieval Models

Title

Evaluation and Enhancement of Arabic Text Retrieval Models

Author

Elnahas, Ayat Elnabawey Abdel Hay.

Committee

Researcher / آيات النبوي عبد النبي النحاس

Supervisor / نوال أحمد راغب الفيشاوي

Examiner / محمد نورالسيد أحمد

Examiner / سيد عبدالهادي عيد نوح

Subject

Genetic algorithms. Machine Learning.

Publication date

2021.

Number of pages

159 p.

Language

English

Degree

Master's

Specialization

Computer Science

Date of approval

1/7/2021

Place of approval

Menoufia University - Faculty of Computers and Information - Department of Computer Science and Engineering

Index

Only 14 pages are available for public view

Abstract

Due to the rapid growth in the number of documents uploaded to the
Internet, information retrieval has become a very important task. Information
retrieval systems aim to retrieve the relevant documents from a collection of
documents in response to a user query. Both the queries and the stored
documents are unstructured data. The Arabic language is important, widely
used, and spoken by hundreds of millions of people. Adopting Arabic information
retrieval is important because Arabic is our native language and because
fewer research efforts in the literature address Arabic retrieval
systems than address retrieval for the English language.
This work aims to develop an effective Arabic information retrieval model.
The developed model covers the main stages of retrieval: document collection,
query processing, document representation, indexing, document-query
matching, ranking, and evaluation of the retrieval process. The performance of
the developed Arabic retrieval model is evaluated using a collection of Arabic
documents as a test-bed.
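
The stages listed above (representation, indexing, matching, ranking) can be illustrated with a minimal sketch. It assumes TF-IDF term weighting and cosine similarity, which the abstract does not name as the thesis's method; the three toy documents, the whitespace tokenization, and the function names are all hypothetical and stand in for a real Arabic preprocessing pipeline.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # +1 keeps a nonzero weight for terms that occur in every document.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
    scores = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))]

# Hypothetical toy collection; real Arabic text would also need
# normalization and stemming before tokens are compared.
docs = [
    "الاقتصاد العالمي ينمو".split(),
    "كرة القدم والرياضة".split(),
    "نمو الاقتصاد المحلي".split(),
]
ranking = rank("الاقتصاد".split(), docs)
```

In this toy run the two documents that contain the query term are ranked ahead of the one that does not.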
To enhance the effectiveness of the retrieval process, query reformulation
approaches based on relevance feedback, the semantics of query keywords, and a
hybrid of both are adopted and evaluated. Retrieval with query
reformulation performs better than retrieval without it. Query
reformulation improved the performance by about 27% using the semantics of
query keywords and by about 14% using relevance feedback.
The hybrid of both improved the performance by about 15% to 35%.
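
As an illustration of relevance feedback, the sketch below uses the classic Rocchio update, which moves the query vector toward documents judged relevant and away from those judged non-relevant. The abstract does not say which feedback formula the thesis uses, so Rocchio, the default weights alpha, beta, and gamma, and the toy vectors are all assumptions.

```python
from collections import defaultdict

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Reformulate a sparse query vector from relevance judgments.

    query, and each item of relevant / non_relevant, is a {term: weight} dict.
    """
    new_q = defaultdict(float)
    for term, weight in query.items():
        new_q[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            new_q[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            new_q[term] -= gamma * weight / len(non_relevant)
    # Terms pushed to zero or below are dropped from the new query.
    return {t: w for t, w in new_q.items() if w > 0}

# Hypothetical judgments for a one-term query.
new_query = rocchio(
    {"bank": 1.0},
    relevant=[{"bank": 1.0, "loan": 1.0}],
    non_relevant=[{"bank": 1.0, "river": 1.0}],
)
```

Here the reformulated query gains the term "loan" from the relevant document, while "river" receives a negative weight and is discarded.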
The query expansion process is also discussed using the word embedding
concept. The continuous-bag-of-words (CBOW) model, the Skip-Gram model, and the
amalgamation of both are analyzed, implemented, and evaluated.
Important key elements are also discussed, such as the vector length, window
size, and the number of expanded terms added for each query keyword (i.e. candidate
terms). The performance was improved by expanding a query keyword with only
one expanded (or candidate) term, and sometimes two. The hybrid of
CBOW and Skip-Gram gave the best performance compared with each
individual model and with retrieval without expansion. The adopted
query expansion approaches are run and tested using the standard CNN
dataset. Moreover, the performance improvement was about 24% to 36% for
CBOW compared with some of the results published in the literature.
CBOW outperforms Skip-Gram by about 31%, while the
hybrid approach outperforms the CBOW model by about 17.14%.
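
Embedding-based query expansion can be sketched as follows: each query keyword is extended with its k nearest terms in the embedding space. The abstract does not describe how CBOW and Skip-Gram are combined, so averaging the two cosine similarities is an assumption made here for illustration, and the two-dimensional vectors are hand-written placeholders rather than trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand(query, cbow, skipgram, k=1):
    """Append up to k candidate terms per query keyword.

    cbow and skipgram map term -> vector. The hybrid score is the mean of
    the two models' similarities (an assumed combination rule).
    """
    expanded = list(query)
    for term in query:
        if term not in cbow or term not in skipgram:
            continue  # out-of-vocabulary keywords are left unexpanded
        scores = {}
        for cand in cbow:
            if cand in query or cand not in skipgram:
                continue
            scores[cand] = 0.5 * (
                cosine(cbow[term], cbow[cand])
                + cosine(skipgram[term], skipgram[cand])
            )
        best = sorted(scores, key=lambda c: (-scores[c], c))[:k]
        expanded.extend(c for c in best if c not in expanded)
    return expanded

# Placeholder vectors, not trained embeddings.
cbow = {"car": [1.0, 0.0], "vehicle": [0.9, 0.1], "fruit": [0.0, 1.0]}
skipgram = {"car": [0.8, 0.2], "vehicle": [0.7, 0.3], "fruit": [0.1, 0.9]}
expanded = expand(["car"], cbow, skipgram, k=1)
```

The parameter k corresponds to the number of candidate terms per keyword discussed above; the abstract reports that one term, and sometimes two, worked best.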
Three machine learning approaches for classifying text and/or documents are
also discussed. The approaches are decision tree (DT), Naïve Bayes (NB), and
support vector machine (SVM). The machine learning approaches are
analyzed, run, and evaluated using two benchmarks as test-beds: the BBC and
Aljazeera Arabic datasets. The SVM
classifier outperforms the other adopted classifiers.
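
Of the three classifiers named above, Naïve Bayes is the simplest to show in a few lines. The sketch below is a generic multinomial Naïve Bayes with add-one (Laplace) smoothing, not the thesis's implementation; the four training samples and the label names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (tokens, label) pairs. Returns the counts needed to predict."""
    class_docs = Counter(label for _, label in samples)
    term_counts = defaultdict(Counter)
    for tokens, label in samples:
        term_counts[label].update(tokens)
    vocab = {t for tokens, _ in samples for t in tokens}
    return class_docs, term_counts, vocab

def predict_nb(model, tokens):
    """Return the label with the highest log posterior for the given tokens."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_terms = sum(term_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # class prior
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                # Add-one smoothing avoids zero probabilities.
                score += math.log(
                    (term_counts[label][t] + 1) / (total_terms + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
train = [
    ("goal match team".split(), "sport"),
    ("team win cup".split(), "sport"),
    ("market stock price".split(), "business"),
    ("bank stock profit".split(), "business"),
]
model = train_nb(train)
label = predict_nb(model, "stock market".split())
```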
Moreover, a set of feature selection methods is applied. This
includes term weighting, Gini Index, Information Gain, Chi-Square, and the
amalgamation of them. A feature selection method is also proposed. The
proposed method is based on semantic feature fusion and multi-words
(abbreviated as SF-MW). Using the proposed method, the number of features is
reduced due to the fusion of some features. The proposed method improved the
classification performance by up to 22% using the BBC and Aljazeera Arabic
datasets.
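
Of the feature selection methods listed above, Chi-Square can be sketched compactly. For a term and a class, it is computed from the 2x2 table of document counts (term present or absent, document in or out of the class). The code below is a generic illustration on hypothetical samples; it does not reproduce the proposed SF-MW method, whose details are not given in this abstract.

```python
def chi_square(term, label, samples):
    """Chi-square statistic between a term's presence and a class label.

    samples: list of (tokens, label) pairs.
    """
    a = b = c = d = 0
    for tokens, y in samples:
        present = term in tokens
        if present and y == label:
            a += 1  # term present, document in class
        elif present:
            b += 1  # term present, document in another class
        elif y == label:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(samples, k):
    """Keep the k terms with the highest chi-square score over any class."""
    labels = {y for _, y in samples}
    vocab = {t for tokens, _ in samples for t in tokens}
    score = {t: max(chi_square(t, y, samples) for y in labels) for t in vocab}
    return sorted(vocab, key=lambda t: (-score[t], t))[:k]

# Hypothetical samples.
samples = [
    ("goal match team".split(), "sport"),
    ("team win cup".split(), "sport"),
    ("market stock price".split(), "business"),
    ("bank stock profit".split(), "business"),
]
top_terms = select_features(samples, 2)
```

The two terms selected here are the ones that occur in every document of one class and in none of the other.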
To further enhance the quality of the text classification process, a deep learning
approach is adopted. The deep convolutional neural network (CNN) learning
approach is considered and discussed for its effectiveness and its usability for
different applications. A set of experiments is run to evaluate the
performance of the CNN deep learning approach. Also, a comparative study of
the behavior of the SVM, the SVM based on SF-MW, and the CNN
approaches is presented. The SVM based on SF-MW outperforms the plain SVM
classifier. The CNN learning approach performed best when using multiple
convolution layers. The performance of the CNN approach is improved by
about 25% compared with the SVM classifier. The CNN deep learning model
is efficient and also promising for large datasets.
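
The core operation of a convolutional text classifier can be shown without a deep learning library: a filter slides over consecutive token embeddings, a ReLU is applied, and max-over-time pooling keeps the strongest response. This is a minimal sketch of one filter in one layer; the thesis's actual architecture, layer count, and hyperparameters are not given in the abstract, and the embeddings and kernel values below are placeholders.

```python
def conv1d_max_pool(embeddings, kernel, bias=0.0):
    """Apply one convolution filter over a token sequence, then max-pool.

    embeddings: list of token vectors (one per token).
    kernel: list of weight vectors, one per position in the filter window.
    """
    width = len(kernel)
    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        total = bias + sum(
            w * x
            for k_row, e_row in zip(kernel, window)
            for w, x in zip(k_row, e_row)
        )
        activations.append(max(0.0, total))  # ReLU
    # Max-over-time pooling; sequences shorter than the filter yield 0.0.
    return max(activations) if activations else 0.0

# Placeholder values: three tokens with 2-dimensional embeddings,
# and a filter spanning two tokens.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]
feature = conv1d_max_pool(embeddings, kernel)
```

A full model would apply many such filters, stack further layers, and feed the pooled features to a classification layer.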