Abstract

Due to the rapid growth in the number of documents uploaded to the Internet, information retrieval has become a very important task. Information retrieval systems aim to retrieve the relevant documents from a collection of documents in response to a user query. Both the queries and the stored documents are unstructured data. The Arabic language is important, widely used, and spoken by hundreds of millions of people. Working on Arabic information retrieval is important because Arabic is our native language and because the literature offers limited research on Arabic retrieval systems compared with that available for English. This work aims to develop an effective Arabic information retrieval model. The developed model covers document collection, query processing, document representation, indexing, document-query matching, ranking, and evaluation of the retrieval process. The performance of the model is evaluated using a collection of Arabic documents as a test-bed. To enhance the effectiveness of the retrieval process, query reformulation approaches based on relevance feedback, the semantics of query keywords, and a hybrid of both are adopted and evaluated. Retrieval with query reformulation performs better than retrieval without it. Query reformulation improved the performance by about 27% using the semantics of query keywords and by about 14% using relevance feedback. The hybrid of both improved the performance by about 15% to 35%. Query expansion is also discussed using the word embedding concept. The continuous-bag-of-words (CBOW) model, the Skip-Gram model, and the amalgamation of both are analyzed, implemented, and evaluated. Important key elements are also discussed, such as the vector length, the window size, and the number of expanded terms added for each query keyword (i.e., candidate terms).
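The role of the number of candidate terms in embedding-based query expansion can be illustrated with a minimal sketch. It is not the implementation used in this work: the embedding table holds made-up three-dimensional vectors rather than trained CBOW or Skip-Gram vectors, the English words stand in for Arabic keywords, and the names `EMBEDDINGS`, `cosine`, `candidate_terms`, and `expand_query` are hypothetical. Each keyword is expanded with its `k` nearest neighbours by cosine similarity.

```python
import math

# Toy embedding table: hypothetical values for illustration only,
# NOT vectors trained with CBOW or Skip-Gram.
EMBEDDINGS = {
    "car":     [0.90, 0.10, 0.00],
    "vehicle": [0.85, 0.15, 0.05],
    "engine":  [0.70, 0.30, 0.10],
    "fruit":   [0.00, 0.90, 0.20],
    "apple":   [0.05, 0.95, 0.10],
    "banana":  [0.10, 0.85, 0.30],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def candidate_terms(keyword, embeddings, k=1):
    """Return the k vocabulary terms most similar to the keyword."""
    if keyword not in embeddings:
        return []  # out-of-vocabulary keywords are left unexpanded
    target = embeddings[keyword]
    scored = [
        (cosine(target, vec), term)
        for term, vec in embeddings.items()
        if term != keyword
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [term for _, term in scored[:k]]


def expand_query(keywords, embeddings, k=1):
    """Append up to k candidate terms after each keyword, without duplicates."""
    expanded = []
    for word in keywords:
        for term in [word] + candidate_terms(word, embeddings, k):
            if term not in expanded:
                expanded.append(term)
    return expanded


print(expand_query(["car", "apple"], EMBEDDINGS, k=1))
```

Raising `k` adds more candidate terms per keyword, which broadens the query and can also pull in less related terms.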
The performance was improved by expanding each query keyword with only one expanded (or candidate) term, and sometimes two. The hybrid of CBOW and Skip-Gram gave the best performance, compared both with each individual model and with retrieval without expansion. The adopted query expansion approaches are run and tested using the standard CNN dataset. Moreover, CBOW improved the performance by about 24% to 36% compared with some of the results published in the literature. CBOW outperforms Skip-Gram by about 31%, while the hybrid approach outperforms CBOW by about 17.14%. Three machine learning approaches for classifying text and/or documents are also discussed: decision tree (DT), Naïve Bayes (NB), and support vector machine (SVM). These approaches are analyzed, run, and evaluated using two benchmarks as test-beds, the BBC and Aljazeera Arabic datasets. The SVM classifier outperforms the other adopted classifiers. Moreover, a set of feature selection methods is applied, including term weighting, Gini Index, Information Gain, Chi-Square, and the amalgamation of them. A feature selection method is also proposed, based on semantic feature fusion and multi-words (abbreviated as SF-MW). With the proposed method, the number of features is reduced because some features are fused. The proposed method improved the classification performance by up to 22% on the BBC and Aljazeera Arabic datasets. To further enhance the quality of the text classification process, a deep learning approach is adopted. The deep convolutional neural network (CNN) learning approach is considered and discussed for its effectiveness and its usability in different applications. A set of experiments is conducted to evaluate the performance of the CNN deep learning approach.
Also, a comparative study of the SVM, the SVM based on SF-MW, and the CNN approaches is presented. The SVM based on SF-MW outperforms the plain SVM classifier. The CNN learning approach achieved the best performance when using multiple convolution layers. The CNN approach improves the performance by about 25% compared with the SVM classifier. The CNN deep learning model is efficient and also promising for large datasets.
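Of the feature selection methods named above, Chi-Square is perhaps the simplest to illustrate. The sketch below scores each term against each class from a 2x2 contingency table and keeps the highest-scoring terms. It is an illustration under stated assumptions, not the implementation used in this work and not the proposed SF-MW method: the four toy documents, the class labels, and the names `DOCS`, `chi_square`, and `select_features` are all hypothetical.

```python
# Toy labelled corpus: each document is (set of terms, class label).
# Hypothetical data for illustration; not the BBC or Aljazeera datasets.
DOCS = [
    ({"match", "goal", "team"}, "sport"),
    ({"goal", "team", "news"}, "sport"),
    ({"market", "bank", "news"}, "economy"),
    ({"bank", "market", "price"}, "economy"),
]


def chi_square(term, label, docs):
    """Chi-square statistic of a term for one class from a 2x2 contingency table."""
    a = b = c = d = 0
    for terms, doc_label in docs:
        has_term = term in terms
        in_class = doc_label == label
        if has_term and in_class:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document in another class
        elif in_class:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class never varies
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, top_k):
    """Rank terms by their maximum chi-square over all classes; keep top_k."""
    vocab = set().union(*(terms for terms, _ in docs))
    labels = {label for _, label in docs}
    scores = {
        term: max(chi_square(term, label, docs) for label in labels)
        for term in vocab
    }
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(vocab, key=lambda term: (-scores[term], term))
    return ranked[:top_k]


print(select_features(DOCS, top_k=4))
```

In this toy corpus a term such as "news", which occurs equally often in both classes, scores zero and is dropped, while class-specific terms such as "goal" and "bank" are kept.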