Abstract. The rapid growth of information on the Web has introduced problems into the search process. One of these problems is that general-purpose search engines often return too many irrelevant results when users search for specific information on a given topic. Another is the massive increase in the number of pages that Web search systems must index. Web crawling is the process a search engine uses to collect pages from the Web. This thesis is concerned with enhancing the quality of the retrieved pages so that they contain the information most relevant to the user. To that end, we work in two directions. First, we reduce the number of training pages used by the classifier. This is achieved by a proposed feature-selection algorithm that uses the Document Frequency (DF) of each term within a category. Second, we perform Web page classification using two well-known techniques: (i) the Support Vector Machine (SVM), with both linear and nonlinear methods, and (ii) the Naive Bayes Classifier (NBC). The proposed DF-based algorithm reduces redundancy during feature selection and increases accuracy during Web page classification. We argue that this method yields a more consistent set of training pages than traditional algorithms.
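As a rough illustration of the Document Frequency technique the abstract refers to, the sketch below selects terms by how many pages in a category contain them. This is a minimal, hypothetical implementation for intuition only; the function name, the tokenized-page input format, and the fixed `top_k` cutoff are assumptions, not details taken from the thesis.

```python
from collections import Counter

def df_select(pages, top_k):
    """Rank terms by Document Frequency (DF): the number of pages in the
    category that contain each term, counting a term once per page.
    Returns the top_k highest-DF terms. (Illustrative sketch only.)"""
    df = Counter()
    for page_terms in pages:
        df.update(set(page_terms))  # set() so each term counts once per page
    return [term for term, _ in df.most_common(top_k)]

# Hypothetical example: three tokenized pages from one category
pages = [
    ["web", "crawler", "search"],
    ["web", "search", "engine"],
    ["web", "index"],
]
print(df_select(pages, 2))  # → ['web', 'search']
```

Keeping only high-DF terms shrinks the feature space, which is one way a classifier's training set can be reduced before applying SVM or Naive Bayes.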