Author: Nour El-Deen, Doha Taha./ Title: A Design of an Automatic Web Page Classification System /

Title

A Design of an Automatic Web Page Classification System /

Author

Nour El-Deen, Doha Taha.

Preparation committee

Researcher / ضـحى طـه نور الديـن رياض

Supervisor / طــارق مصطفـى محمود

Supervisor / طــارق عبدالحفيظ عبدالرحمن

Examiner / أحمد شرف الدين أحمد

Examiner / عادل أبو المجد

Subject

Intelligent agents (Computer software).

Publication date

2017.

Number of pages

111 p.

Language

English

Degree

Master's

Specialization

Computer Science

Approval date

1/1/2017

Place of approval

Minia University - Faculty of Science - Computer Science

Abstract

Over the last decade, the growing popularity of the World Wide Web has led to exponential growth in the number of pages available on it.
This huge number of Web pages makes it increasingly difficult for users to find the information they need: a search for specific information easily gets lost among a vast number of irrelevant results.
The amount of information accessible on the Web has increased enormously in a short period of time. This increase is desirable, but it has also made the Web's problems more evident. Anyone who has searched the Web for information knows that it is not as easy or as fast as one would like.
Classification relies on a classifier that is trained on a set of data labeled with predefined categories and then applied to label future examples. It plays a fundamental role in a number of essential tasks in information retrieval and management.
Web page classification is important for companies that provide marketing and analytics platforms, because it helps them build a healthy mix of listings on search engines and large directories.
Classifying web page content is also essential to many tasks in web information retrieval, such as maintaining web directories and focused crawling. There are several methods for classifying web pages; the most popular are the Naïve Bayesian classifier, the Support Vector Machine classifier and the K-Nearest Neighbor classifier.
The main objective of this thesis is to build an efficient system for classifying web pages. The proposed classification system consists of four phases: a training phase, a classification phase, an improving phase and a testing phase.
In the training phase, we use 30% of the database to train the system.
In the classification phase, we use the Naïve Bayes, k-Nearest Neighbor and Support Vector Machine algorithms to classify the Web pages.
In the improving phase, we try to improve the performance of the system by combining the three algorithms (Naïve Bayes, k-Nearest Neighbor and Support Vector Machine).
In the testing phase, we use a randomly chosen dataset consisting of 70% of the database, drawn from Reuters-21578.
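The training and classification phases can be illustrated with a minimal sketch. It assumes documents are given as (text, label) pairs and uses a plain multinomial Naïve Bayes with Laplace smoothing over whitespace tokens; the function names (`split_30_70`, `train_nb`, `predict_nb`) and the tokenization are illustrative assumptions, not the thesis's actual implementation.

```python
import math
import random
from collections import Counter, defaultdict

def split_30_70(docs, seed=0):
    """Shuffle and return (train, test): 30% for training, 70% for testing."""
    shuffled = docs[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 3 // 10
    return shuffled[:cut], shuffled[cut:]

def train_nb(train):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in train:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log posterior (Laplace smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

For example, after `train, test = split_30_70(docs)` and `model = train_nb(train)`, each test page can be labeled with `predict_nb(model, text)`.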
We conducted six experiments with the proposed system, each evaluating its effectiveness with a different classifier configuration:
first, the Naïve Bayesian classifier alone; second, the Support Vector Machine algorithm alone; third, a hybrid of Naïve Bayesian and Support Vector Machine;
fourth, the K-Nearest Neighbor algorithm alone; fifth, a hybrid of Naïve Bayesian and K-Nearest Neighbor; and sixth, a hybrid of the Naïve Bayesian, Support Vector Machine and K-Nearest Neighbor algorithms.
In this implementation, the Naïve Bayesian algorithm is used for feature selection and the Support Vector Machine is then used for classification. At the end of this process, the accuracy rate of the classifier is calculated.
Accordingly, some Web pages are classified correctly and others incorrectly. The Web pages that were classified incorrectly are then classified again using the K-Nearest Neighbor algorithm.
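The re-classification step described above can be sketched as a two-stage cascade. This is only an illustration under stated assumptions: `primary_predict` is a hypothetical stand-in for the first-stage classifier (the abstract uses Naïve Bayesian feature selection followed by a Support Vector Machine, which is not reproduced here), pages are (text, label) pairs, and the K-Nearest Neighbor stage uses cosine similarity over word counts, a choice the abstract does not specify.

```python
import math
from collections import Counter

def bag(text):
    """Bag-of-words vector as a Counter of lowercase whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train, text, k=3):
    """Majority label among the k training pages most similar to text."""
    query = bag(text)
    ranked = sorted(train, key=lambda item: cosine(bag(item[0]), query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def cascade(primary_predict, train, labeled_pages, k=3):
    """Label each page with the primary classifier; pages it gets wrong
    (judged against the known label) are classified again with k-NN."""
    final = []
    for text, true_label in labeled_pages:
        guess = primary_predict(text)
        if guess != true_label:
            guess = knn_predict(train, text, k)
        final.append(guess)
    return final
```

Note that deciding which pages were classified incorrectly requires the true labels, so this sketch only applies to labeled evaluation data.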
The experimental results show that classifying web pages with the hybrid of the Naïve Bayesian, Support Vector Machine and K-Nearest Neighbor algorithms performs better than using Naïve Bayesian alone (the usual choice when a fast, accurate classifier is wanted), K-Nearest Neighbor alone, or Support Vector Machine alone: it reduces the false positive rate and achieves the highest accuracy.
The experiments, applied to 10,000 web pages (documents), show high efficiency, with an average false positive rate of 0%,
an average true positive rate of 1%, an average F-measure of 1% (both as stated; presumably 1 on a 0 to 1 scale) and an average overall accuracy of 99.98%.
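For reference, the reported measures follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are illustrative assumptions, not figures from the thesis.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard measures from confusion-matrix counts, each on a 0 to 1 scale."""
    def ratio(num, den):
        # Guard against empty denominators (e.g. no positive examples).
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)        # true positive rate (recall)
    fpr = ratio(fp, fp + tn)        # false positive rate
    precision = ratio(tp, tp + fp)
    f_measure = ratio(2 * precision * tpr, precision + tpr)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    return {
        "tpr": tpr,
        "fpr": fpr,
        "precision": precision,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }
```

With made-up counts `binary_metrics(tp=40, fp=10, tn=45, fn=5)`, the accuracy is 0.85 and the precision is 0.8; multiplying any value by 100 gives a percentage.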
This thesis is organized in six chapters.
 In chapter 1, we introduce classification, machine learning and web mining, along with the problem formulation and the thesis objective.
 In chapter 2, we review related work and the literature on web page classification concepts, such as the definitions of web page classification, text classification and features.
 In chapter 3, we introduce the techniques used in the proposed system for web page classification. We then describe the organization's problem that our research needs to solve. Finally, we introduce some metrics used to evaluate a classifier's performance, such as precision, recall and F-measure.
 In chapter 4, we apply the proposed techniques with different feature sets and test the Naïve Bayesian, Support Vector Machine and k-Nearest Neighbor classification algorithms on our data set, because they outperform other algorithms in most cases.
 In chapter 5, we present the experimental results.
 In chapter 6, we conclude the thesis by summarizing our research contributions and discussing the research limitations and future work.