Author: Bahgat, Eman Mohamed./ Title: Efficient Email Classification Technique Based on Semantic Methods /

Title

Efficient Email Classification Technique Based on Semantic Methods /

Author

Bahgat, Eman Mohamed.

Preparation committee

Researcher / Eman Mohamed Bahgat

Supervisor / Ibrahim Fathy Moawad

Supervisor / Walaa Khaled Gad

Supervisor / Sherine Rady Abdel Ghany

Publication date

2018.

Number of pages

89 p. :

Language

English

Degree

Master's

Specialization

Information Systems

Date of approval

1/1/2018

Place of approval

Ain Shams University - Faculty of Computer and Information Sciences - Information Systems

Abstract

Email has become one of the major applications in daily life. It is one of the most popular means of communication because of its easy accessibility, low sending cost, and fast message transfer. The continuous growth in the number of email users has led to a massive increase in useful emails as well as unsolicited ones. The latter, known as spam emails, are a severe problem that affects the performance of both users and computer networks. Managing and classifying this huge number of emails is an important challenge. Email filtering is a way to manage such volume and to isolate spam emails.
Most approaches recently introduced in the literature to deal with the huge number of spam emails handle the high dimensionality of emails by filtering on syntactic features.
This thesis proposes an email filtering approach based on semantic methods. The proposed framework consists of two parts. The first part uses classification techniques: the body of each email message is analyzed and its terms are extracted. Weights are assigned to the terms (features) that can help identify an email as spam or ham (i.e., clean). An adaptation of this structure is proposed to reduce the number of extracted features by keeping only meaningful terms, which are identified by consulting an English dictionary.
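The first part of the framework can be illustrated with a minimal sketch. The abstract names neither the weighting scheme nor the dictionary, so this example assumes TF-IDF weights and uses a small hypothetical word set (`ENGLISH_WORDS`) in place of a real English dictionary; the function names are illustrative as well.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for an English dictionary lookup (an assumption;
# the abstract does not name the dictionary that was consulted).
ENGLISH_WORDS = {"free", "money", "meeting", "offer", "report", "tomorrow", "win"}


def extract_terms(body, dictionary=None):
    """Lowercase the email body, split it into alphabetic tokens, and
    optionally keep only the tokens found in the dictionary."""
    tokens = re.findall(r"[a-z]+", body.lower())
    if dictionary is not None:
        tokens = [t for t in tokens if t in dictionary]
    return tokens


def tfidf_weights(documents):
    """Return one {term: weight} dict per document.

    TF-IDF is an assumed weighting scheme, chosen only for illustration."""
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))  # count each term once per document
    weights = []
    for terms in documents:
        counts = Counter(terms)
        total = len(terms)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


emails = [
    "WIN free money now!!! Free offer xxyzz",
    "Meeting tomorrow: please bring the report",
]
raw = [extract_terms(e) for e in emails]
filtered = [extract_terms(e, ENGLISH_WORDS) for e in emails]
weights = tfidf_weights(filtered)

# The dictionary check shrinks the vocabulary before any weighting happens.
print(len(set().union(*raw)), "terms before,", len(set().union(*filtered)), "after")
```

In this toy run the nonsense token `xxyzz` and the words missing from the hypothetical dictionary are dropped, so fewer features reach the classifier.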
In the second part, WordNet is used as an ontology to apply different semantic similarity measures, which reduce the number of features and, with it, the space and time complexity. Moreover, feature dimensionality reduction is integrated to obtain a minimal optimal feature set. Two techniques are evaluated for this purpose: Principal Component Analysis (PCA) and Correlation-based Feature Selection (CFS).
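The idea of reducing features through semantic similarity can be sketched as follows. This is not the thesis implementation: a tiny hand-written hypernym table (`HYPERNYM`) stands in for WordNet, the measure shown is a path-based similarity (one of several measures WordNet supports), and the greedy merge rule and the 0.5 threshold are assumptions made for illustration.

```python
# Toy hypernym ("is-a") links standing in for WordNet; the entries and
# the threshold used below are illustrative assumptions.
HYPERNYM = {
    "cash": "money",
    "currency": "money",
    "money": "asset",
    "prize": "award",
    "award": "asset",
    "asset": "entity",
    "meeting": "event",
    "event": "entity",
}


def path_to_root(term):
    """List the term followed by its ancestors up to the taxonomy root."""
    path = [term]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def path_similarity(a, b):
    """Return 1 / (1 + number of edges on the shortest is-a path)."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_b = {node: i for i, node in enumerate(path_b)}
    for i, node in enumerate(path_a):
        if node in depth_in_b:  # first shared ancestor is the closest one
            return 1.0 / (1 + i + depth_in_b[node])
    return 0.0  # no common ancestor in the toy taxonomy


def merge_similar_terms(terms, threshold=0.5):
    """Greedily map each term to the first kept term that is similar enough."""
    kept, mapping = [], {}
    for term in terms:
        match = next(
            (k for k in kept if path_similarity(term, k) >= threshold), None
        )
        if match is None:
            kept.append(term)
            match = term
        mapping[term] = match
    return kept, mapping


features = ["money", "cash", "currency", "prize", "meeting"]
kept, mapping = merge_similar_terms(features)
print(kept)     # features that survive the merge
print(mapping)  # which representative each original term maps to
```

Terms that sit close together in the taxonomy collapse onto one representative feature, so the feature space shrinks without discarding their shared meaning.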
Experiments were conducted, and the proposed framework and methods were tested on the standard benchmark Enron dataset, a large public email collection. The SVM and Logistic Regression classifiers recorded the best accuracy of 96%, followed by Naïve Bayes with 92.3%.
When semantics and feature selection were integrated, the Logistic Regression classifier achieved the highest accuracy of 95%, followed by Naïve Bayes and SVM with similar accuracy of 94%. With feature selection integrated, the average recorded accuracy of every classifier used improved to above 90%, while the feature space was reduced by more than 90%. The experimental results also showed that the CFS feature selection technique gave better results than PCA.
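For readers unfamiliar with CFS, the sketch below computes the standard CFS merit score, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation of a subset of k features. Using absolute Pearson correlation as the correlation measure is an assumption made here for simplicity; the abstract does not state which measure or tooling the thesis used, and the data are made up.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def cfs_merit(columns, labels):
    """CFS merit of a feature subset: high when features correlate with
    the class and low when they correlate with each other."""
    k = len(columns)
    r_cf = sum(abs(pearson(col, labels)) for col in columns) / k
    pairs = list(combinations(columns, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Made-up data: 1 = spam, 0 = ham.
labels = [1, 1, 0, 0]
relevant = [1, 1, 0, 0]    # tracks the label exactly
redundant = [1, 1, 0, 0]   # duplicate of the relevant feature
irrelevant = [1, 0, 1, 0]  # uncorrelated with the label

print(cfs_merit([relevant], labels))
print(cfs_merit([relevant, redundant], labels))
print(cfs_merit([relevant, irrelevant], labels))
```

Adding the redundant column does not raise the merit and adding the irrelevant one lowers it, which is why a CFS-style search tends to settle on a small subset of the original features.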
Consequently, the conducted experiments showed that the proposed framework performs strongly in terms of accuracy and time compared with other related work. Integrating semantic concepts with feature reduction approaches improved both the computational performance and the classification accuracy.