Author: Malhat, Mohamed Gaber Mohamed. / Title: Intelligent Reduction Techniques for Big Data


Title

Intelligent Reduction Techniques for Big Data

Author

Malhat, Mohamed Gaber Mohamed.

Preparation committee

Researcher / Mohamed Gaber Mohamed Malhat

Supervisor / Ashraf Bahgat El-Sisi

Examiner / Gamal Mahrous Attiya

Examiner / Hatem Sayed Ahmed

Subject

Information science. Big data. Artificial intelligence.

Publication date

2019.

Number of pages

166 p.

Language

English

Degree

Doctorate

Specialization

Information Systems

Date of approval

17/9/2019

Place of approval

Menoufia University - Faculty of Computers and Information - Computer Science

Index

Only 14 of 166 pages are available for public view

Abstract

The explosion of digital and electronic data in many fields (e.g., medicine, industry,
and social networks) has given rise to the term "big data". Such data urgently needs
to be mined for many critical reasons (e.g., making appropriate decisions
and preparing short- and long-term strategic plans). Current data-mining
algorithms, especially instance-based learning algorithms, become impractical for
processing and mining such data. Therefore, data reduction techniques are applied
before the mining algorithms to scale the data size down, which in turn reduces the
necessary computational resources. Instance selection (IS) methods are used along with
instance-based learning to select the relevant prototypes needed to perform the
learning task effectively. However, the classical IS methods have failed
to address two main challenges in a reasonable way: (1) the
effectiveness challenge, and (2) the memory limitation challenge. Therefore, new and
enhanced IS methods have been put forward in the literature to address these
challenges. The density-based methods have achieved promising results, but only
in addressing the effectiveness challenge. They use a local density function to
search for and select relevant instances. However, their solutions
always favor increasing the reduction rate, which is unsuitable for
most application domains. To address the memory limitation challenge, data
partitioning approaches have been advocated to partition a given big dataset into a
group of manageable subsets. The IS methods can then be applied to each of these
subsets individually. However, the current partitioning approaches greatly degrade
the performance of the applied IS methods, which makes them unsuitable for
several application domains.
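
As a small illustration of the two measures the abstract keeps returning to, the
following Python sketch computes a reduction rate and a 1-NN classification accuracy
for a reduced training set. The definition of reduction rate used here (the fraction of
instances removed) is the one commonly used in the instance selection literature and
is an assumption, since the abstract does not spell it out. The function names, the
toy data, and the "selected" prototypes are hypothetical.

```python
import math

def nearest_label(query, instances, labels):
    """Return the label of the stored instance closest to `query` (1-NN)."""
    best = min(range(len(instances)), key=lambda i: math.dist(query, instances[i]))
    return labels[best]

def reduction_rate(n_original, n_selected):
    """Fraction of the original instances that the IS method removed (assumed definition)."""
    return 1.0 - n_selected / n_original

def accuracy(selected_X, selected_y, test_X, test_y):
    """1-NN accuracy when only the selected prototypes are kept in memory."""
    hits = sum(nearest_label(x, selected_X, selected_y) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_X)

# Toy data: two well-separated classes on a line.
train_X = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
train_y = ["a", "a", "a", "b", "b", "b"]

# Hypothetical output of some IS method: one prototype per class.
sel_X, sel_y = [(0.1,), (1.1,)], ["a", "b"]

test_X, test_y = [(0.05,), (0.3,), (0.9,), (1.3,)], ["a", "a", "b", "b"]

rr = reduction_rate(len(train_X), len(sel_X))
acc = accuracy(sel_X, sel_y, test_X, test_y)
print(f"reduction rate = {rr:.2f}, accuracy = {acc:.2f}")
```

A method that pushes the reduction rate up by discarding prototypes near the class
boundary would score well on the first number and poorly on the second, which is the
trade-off the abstract describes.
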
The ultimate goal of this thesis is to propose a new operational, layered approach that
addresses these challenges intelligently within an integrated framework. The
term "intelligent" refers to the ability of the methods and approaches proposed in
the framework to balance a high reduction rate and classification accuracy
on one side and low memory requirements and computation time on the other side.
We begin by proposing three combination methods and then
propose two novel methods to address the effectiveness challenge. More precisely, we
combine two existing IS methods from the condensation and edition taxonomies in the
instance selection layer. We then compare the behavior of the combined methods with
both individual methods and with the iterative case filtering (ICF) method, which
belongs to the hybrid taxonomy, using four standard datasets. The experimental results
demonstrate that our combination methods approximately match the effectiveness
of the ICF method and remarkably improve on the effectiveness of the
individual methods. Moreover, our combination methods require far less
computation time than the ICF method. We then propose two novel methods
by replacing the local density function used in the density-based methods with a
global one and adding two new functions, called the relevance and irrelevance
functions, to identify relevant instances. These novel methods are compared with the
current density-based methods using eighteen standard datasets. The experimental
results show that our first proposed method achieves better classification accuracy
than the density-based methods, while our second proposed method improves on the
effectiveness of both the first method and the density-based methods.
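
The abstract does not name the two IS methods that were combined, nor how the three
combinations differ. Purely as an illustration of chaining an edition method with a
condensation method, the sketch below applies Wilson's edited nearest neighbor (ENN)
followed by Hart's condensed nearest neighbor (CNN). The choice of these two methods,
their order, k = 3, and the toy data are all assumptions, not the configuration used
in the thesis.

```python
import math
from collections import Counter

def knn_labels(query, X, y, k, skip=None):
    """Labels of the k instances nearest to `query`, ignoring index `skip`."""
    order = sorted((i for i in range(len(X)) if i != skip),
                   key=lambda i: math.dist(query, X[i]))
    return [y[i] for i in order[:k]]

def enn(X, y, k=3):
    """Edition step (ENN): drop instances whose k nearest neighbours outvote their label."""
    keep = [i for i in range(len(X))
            if Counter(knn_labels(X[i], X, y, k, skip=i)).most_common(1)[0][0] == y[i]]
    return [X[i] for i in keep], [y[i] for i in keep]

def cnn(X, y):
    """Condensation step (CNN): add an instance to the store only if the store misclassifies it."""
    store = [0]  # seed the store with the first instance
    changed = True
    while changed:  # repeat passes until a full pass adds nothing
        changed = False
        for i in range(len(X)):
            if i in store:
                continue
            store_X = [X[j] for j in store]
            store_y = [y[j] for j in store]
            if knn_labels(X[i], store_X, store_y, 1)[0] != y[i]:
                store.append(i)
                changed = True
    return [X[i] for i in store], [y[i] for i in store]

# Toy data: two clusters on a line plus one mislabeled point at 0.15.
X = [(0.0,), (0.1,), (0.15,), (0.2,), (0.3,), (1.0,), (1.1,), (1.2,), (1.3,)]
y = ["a", "a", "b", "a", "a", "b", "b", "b", "b"]

edited_X, edited_y = enn(X, y)              # removes noisy instances
final_X, final_y = cnn(edited_X, edited_y)  # then condenses what is left
print(final_X, final_y)
```

On this toy data the edition step removes the mislabeled point at 0.15, and the
condensation step then keeps one prototype per class. Running CNN alone on the same
data keeps five instances, because the noisy point drags several of its neighbours
into the store.
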
Building on these positive results, we address the memory limitation challenge in the
data partitioning layer, which sits above the instance selection layer. We first
perform a comparative study between two well-known data partitioning approaches (i.e.,
Random Partitioning (RP) and Stratification Partitioning (SP)) and the Non-Partitioning
(NoP) approach using four standard datasets. The standard condensed nearest
neighbor (CNN) method is used in the instance selection layer. The experimental
results show that the two partitioning approaches severely degrade the performance of
the CNN method compared to the NoP approach in terms of the reduction rate and
effectiveness metrics. To address this degradation, we propose two new data
partitioning approaches: (1) the first approach distributes the instances to the
partitioned subsets based on a given distance metric and overlaps the instances in two
subsets once they satisfy a given threshold, and (2) the second approach follows the
same strategy as the first approach and additionally ensures an equal class
distribution in the partitioned subsets. We also develop a non-overlapping version of
each proposed approach to validate the importance of the overlapping. The CNN method
is used to validate and compare the partitioning approaches using eight standard
datasets. The experimental results show that our proposed approaches improve the
performance in terms of the reduction rate and effectiveness metrics compared to the
other approaches. Moreover, we advance the state of the art by performing a comparative
study to validate and compare the performance of our partitioning approaches and
three standard partitioning approaches using four IS methods from different
taxonomies over eight standard datasets with different subset sizes to study their
scalability. Based on the study findings, we introduce a set of recommendations
that can help prospective researchers select an intelligent and appropriate
combination of partitioning approaches and IS methods with respect to
the required evaluation metric (e.g., reduction rate and classification accuracy). To
increase the feasibility and usability of our framework, we integrate it
with a suitable user interface layer, a user interface management and validation layer,
and a system support layer. All these layers are seamlessly integrated and fully
implemented using Archived Java to produce a new open toolkit available for
prospective researchers to perform comparable experiments.
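
The two baseline partitioning approaches named above, RP and SP, can be sketched in a
few lines of Python. This is a minimal illustration of the general idea only: the
function names, the round-robin dealing, and the seed handling are assumptions, and
the sketch does not attempt the distance-based overlapping approaches proposed in the
thesis, which the abstract does not specify in enough detail to reproduce.

```python
import random
from collections import defaultdict

def random_partition(indices, n_subsets, seed=0):
    """RP: shuffle the instance indices and deal them round-robin into subsets."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]

def stratified_partition(labels, n_subsets, seed=0):
    """SP: deal each class round-robin so every subset keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    subsets = [[] for _ in range(n_subsets)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            subsets[pos % n_subsets].append(idx)
    return subsets

# Toy labels: 8 instances of class "a" and 4 of class "b".
labels = ["a"] * 8 + ["b"] * 4

rp = random_partition(range(len(labels)), n_subsets=4)
sp = stratified_partition(labels, n_subsets=4)

for name, parts in (("RP", rp), ("SP", sp)):
    print(name, [sorted(labels[i] for i in part) for part in parts])
```

With SP every subset here receives two "a" instances and one "b" instance, whereas RP
only guarantees equal subset sizes. In both cases an IS method would then be run on
each subset independently and the selected instances merged.
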
Keywords: Big Data; Data Mining; Data Preprocessing; Data Reduction;
Classification; Instance Selection; Instance-Based Learning; Knowledge Discovery in
Databases.