![]() | Only 14 pages are available for public view |
Abstract

The explosion of digital and electronic data in many fields (e.g., medicine, industry, and social networks) has given rise to the term "big data". Such data urgently needs to be mined for many critical purposes (e.g., making appropriate decisions and preparing short- and long-term strategic plans). Current data-mining algorithms, especially instance-based learning algorithms, become impractical for processing and mining such data. Therefore, data reduction techniques are applied prior to the mining algorithms to scale the data size down, which in turn reduces the necessary computational resources. Instance selection (IS) methods are used along with instance-based learning to select the relevant prototypes needed to perform the learning task effectively. However, the classical IS methods have failed to address two main challenges in a reasonable way: (1) the effectiveness challenge and (2) the memory limitation challenge. Therefore, new and enhanced IS methods have been put forward in the literature to address these challenges. Density-based methods have achieved promising results, but only towards addressing the effectiveness challenge. They use a local density function to search for and select relevant instances. However, their solutions always favor increasing the reduction rate, which is unsuitable for most application domains. To address the memory limitation challenge, data partitioning approaches have been advocated to partition a given big dataset into a group of manageable subsets. The IS methods can then be applied to these subsets individually. However, the current partitioning approaches greatly decrease the performance of the applied IS methods, which makes them not recommended in several application domains. The ultimate goal of this thesis is to propose a new operational, layered approach that addresses these challenges intelligently within an integrated framework.
The term "intelligent" refers to the capability of the methods and approaches proposed in the framework to balance a high reduction rate and classification accuracy on one side against low memory resources and computation time on the other. We commence by proposing three combination methods and then proceed to propose two novel methods to address the effectiveness challenge. More precisely, we combine two existing IS methods from the condensation and edition taxonomies in the instance selection layer. We then compare the behavior of the combined methods with both individual methods and with the iterative case filtering (ICF) method, which belongs to the hybrid taxonomy, using four standard datasets. The experimental results demonstrate that our combination methods approximately match the effectiveness results of the ICF method and remarkably improve on the effectiveness results of the individual methods. Moreover, our combination methods require far less computation time than the ICF method. At this point, we propose two novel methods by changing the local density function used in the density-based methods into a global one and adding two new functions, called the relevance and irrelevance functions, to identify relevant instances. These novel methods are compared with the current density-based methods using eighteen standard datasets. The experimental results show that our first proposed method achieves better classification accuracy than the density-based methods, while our second proposed method improves on the effectiveness results of both the first method and the density-based methods.
Having obtained positive results, we address the memory limitation challenge in the data partitioning layer, which sits above the instance selection layer. We first perform a comparative study between two well-known data partitioning approaches, Random Partitioning (RP) and Stratification Partitioning (SP), and the Non-Partitioning (NoP) approach using four standard datasets. The standard condensed nearest neighbor (CNN) method is used in the instance selection layer. The experimental results show that the two partitioning approaches lead to a high degradation in the performance of the CNN method compared to the NoP approach in terms of the reduction rate and effectiveness metrics. To address this degradation, we propose two new data partitioning approaches: (1) the first approach distributes the instances to the partitioned subsets based on a given distance metric and overlaps the instances in two subsets once they satisfy a given threshold, and (2) the second approach follows the same strategy as the first approach and additionally ensures an equal class distribution in the partitioned subsets. We also develop a non-overlapped version of each proposed approach to validate the importance of the overlapping. The CNN method is used to validate and compare the partitioning approaches using eight standard datasets. The experimental results show that our proposed approaches improve the performance in terms of the reduction rate and effectiveness metrics compared to the other employed approaches. Moreover, we advance the state of the art by performing a comparative study that validates and compares the performance of our partitioning approaches and three standard partitioning approaches using four IS methods from different taxonomies over eight standard datasets with different subset sizes to study their scalability.
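The RP, SP, and NoP setup described above can be illustrated with a minimal sketch. It is not the thesis implementation (the toolkit itself is written in Java): all function names, the toy two-class dataset, the squared Euclidean distance, and the choice of four subsets are assumptions made purely for illustration, and the proposed distance-based overlapping approaches are not reproduced here. The CNN routine follows the standard formulation, in which an instance is added to the store whenever the current store misclassifies it with 1-NN.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(store, x):
    """Label of the stored (features, label) pair closest to x (1-NN)."""
    return min(store, key=lambda inst: sq_dist(inst[0], x))[1]


def cnn(data):
    """Condensed nearest neighbor: add every instance that the current
    store misclassifies, and repeat until a full pass adds nothing."""
    if not data:
        return []
    store, in_store = [data[0]], {0}
    changed = True
    while changed:
        changed = False
        for i, (x, y) in enumerate(data):
            if i not in in_store and nearest_label(store, x) != y:
                store.append(data[i])
                in_store.add(i)
                changed = True
    return store


def random_partition(data, k, seed=0):
    """RP: shuffle a copy of the data and deal it into k subsets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def stratified_partition(data, k, seed=0):
    """SP: deal each class round-robin so class ratios are preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inst in data:
        by_class[inst[1]].append(inst)
    subsets = [[] for _ in range(k)]
    offset = 0  # continue dealing where the previous class stopped
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for i, inst in enumerate(members):
            subsets[(i + offset) % k].append(inst)
        offset += len(members)
    return subsets


def select_per_subset(subsets):
    """Apply CNN to each subset independently and merge the results."""
    selected = []
    for subset in subsets:
        selected.extend(cnn(subset))
    return selected


def reduction_rate(original, selected):
    """Fraction of instances removed by the selection step."""
    return 1.0 - len(selected) / len(original)


def toy_data(n_per_class=40, seed=1):
    """Hypothetical two-class 2-D dataset, for illustration only."""
    rng = random.Random(seed)
    data = []
    for label, center in (("a", 0.0), ("b", 3.0)):
        for _ in range(n_per_class):
            point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
            data.append((point, label))
    return data


if __name__ == "__main__":
    data = toy_data()
    print("NoP:", reduction_rate(data, cnn(data)))
    for name, split in (("RP", random_partition), ("SP", stratified_partition)):
        print(name + ":", reduction_rate(data, select_per_subset(split(data, 4))))
```

Because each subset is condensed in isolation, instances near a class boundary may be retained in several subsets, which is one plausible reason the merged result can have a lower reduction rate than the non-partitioned run.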
Based on the study findings, we introduce a set of recommendations that can help prospective researchers select an intelligent and appropriate combination of partitioning approaches and IS methods with respect to the required evaluation metric (e.g., reduction rate and classification accuracy). To increase the feasibility and usability of our framework, we integrate it with a suitable user interface layer, a user interface management and validation layer, and a system support layer. All these layers are seamlessly integrated and fully implemented using Archived Java to produce a new open toolkit that prospective researchers can use to perform comparable experiments.

Keywords: Big Data; Data Mining; Data Preprocessing; Data Reduction; Classification; Instance Selection; Instance-Based Learning; Knowledge Discovery in Databases. |