Abstract In recent years, the size of data in several areas has increased enormously, prompting a search for techniques that make proper use of these data. Many researchers and data analysts are developing methods to make the data valuable and useful, which has led to the emergence of big data science, a field that aims to process big data in parallel. Meanwhile, traditional data analysis techniques are no longer effective because of the volume, velocity, variety, variation, and veracity of such data. Novel data mining and machine learning techniques are therefore required to deal with large volumes of data. However, applying traditional clustering techniques under these new challenges is difficult. For example, the K-Means algorithm selects the initial centre of each cluster at random, and it is hard to adapt to distributed execution, which it needs in order to keep up with the huge amount of data. To address this weakness of K-Means, a search method based on a genetic algorithm is proposed to find the initial centre of each cluster instead of selecting it at random. In addition, a feature selection algorithm is used to reduce the dimensionality of the data and prepare it as input to the K-Means algorithm. MapReduce, a leading big data processing model capable of handling large datasets, and the genetic-based K-Means algorithm together form the backbone of this research. The proposed model is called GK-Means. In the experiments conducted as part of this study, the feature selection algorithm achieved improvement rates above 80%. In addition, applying the genetic algorithm, and combining the genetic-based K-Means with MapReduce, achieved improvement rates over 40%. The proposed algorithm was evaluated using the execution time and the best total sum of distances.
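The idea of replacing random initial centres with a genetic search can be illustrated with a minimal sketch. The abstract does not describe the chromosome encoding, the genetic operators, or the parameters used in GK-Means, so everything below is an assumption made for illustration: a chromosome is a set of k data-point indices, fitness is the total sum of squared distances to the nearest centre, and the function names (`ga_initial_centres`, `kmeans`, `total_distance`) and settings are hypothetical. Feature selection and the MapReduce distribution step are left out.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def total_distance(points, centres):
    """Fitness: sum of squared distances to the nearest centre (lower is better)."""
    return sum(min(sq_dist(p, c) for c in centres) for p in points)

def ga_initial_centres(points, k, pop_size=20, generations=30,
                       mutation_rate=0.2, seed=0):
    """Search for k initial centres with a simple elitist genetic algorithm.

    Assumed encoding: a chromosome is a list of k distinct point indices.
    """
    rng = random.Random(seed)
    n = len(points)
    population = [rng.sample(range(n), k) for _ in range(pop_size)]

    def fitness(chrom):
        return total_distance(points, [points[i] for i in chrom])

    for _ in range(generations):
        population.sort(key=fitness)
        survivors = population[: pop_size // 2]  # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Crossover: draw k distinct genes from the union of both parents.
            child = rng.sample(sorted(set(a) | set(b)), k)
            if rng.random() < mutation_rate:
                # Mutation: swap one gene for an index not yet in the child.
                unused = [i for i in range(n) if i not in child]
                if unused:
                    child[rng.randrange(k)] = rng.choice(unused)
            children.append(child)
        population = survivors + children

    best = min(population, key=fitness)
    return [points[i] for i in best]

def kmeans(points, centres, iterations=20):
    """Plain K-Means (Lloyd's algorithm) started from the given centres."""
    centres = [tuple(c) for c in centres]
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            j = min(range(len(centres)), key=lambda i: sq_dist(p, centres[i]))
            clusters[j].append(p)
        # An empty cluster keeps its previous centre.
        new_centres = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centres)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    return centres

# Synthetic demo data: three Gaussian blobs in two dimensions.
data_rng = random.Random(1)
blobs = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in blobs for _ in range(30)]

seeds = ga_initial_centres(points, k=3)  # genetic search for initial centres
final = kmeans(points, seeds)            # refine them with K-Means
print(total_distance(points, seeds), total_distance(points, final))
```

Because the better half of the population survives each generation, the best fitness never gets worse during the search. The printed pair shows the total sum of distances before and after the K-Means refinement, which is the same kind of measure the abstract reports.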