Author: Ali, Doaa Abd Allah Abd El Mohsen./ Title: Clustering Identification using Swarm Intelligent Techniques /

Title

Clustering Identification using Swarm Intelligent Techniques /

Author

Ali, Doaa Abd Allah Abd El Mohsen.

Committee

Researcher / Doaa Abd Allah Abd El Mohsen Ali

Supervisor / Ensaf Hussein

Supervisor / Hala Abdel-Galil

Subject

Computers and Information. Computer Science.

Publication date

2019.

Number of pages

115 p.

Language

English

Degree

Master's

Specialization

Multidisciplinary

Date of approval

1/1/2019

Place of approval

Helwan University - Faculty of Computers and Information - Computer Science

Abstract

Data clustering is an important data analysis and data mining technique. It is used in a variety of disciplines, such as information retrieval, machine learning, image analysis, pattern recognition, and bioinformatics. The K-means algorithm is a popular, simple, robust, and highly efficient clustering algorithm that can be applied to a wide variety of data types. However, K-means depends on its initial centroids, which can trap the algorithm in local optima and degrade clustering quality. It also requires prior knowledge of the number of clusters.

The main objective of this thesis is to introduce a method that overcomes these drawbacks of the K-means algorithm. Swarm intelligence (SI) algorithms are powerful at searching for global optima because they guide the iterative computation towards the best solutions. Because of this advantage, this thesis focuses on addressing the K-means drawbacks using SI algorithms.

First, a method for reducing the dependency on the initial centroids is proposed. This method integrates K-means clustering with four recent swarm intelligence algorithms: the Whale search algorithm, the Crow search algorithm, the Grasshopper swarm algorithm, and the Salp swarm algorithm. The resulting integrated methods are called C-Whale, C-Crow, C-Grasshopper, and C-Salp, respectively.

Second, a fully automatic clustering method is presented to find the optimal number of clusters. In this method, the C-Crow method is modified in its solution representation scheme and its clustering objective function.

Two measures were used to evaluate the clustering quality of the proposed methods: purity and the Rand index. The proposed integration methods were evaluated on eight numerical UCI datasets. The experimental results showed that the purity and Rand index of C-Crow, averaged over 100 runs, are higher than those of traditional K-means, C-Salp, C-Whale, and C-Grasshopper for almost all datasets. Based on the average purity over all datasets, C-Crow achieved the highest quality, 67.285%, while C-Salp, C-Whale, C-Grasshopper, and K-means achieved 66.44125%, 66.0875%, 66.17125%, and 63.91875%, respectively. Based on the average Rand index over all datasets, C-Crow achieved the highest accuracy, 69.27375%, while C-Salp, C-Whale, C-Grasshopper, and K-means achieved 67.96%, 68.04375%, 68.2275%, and 66.71875%, respectively. For fully automatic clustering, the experimental results revealed that C-Crow with the CS objective function (CS-Crow) is superior to the Elbow method at finding the optimal or near-optimal number of clusters. CS-Crow achieved an average purity of 63.293638% and an average Rand index of 69.8425%, whereas the Elbow method achieved 49.10325% and 63.122%, respectively.
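The two quality measures named above can be illustrated with a minimal sketch. It assumes the standard textbook definitions (purity as the fraction of points that fall in the majority class of their cluster, and the Rand index as the fraction of point pairs on which the clustering and the true classes agree); the thesis may compute them with different tooling. The function names and the six-point example are hypothetical.

```python
from collections import Counter
from itertools import combinations


def purity(true_labels, cluster_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    class_counts_per_cluster = {}
    for t, c in zip(true_labels, cluster_labels):
        class_counts_per_cluster.setdefault(c, Counter())[t] += 1
    majority_total = sum(max(counts.values())
                         for counts in class_counts_per_cluster.values())
    return majority_total / len(true_labels)


def rand_index(true_labels, cluster_labels):
    """Fraction of point pairs on which the two labelings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        agree += same_class == same_cluster  # pair grouped consistently
        total += 1
    return agree / total


# Hypothetical example: one point of class 0 is placed in the wrong cluster.
truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]
print(purity(truth, found))      # 5 of 6 points match their cluster's majority class
print(rand_index(truth, found))  # 10 of 15 pairs are grouped consistently
```

Both measures lie between 0 and 1, so the percentages reported above correspond to these values multiplied by 100.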
Chapter 1
Introduction
Overview
Clustering is the process of dividing a set of objects into groups such that the members of each group are similar to one another and dissimilar to the members of other groups. It is a main task in data mining and a common technique in statistical data analysis. It is used in various fields, such as image analysis, machine learning, data compression, pattern recognition, information retrieval, and bioinformatics. Clustering can be achieved with various algorithms, which differ in their notion of what constitutes a cluster and how to find clusters efficiently.
Popular notions of a cluster include small distances between group members, dense regions of the data space, and particular statistical distributions. Partitional clustering is one of the main categories of clustering methods. It decomposes a dataset into a set of disjoint clusters. Many partitional clustering algorithms try to minimize some measure of dissimilarity among the samples within each cluster while maximizing the dissimilarity between different clusters.
The most popular partitional clustering algorithm is K-means, an unsupervised learning algorithm for partitioning a dataset into disjoint subgroups. Despite its long history, K-means remains widely used and has been ranked among the top 10 data mining algorithms. Nevertheless, it suffers from some drawbacks. First, its results depend on the initial cluster centers, which can cause it to fall into local optima. Second, the number of clusters has to be known a priori.
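The dependence on the initial centers can be seen in a small sketch. It assumes the usual Lloyd-style K-means iteration with squared Euclidean distance; the four-point dataset, the function names, and the two sets of starting centers are hypothetical and chosen only so that the effect is visible.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points]


def kmeans(points, initial_centers, max_iter=100):
    """Plain Lloyd iterations starting from the given initial centers."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                # Move the center to the mean of its members.
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[k])  # an empty cluster keeps its center
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse


# Four corners of a 4-by-1 rectangle; the natural clusters are left and right.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
_, _, sse_poor = kmeans(points, [(2, 0), (2, 1)])  # splits bottom from top
_, _, sse_good = kmeans(points, [(0, 0), (4, 1)])  # splits left from right
print(sse_poor, sse_good)
```

With the first pair of starting centers the iteration converges immediately to a bottom/top split, which is a local optimum with a much larger sum of squared errors than the left/right split found from the second pair.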
Swarm intelligence (SI) is a category of artificial intelligence inspired by the cooperative social behaviour of swarms of natural creatures, for example when catching prey or foraging for food. The search process of a swarm algorithm begins with a randomly generated population, which evolves over successive generations. The strength of these methods lies in combining the best individuals to form the individuals of the next generation, so that the population is optimized over the course of generations. Swarm intelligence algorithms are characterized by their simplicity of implementation, their ability to handle many different kinds of optimization problems, and their ability to search multimodal landscapes while avoiding local optima.
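As an illustration of such a population-based search, the following is a minimal sketch of a crow-search-style optimizer applied to cluster centers. It follows the commonly published description of crow search (each crow remembers its best position and follows the memory of a randomly chosen crow unless that crow is "aware"); it is not taken from the thesis, and the C-Crow method may differ. The centroid encoding, the parameter names, and all parameter values are assumptions made for the example.

```python
import random


def sse(points, flat_centers, dim):
    """Sum of squared distances from each point to its nearest encoded center."""
    centers = [flat_centers[i:i + dim] for i in range(0, len(flat_centers), dim)]
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in points)


def crow_search(fitness, n_dims, lower, upper, n_crows=20, n_iter=200,
                flight_length=2.0, awareness_prob=0.1, seed=0):
    """Minimize `fitness` over a box; all parameter values are illustrative."""
    rng = random.Random(seed)

    def random_position():
        return [rng.uniform(lower, upper) for _ in range(n_dims)]

    # The population starts at random positions.
    positions = [random_position() for _ in range(n_crows)]
    memories = [list(p) for p in positions]  # best position each crow has seen
    memory_fitness = [fitness(m) for m in memories]

    for _ in range(n_iter):
        for i in range(n_crows):
            j = rng.randrange(n_crows)  # crow i follows a randomly chosen crow j
            if rng.random() >= awareness_prob:
                # Crow j is unaware: move toward its remembered best position.
                r = rng.random()
                candidate = [x + r * flight_length * (m - x)
                             for x, m in zip(positions[i], memories[j])]
            else:
                # Crow j is aware: crow i ends up at a random position.
                candidate = random_position()
            # Only moves that stay inside the search bounds are accepted.
            if all(lower <= v <= upper for v in candidate):
                positions[i] = candidate
                f = fitness(candidate)
                if f < memory_fitness[i]:
                    memories[i] = list(candidate)
                    memory_fitness[i] = f

    best = min(range(n_crows), key=lambda i: memory_fitness[i])
    return memories[best], memory_fitness[best]


# Hypothetical use: search for two 2-D centers on a four-point dataset.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
best_centers, best_sse = crow_search(lambda c: sse(points, c, dim=2),
                                     n_dims=4, lower=0.0, upper=4.0)
print(best_centers, best_sse)
```

Because every crow keeps its own best-so-far memory and occasionally jumps to a random position, the search is not tied to a single starting point in the way a single K-means run is.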
Research Problem
The research problem is to enhance the performance of the K-means algorithm by overcoming the local optima problem, which results from the algorithm's dependence on the initial centers and leads to poor clustering results. In addition, the K-means algorithm requires the number of clusters as an input, so finding the optimal number of clusters plays a key role in improving the clustering results.