Title
Novel Methodologies for discovering clusters of arbitrary shapes & densities in high dimensional data, with applications /
Publisher
Noha Abdel Rahman Yousri,
Author
Yousri,Noha Abdel Rahman
Subject
Data analysis Computer science
Publication date
2008 .
Number of pages
viii,158P.:
Index
Only 14 pages are available for public view

from 174

Abstract

In the presence of large high dimensional data sets, efficient clustering algorithms are required to discover clusters of unknown sizes, shapes and densities. This need stems from several drawbacks of current algorithms: mainly static clustering models, poor time performance and difficulty in selecting parameters.
First, a novel dynamic model for clustering is proposed based on distance relatedness concepts. Both local and global distance relatedness measures are developed for this purpose. These are the building blocks of the proposed clustering criteria, which culminate in the "Mitosis" algorithm. Comparing "Mitosis" to current clustering algorithms shows its relative efficiency with respect to both the quality of the clustering obtained and the time performance. An accompanying parameter selection scheme, which promotes the applicability of the algorithm, is also presented.
To further validate the proposed model in real-life applications, the algorithm is applied to gene expression data, a direction that reveals another dimension in analyzing gene expression, one only slightly known in the literature and not from a clustering-based perspective. The main achievement in this application is discovering the difference between the coherency of gene expression obtained by traditional algorithms and the connectedness revealed by "Mitosis". A number of gene expression data sets are used to illustrate the efficiency of the proposed clustering algorithm compared to related algorithms used in this area.
Secondly, exploratory and analytic methodologies that complement the clustering algorithm are proposed. These are: a neighbour searching algorithm, a validity measure, outlier pattern analysis and core pattern analysis. The first is a pre-clustering step, and its enhancement improves the time performance of the algorithm. The second is a fundamental need for evaluating clustering solutions and is still in its infancy when considering arbitrarily shaped clusters. Outlier and core pattern analysis differ from these in that they play a role in completing the data analysis task: outlier analysis further investigates possible novelties in the data, while core pattern analysis provides more insight into both the data and the cluster structure.