Abstract: Recently, there has been a considerable increase in the availability of full-text document collections in electronic form. This has created a need for tools and techniques that help users organize these collections, and in particular for solutions to information categorization. This is especially true for web-based documents. Among the main methods for categorization is document clustering. Document clustering attempts to organize objects into groups such that objects within a group are more similar to each other than to objects belonging to different groups. Generally, any clustering technique can be divided into two stages. The first stage is the data representation model. The second stage is the clustering algorithm that produces the clusters based on the input data representation. In this thesis, we propose a promising method for data representation. The approach uses phrases rather than individual words as document features for document clustering. A hierarchical phrase grammar was used to extract frequently occurring phrases. These phrases, combined with words, form the features representing the documents. In addition, we propose two novel clustering methods based on unsupervised neural networks. These methods are the Hierarchical SOMART (HSOMART) and the Two Level-SOMART (TL-SOMART). Both are based on two successful models of unsupervised neural networks, namely the Self-Organizing Map (SOM) and Adaptive Resonance Theory (ART). Both models have demonstrated promising results in the task of document clustering. These approaches are suitable for textual input, being capable of identifying structure of high dimensionality within a body of natural language text. They are also capable of successfully handling data that contains noise.
The HSOMART method is built from hierarchically organized, combined SOM and ART neural networks with a layered architecture in which each layer consists of a number of independent SOMs or ARTs. The key idea of HSOMART is to combine the fast learning capability of SOM to generate compact clusters with the accuracy of the clusters produced by ART. In TL-SOMART, on the other hand, the SOM is used as a dimension reduction method in the first stage. This is achieved by mapping a high-dimensional data space based on words or phrases into a low-dimensional space based on clusters produced by multiple SOMs. The ART in the second stage is used, as in HSOMART, to produce the final clusters using a reduced vector space. Experimental results on the REUTERS corpus are presented. The results show a significant improvement from the suggested data representation and clustering methods, as evaluated by entropy and the F-measure. They also show that clustering using phrase-based features combined with words achieves better quality than clustering using words only, and they demonstrate an improvement in clustering performance using HSOMART and TL-SOMART in both quality and execution time.
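The two evaluation measures named above can be illustrated with a minimal sketch. It assumes the definitions commonly used in the document clustering literature (size-weighted cluster entropy, and a class-weighted best-match F-measure); the abstract does not state the exact formulas used in the thesis, so the function names `clustering_entropy` and `clustering_f_measure` and the base-2 logarithm are illustrative assumptions.

```python
import math
from collections import Counter


def clustering_entropy(labels, clusters):
    """Size-weighted average entropy of the true classes inside each cluster.

    Assumed definition (not taken from the thesis): lower is better, and
    0.0 means every cluster contains documents of a single class.
    """
    n = len(labels)
    cluster_sizes = Counter(clusters)
    pair_counts = Counter(zip(clusters, labels))
    total = 0.0
    for (cluster, _label), n_ij in pair_counts.items():
        p = n_ij / cluster_sizes[cluster]
        # Weight each term by the cluster's share of all documents.
        total -= (cluster_sizes[cluster] / n) * p * math.log2(p)
    return total


def clustering_f_measure(labels, clusters):
    """Class-weighted F-measure, taking the best-matching cluster per class.

    Assumed definition (not taken from the thesis): higher is better, and
    1.0 means every class is recovered exactly by one cluster.
    """
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    pair_counts = Counter(zip(clusters, labels))
    best = {}
    for (cluster, label), n_ij in pair_counts.items():
        precision = n_ij / cluster_sizes[cluster]
        recall = n_ij / class_sizes[label]
        f = 2 * precision * recall / (precision + recall)
        best[label] = max(best.get(label, 0.0), f)
    return sum(class_sizes[c] / n * best[c] for c in class_sizes)


if __name__ == "__main__":
    # Hypothetical toy data: four documents, two true classes, two clusters.
    true_labels = ["grain", "grain", "trade", "trade"]
    assigned = [0, 0, 1, 1]
    print(clustering_entropy(true_labels, assigned))
    print(clustering_f_measure(true_labels, assigned))
```

Under these assumed definitions, a better clustering shows lower entropy and a higher F-measure, which is the direction of improvement the abstract reports.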