Abstract
Society's reliance on massive quantities of electronically stored data has led to a corresponding increase in errors in this data. Real-world data have significant problems that must be solved before the processing phase. Such problems may arise during data warehousing, which involves collecting data from multiple organizational sources, during data entry, or from other causes. Data cleaning is an approach designed to deal with quality problems that occur when different data sources are integrated and handled together before being loaded into a data warehouse. The proposed system performs data cleaning using machine learning algorithms, in particular soft computing techniques. It eliminates attributes that are irrelevant to the submitted data, along with redundant information that can be derived from other attributes, based on an evaluation measure; it then applies instance selection and consistency cleanup to produce cleaned data. The proposed system consists of four phases. First, a consistency checker assesses the submitted data before and after the cleaning process. Second, a feature subset selection evaluator, which can be correlation-based or fuzzy-rough-based, is paired with one of two search algorithms: hill climbing or best-first search. Third, instance selection reduces and pre-processes the dataset using fuzzy rough sets to eliminate redundant, duplicated and noisy instances. Finally, a missing value handler replaces missing values with the mean of the attribute. The work benchmarks the fuzzy rough mechanism, used for both feature selection and instance reduction, as a cleaning tool. The system was evaluated on five datasets that differ in structure, type and values, and the results were compared in terms of the accuracy and consistency of the data.
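As a rough illustration of the first and last phases only, the sketch below shows one way a consistency check and mean-based missing value replacement could be written. It is not the system's implementation: the function names `impute_mean` and `consistency_rate` are hypothetical, and the consistency measure used here (the share of instances that agree with the majority class among instances with identical attribute values) is an assumed definition, since the abstract does not state which measure the system uses.

```python
from collections import Counter, defaultdict


def impute_mean(rows, missing=None):
    """Replace missing values in each column with the mean of that
    column's observed values (hypothetical helper, numeric columns only)."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not missing]
        # A column with no observed values is left as missing.
        means.append(sum(observed) / len(observed) if observed else missing)
    return [
        [means[j] if r[j] is missing else r[j] for j in range(n_cols)]
        for r in rows
    ]


def consistency_rate(rows, labels):
    """Assumed measure: group instances with identical attribute values,
    count the majority class in each group, and divide by the total."""
    groups = defaultdict(Counter)
    for r, y in zip(rows, labels):
        groups[tuple(r)][y] += 1
    consistent = sum(max(c.values()) for c in groups.values())
    return consistent / len(labels)


# Toy data invented for this example: two numeric attributes, two gaps.
raw = [[1.0, 2.0], [None, 4.0], [3.0, None], [2.0, 3.0]]
cleaned = impute_mean(raw)

# Two identical instances with different class labels are inconsistent.
rate = consistency_rate([[1, 0], [1, 0], [0, 1]], ["x", "y", "y"])
```

Running the same `consistency_rate` on the data before and after cleaning gives a simple before/after comparison in the spirit of the first phase.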
The proposed work presents an extensible framework for data cleaning that uses data reduction, of both features and instances, as the cleaning tool. The value of the proposed system is that feature selection preserves the original meaning of the features after reduction, while instance reduction lowers the number of instances and still preserves high classification accuracy, with consistency progressively guaranteed. Data cleaning using fuzzy rough set theory is shown to achieve comparable or better classification accuracy than the original data, and it outperforms the other approaches in the comparative study.