Author: Hameed, Bashar Ibrahim./ Title: Data cleaning in big databases /


Title

Data cleaning in big databases /

Author

Hameed, Bashar Ibrahim.

Preparation committee

Researcher / Bashar Ibrahim Hameed

Supervisor / Ahmed Abou El-Fotouh Saleh

Supervisor / Mervat Mostafa Abou El-Kheir

Examiner / Hazem Mokhtar El-Bakry

Subject

Information technology. Electronic data processing - Quality control.

Publication date

2015.

Number of pages

86 p.

Language

English

Degree

Master's

Specialization

Information Systems

Approval date

01/01/2015

Place of approval

Mansoura University - Faculty of Computers and Information - Department of Information Systems

Abstract

Society's reliance on massive quantities of electronically stored data has been accompanied by a corresponding increase in errors in that data. Real-world data have significant problems that must be resolved before processing. Such problems may arise during data warehousing, which involves collecting data from multiple organizational sources, or through data entry or other causes. Data cleaning is an approach designed to deal with quality problems that arise when integrating different data sources, handling them together before the data are loaded into the data warehouse. The proposed system performs data cleaning using machine learning algorithms, especially soft computing techniques. It eliminates irrelevant attributes of the submitted data, together with redundant information that can be derived from other attributes, based on an evaluation step, and then applies instance selection and consistency cleanup to produce cleaned data. The system consists of four phases. First, a consistency checker examines the submitted data before and after the cleaning process. Second, a feature subset selection evaluator, either correlation-based or fuzzy-rough-based, is paired with one of two search algorithms: Hill Climbing or Best First search. Third, instance selection reduces and pre-processes the dataset using fuzzy rough sets to eliminate redundant, duplicated and noisy instances. Finally, a missing value handler replaces missing values with the mean of the attribute. The work benchmarks the fuzzy rough mechanism, used for both feature selection and instance reduction, as a cleaning tool. The system was evaluated on five datasets that differ in structure, type and values, and was compared on the basis of the accuracy and consistency of the data.
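As a rough illustration of the first and fourth phases, the sketch below shows one way a consistency measure and mean imputation could be written. It is not the thesis's implementation: the function names, the sample rows, and the definition of consistency (the share of rows whose class agrees with the majority class among rows with identical feature values) are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

def impute_mean(rows, columns):
    """Return a copy of rows with None in the given numeric columns
    replaced by the mean of the observed values of that column."""
    filled = [dict(r) for r in rows]
    for col in columns:
        observed = [r[col] for r in rows if r[col] is not None]
        if not observed:
            continue  # no observed values, so no mean can be estimated
        col_mean = mean(observed)
        for r in filled:
            if r[col] is None:
                r[col] = col_mean
    return filled

def consistency_rate(rows, features, label):
    """Assumed definition: fraction of rows whose label matches the
    majority label among rows sharing the same feature values."""
    groups = defaultdict(lambda: defaultdict(int))
    for r in rows:
        key = tuple(r[f] for f in features)
        groups[key][r[label]] += 1
    consistent = sum(max(counts.values()) for counts in groups.values())
    return consistent / len(rows)

# Hypothetical sample data: one missing income, one conflicting label.
rows = [
    {"age": 30, "income": 10.0, "cls": "a"},
    {"age": 30, "income": None, "cls": "a"},
    {"age": 40, "income": 20.0, "cls": "b"},
    {"age": 40, "income": 20.0, "cls": "a"},
]
filled = impute_mean(rows, ["income"])
rate = consistency_rate(rows, ["age"], "cls")
```

Running the consistency measure before and after the reduction steps would give the kind of before/after comparison the first phase describes.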
The proposed work is an extensible framework for data cleaning that uses data reduction, of both features and instances, as its cleaning tools. Its importance is that feature selection preserves the original meaning of the features after reduction, while instance reduction shows that the number of instances can be reduced while preserving high classification accuracy and gradually ensuring consistency. Data cleaning using fuzzy rough set theory is shown to achieve equal or better classification accuracy compared with the results on the original data, and it outperforms the other approaches in the comparative study.
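The correlation-based variant of the feature selection phase can be sketched as a greedy hill-climbing search over feature subsets. The example below is a minimal sketch under stated assumptions, not the thesis's method: it scores subsets with the standard correlation-based feature selection merit, k * r_cf / sqrt(k + k(k-1) * r_ff), uses absolute Pearson correlation as the correlation measure, and omits both the Best First search and the fuzzy-rough evaluator. The feature names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def cfs_merit(subset, data, target):
    """Merit rewards correlation with the target and penalises
    correlation among the selected features themselves."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(data[f], target)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = sum(abs(pearson(data[a], data[b])) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

def hill_climb_select(data, target):
    """Greedy forward search: add the feature that most improves the
    merit, and stop as soon as no addition improves it."""
    selected, best = [], 0.0
    remaining = sorted(data)
    while remaining:
        scored = [(cfs_merit(selected + [f], data, target), f) for f in remaining]
        merit, feature = max(scored)
        if merit <= best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = merit
    return selected, best

# Hypothetical data: one relevant feature, one largely redundant, one unrelated.
data = {
    "relevant": [1, 2, 3, 4],
    "redundant": [1, 3, 2, 4],
    "noise": [1, -1, -1, 1],
}
target = [0, 0, 1, 1]
selected, merit = hill_climb_select(data, target)
```

Because the search only keeps or drops original columns, the retained features stay interpretable, which is the property of feature selection that the abstract highlights.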