Abstract
Data integration has become a backbone of many essential and widely used services; these services depend on integrating data from multiple sources quickly and efficiently in order to provide the level of service they are committed to. As the volume of data available in different environments has grown very large, and as systems are heterogeneous and autonomous, data integration has become a crucial part of most modern systems. Data integration is defined as the process of combining data from heterogeneous sources so that it can be used as one unified source. This definition does not prescribe how integration takes place. The technique applied is therefore the user's choice, and it is determined by the hosting environments and the systems' needs. According to how integration takes place, there are two common techniques for data integration: the Virtual View approach and the Materialized View approach. In the Virtual View approach, data is accessed from the local sources on demand (e.g. data federation), while in the Materialized View approach data is extracted in advance, translated, filtered, possibly merged with relevant data from other sources, and then stored in a (logically) centralized repository (e.g. data warehouses). With rapidly changing business, user, and environment requirements, the costs of maintaining and modifying Materialized Views have become very high; thus, the Virtual View approach has become a good candidate under these conditions. Furthermore, the emergence of Web services developments and standards in support of automated business integration has driven major technological advances in the integration software space, most notably the Service-Oriented Architecture (SOA). Some data integration systems have adopted the Service Oriented (SO) model and have shown a better and more organized design.
Another factor affecting modern systems is the shift from increasing the amount of resources (scale-up) to solve large-scale problems to tying together many low-end/commodity machines (scale-out) as a single functional distributed system, which gives the illusion of endless resources, as provisioned by modern infrastructures such as Grids and Clouds. This implies adopting a new processing model to benefit from these infrastructures, as proposed by the MapReduce distributed processing model. As a result of all these factors, this study brings together data integration, Service Orientation, and distributed processing to provide a combination that improves performance, especially with a large number of data sources, and that can be hosted efficiently on modern infrastructures such as Clouds. Therefore, the Service Oriented Data Integration based on MapReduce (SODIM) system is proposed in this thesis to benefit from the emerging distributed processing model (MapReduce) and from the loose coupling provided by Service Orientation and web services, in order to provide more extensibility, agility, reliability, and fault tolerance. The thesis provides a detailed description of how these techniques were brought together to eliminate the restrictions of current systems and provide further enhancements. An implementation is provided as a proof of concept, and a case study is introduced as an evaluation method.
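As a rough illustration of how a Virtual View query over several sources can be phrased in MapReduce terms, the following minimal Python sketch may help. It is not the SODIM implementation described in the thesis: the two source functions, the field names, and the `integrate` helper are all hypothetical, and everything runs in a single process rather than on a cluster. Each source function stands in for a web-service wrapper around one autonomous local source; the map step tags every record with a shared key, and the reduce step merges the records that share a key into one unified record.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Source = Callable[[], List[Record]]


# Hypothetical sources: each function stands in for a service that
# exposes one local data source on demand (Virtual View style).
def crm_source() -> List[Record]:
    return [{"id": "1", "name": "Ada"}, {"id": "2", "name": "Linus"}]


def billing_source() -> List[Record]:
    return [{"id": "1", "balance": "40"}, {"id": "3", "balance": "7"}]


def map_phase(source: Source) -> List[Tuple[str, Record]]:
    """Map: emit (key, record) pairs for every record of one source."""
    return [(rec["id"], rec) for rec in source()]


def shuffle(pairs: List[Tuple[str, Record]]) -> Dict[str, List[Record]]:
    """Shuffle: group the emitted records by key."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for key, rec in pairs:
        groups[key].append(rec)
    return groups


def reduce_phase(records: List[Record]) -> Record:
    """Reduce: merge all records sharing a key into one unified record."""
    merged: Record = {}
    for rec in records:
        merged.update(rec)
    return merged


def integrate(sources: List[Source]) -> Dict[str, Record]:
    """Run map, shuffle, and reduce over all sources (single process)."""
    pairs = [pair for source in sources for pair in map_phase(source)]
    return {key: reduce_phase(recs) for key, recs in shuffle(pairs).items()}


unified = integrate([crm_source, billing_source])
```

In this sketch, adding a source only means appending one more function to the list passed to `integrate`, which loosely mirrors the loose coupling the abstract attributes to Service Orientation. On a real MapReduce platform the map calls would run in parallel across machines.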