Author: Ahmed, Hadia Mohamed Raafat Mohmoud. / Title: Developing a Database Management System for Biological Data

Title

Developing a Database Management System for Biological Data

Author

Ahmed, Hadia Mohamed Raafat Mohmoud.

Preparation committee

Researcher / هادية محمد رأفت محمود أحمد

Supervisor / حسني محمد ابراهيم

Examiner / محمد سعيد عبد الوهاب

Examiner / حامد محمد نصار

Subject

Database system.

Publication date

2010.

Number of pages

147 p.

Language

English

Degree

Master's

Specialization

Information Systems

Date of approval

26/9/2010

Place of approval

Assiut University - Faculty of Computers and Information - Information Systems

Abstract

The past two decades have seen an explosion in the amount of available biological data, driven by major advances in molecular biology and genomics. As a result, computers have become an essential tool for helping biological researchers store, interpret, and analyze these data. One of the biggest challenges facing biologists is storing this mass of data efficiently, so that it can be accessed easily and reliably and useful information can be extracted from it.
A database management system (DBMS) is among the most powerful kinds of software that business corporations use to manage large, complex data and to produce useful reports quickly. Biologists adopted DBMSs to organize their data and access it efficiently. However, biological data differ in character from the business data that current DBMSs are designed for, and they require special tools for management and analysis.
This thesis presents an extensive study of current DBMSs and how they handle biological data. It shows that most available general-purpose DBMSs treat biological data like any other business data, ignoring their peculiar nature. Specialized DBMSs are built on top of the general-purpose ones and provide tools for this kind of data. However, they ship the data as part of the system and do not allow biologists to modify it or add new data.
This work proposes a new DBMS that is independent of general-purpose DBMSs. The proposed system stores the database in XML format and uses XML Schema to define the database structure. Users can create their own databases, modify their content, and query them using SQL to retrieve consistent results. The system extends SQL with special functions for biological data, provides built-in data types that handle the special characteristics of these data, and lets users create their own data types.
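To make the idea of an XML-backed store with biology-aware queries concrete, here is a minimal sketch. It is not the thesis's implementation: the XML layout, the `select` helper, and the `gc_content` function are all hypothetical, chosen only to illustrate how a SQL-style selection with a sequence-specific function could run over rows kept in XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout for one table; the abstract does not show the real schema.
XML_DB = """
<database name="demo">
  <table name="genes">
    <row><id>1</id><name>geneA</name><sequence>ATGCGC</sequence></row>
    <row><id>2</id><name>geneB</name><sequence>ATATAT</sequence></row>
  </table>
</database>
"""


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a DNA sequence (illustrative 'biological' function)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def select(root, table, columns, where=None):
    """Rough equivalent of: SELECT columns FROM table WHERE where(row)."""
    results = []
    for row in root.findall(f"./table[@name='{table}']/row"):
        # Turn each <row> into a plain dict of column name -> text value.
        record = {child.tag: child.text or "" for child in row}
        if where is None or where(record):
            results.append(tuple(record[c] for c in columns))
    return results


root = ET.fromstring(XML_DB)

# Comparable in spirit to: SELECT name FROM genes WHERE GC_CONTENT(sequence) > 0.5
high_gc = select(root, "genes", ["name"], where=lambda r: gc_content(r["sequence"]) > 0.5)
print(high_gc)
```

In this sketch the predicate is a Python callable rather than parsed SQL text; a full system would also need a SQL parser and XML Schema validation, which are omitted here.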
The proposed system includes two of the most important tools in bioinformatics. The first is NCBI BLAST, which performs sequence similarity searches on biological sequence data and displays the results in different formats. The second, protein automatic annotation (PAAT), is a newly proposed tool for predicting protein function. Both tools require an Internet connection, because they use web services (NCBI-BLAST, UniProt Dbfetch) to run biological tools remotely and transparently, without user intervention.
PAAT uses a hybrid approach that combines homology-based and machine learning methods to build a dynamic decision tree classifier, which is then used to predict protein function. It takes into account the hierarchical information of the Gene Ontology (GO) Molecular Function ontology.
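The abstract does not describe PAAT's classifier in detail, so the following is only a rough illustration of how hierarchical information can enter a homology-based prediction. It is a simple voting scheme, not a decision tree and not PAAT's algorithm: each term annotated on a homolog also counts as support for its ancestors, and the deepest term with enough support is chosen. The hierarchy is a simplified toy rather than the real GO graph, and `predict` and `min_support` are hypothetical names.

```python
from collections import Counter

# Toy is-a hierarchy (child -> parent); simplified and NOT the real GO graph.
PARENT = {
    "protein kinase activity": "kinase activity",
    "lipid kinase activity": "kinase activity",
    "kinase activity": "catalytic activity",
    "catalytic activity": "molecular_function",
    "binding": "molecular_function",
}


def with_ancestors(term):
    """Return the term followed by all of its ancestors up to the root."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(term):
    """Number of is-a steps from the term to the root."""
    return len(with_ancestors(term)) - 1


def predict(homolog_terms, min_support=0.6):
    """Pick the deepest term supported by at least min_support of the homologs."""
    if not homolog_terms:
        return None
    votes = Counter()
    for term in homolog_terms:
        votes.update(with_ancestors(term))  # a hit also supports its ancestors
    needed = min_support * len(homolog_terms)
    supported = [t for t, n in votes.items() if n >= needed]
    return max(supported, key=depth)


# Terms annotated on four hypothetical homologs of a query protein.
hits = [
    "protein kinase activity",
    "lipid kinase activity",
    "protein kinase activity",
    "binding",
]
prediction = predict(hits)
print(prediction)
```

Here no single specific term reaches the support threshold, so the sketch falls back to the more general parent that three of the four homologs agree on. This is the kind of trade-off between specificity and confidence that a hierarchy-aware predictor has to make.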
A number of tests were carried out to assess the performance of the proposed system and its new tool. The system behaves as expected and produces XML database files that contain the user data. To test PAAT, 20,269 human genome protein records from the UniProt database were used, and PAAT was compared with other current tools: FUNcat, which is based on machine learning (a neural network), and PEDANT. PAAT correctly classified 87% of instances, compared with 66% for FUNcat and 71% for PEDANT.