Author: Al-Kuhlani, Al-Hasan Ali Hefdhaldin. / Title: Designing a Smart System for Glycan Analysis using Artificial Intelligence Techniques

Title

Designing a Smart System for Glycan Analysis using Artificial Intelligence Techniques

Author

Al-Kuhlani, Al-Hasan Ali Hefdhaldin.

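Preparation committee

Researcher / Al-Hasan Ali Hefdhaldin Al-Kuhlani (الحسن علي حفظ الدين الكحلاني)

Supervisor / Abdel-Badeeh Mohamed Salem (عبد البديع محمد سالم)

Supervisor / Mohamed Ismail Roushdy (محمد إسماعيل رشدي)

Supervisor / Walaa Khaled Bin Al-Waleed (ولاء خالد بن الوليد)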

Publication date

2023.

Number of pages

216 p.

Language

English

Degree

Doctorate

Specialization

Computer Science (miscellaneous)

Approval date

1/1/2023

Place of approval

Ain Shams University - Faculty of Computer and Information Sciences - Computer Science

Abstract

Glycans, or complex carbohydrates, are essential molecules that play critical roles in numerous biological processes and have been associated with various diseases. Traditional experimental methods of glycan analysis are costly and time-consuming. Moreover, the complexity of glycan structures, the ever-increasing volume of glycobiology data, and the intricate relationships between glycans and other biological molecules, especially proteins, further complicate glycan analysis. To address these challenges, glycoinformatics has emerged as a subfield of bioinformatics aimed at developing software, algorithms, and databases for processing and mining glycan data.
Integrating glycoinformatics with Artificial Intelligence (AI) techniques has led to significant advances in glycan analysis, enabling researchers to explore the complexity and diversity of glycans and their relationships with other biological molecules. AI has proven useful in glycan analysis for predicting glycosylation and glycation sites, as well as glycan classification, but challenges persist. Therefore, this thesis presents three AI-based systems to improve glycosylation/glycation site prediction and a system for glycan classification.
The first system, named PUStackNGly, presents a method for predicting N-linked glycosylation using bagging Positive-Unlabeled (PU) learning and stacking ensemble machine learning. Several feature extraction descriptors are employed to obtain comprehensive sequence- and structure-based features, followed by ensemble-based feature selection to choose the most significant and stable features. PU learning selects reliable negative samples from the unlabeled samples, after which stacking ensemble learning is performed using four base classifiers: logistic regression, artificial neural networks, random forest, and support vector machine. Experimental results demonstrate that PUStackNGly performs significantly better than supervised learning methods and existing tools, with 95.11% accuracy, 100% recall, 80.7% precision, 89.32% F1 score, 96.93% AUC, and 0.87 MCC on an independent dataset.
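The reliable-negative selection step can be illustrated with a minimal bagging PU sketch. It is not the thesis's implementation: the nearest-centroid scorer, the function names, the `fraction` threshold, and the toy 2-D data are all assumptions made for illustration, and the stacking stage is not shown. In each round a random subset of the unlabeled samples is treated as provisional negatives, a scorer is fitted, and the out-of-bag unlabeled samples are scored; samples with the lowest average score are kept as reliable negatives.

```python
import random

def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def positive_score(x, pos_c, neg_c):
    """Score in [0, 1]; higher means x lies closer to the positive centroid."""
    dp, dn = sq_dist(x, pos_c), sq_dist(x, neg_c)
    return 0.5 if dp + dn == 0 else dn / (dp + dn)

def bagging_pu_scores(positives, unlabeled, n_rounds=50, seed=0):
    """Average out-of-bag positive score for every unlabeled sample."""
    rng = random.Random(seed)
    totals = [0.0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    bag_size = min(len(positives), len(unlabeled) - 1)
    pos_c = centroid(positives)
    for _ in range(n_rounds):
        # Treat a random bag of unlabeled samples as provisional negatives.
        bag = set(rng.sample(range(len(unlabeled)), bag_size))
        neg_c = centroid([unlabeled[i] for i in bag])
        for i, x in enumerate(unlabeled):
            if i not in bag:  # score only out-of-bag samples
                totals[i] += positive_score(x, pos_c, neg_c)
                counts[i] += 1
    return [t / c if c else 0.5 for t, c in zip(totals, counts)]

def reliable_negatives(positives, unlabeled, fraction=0.5, **kwargs):
    """Indices of the lowest-scoring fraction of the unlabeled samples."""
    scores = bagging_pu_scores(positives, unlabeled, **kwargs)
    order = sorted(range(len(unlabeled)), key=scores.__getitem__)
    keep = max(1, int(len(unlabeled) * fraction))
    return sorted(order[:keep])

# Toy data (hypothetical): positives cluster near (1, 1).
P = [[1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
# Unlabeled: indices 0-1 are hidden positives, indices 2-5 are true negatives.
U = [[1.0, 0.95], [0.95, 1.05], [0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.05, 0.0]]
print(reliable_negatives(P, U, fraction=0.5, n_rounds=50, seed=0))
```

In a full pipeline the selected negatives and the known positives would form the training set for the stacked classifiers; any stronger base scorer could replace the centroid rule used here.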
Recent advances in deep-learning-based transformer language models have been applied to proteomics through Protein Language Models (PLMs) for protein sequence representation. The second system proposes a novel method called PTG-PLM to improve the performance of Post-Translational Modification (PTM) glycosylation and glycation site prediction. PTG-PLM employs Convolutional Neural Networks (CNNs) and embeddings extracted from six recent PLMs. The model is trained and evaluated on two public datasets for glycosylation and glycation site prediction. On the N-glycosylation testing dataset, PTG-PLM outperforms existing tools and supervised learning methods with 0.965 accuracy, 1.0 recall, 0.852 precision, 0.92 F1, 0.978 AUC, and 0.902 MCC. Additionally, on the glycation testing dataset, PTG-PLM outperforms existing tools and supervised learning methods with 0.64 accuracy, 0.67 recall, 0.64 precision, 0.65 F1, and 0.28 MCC.
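To make the combination of CNNs and PLM embeddings concrete, the following toy sketch shows a single 1-D convolution filter sliding over a per-residue embedding matrix, followed by ReLU and global max pooling. The abstract does not describe the PTG-PLM architecture, so the filter shape, the pooling choice, and all numbers below are assumptions; real PLM embeddings have hundreds of dimensions rather than two.

```python
def conv1d_relu(embeddings, kernel, bias=0.0):
    """Apply one filter of shape (width, dim) along the sequence axis, then ReLU.

    embeddings: list of per-residue vectors, shape (length, dim)
    kernel:     list of weight vectors, shape (width, dim)
    """
    width = len(kernel)
    feature_map = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        activation = bias + sum(
            w * x
            for k_row, e_row in zip(kernel, window)
            for w, x in zip(k_row, e_row)
        )
        feature_map.append(max(0.0, activation))
    return feature_map

def global_max_pool(feature_map):
    """Reduce a variable-length feature map to a single value."""
    return max(feature_map)

# Hypothetical 5-residue window with 2-dimensional embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 1.0]]
toy_kernel = [[1.0, -1.0], [0.5, 0.5]]  # filter width 2

feature_map = conv1d_relu(toy_embeddings, toy_kernel)
pooled = global_max_pool(feature_map)
print(feature_map, pooled)
```

A trained model would use many such filters with learned weights and pass the pooled values to a classification layer that outputs a site probability.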
The third system is developed to improve O-glycosylation site prediction using a transformer-based PLM and Machine Learning (ML). The dataset was collected and prepared from a recent data source called OGP (O-glycoproteins). The TAPE PLM was used for feature extraction from the peptide sequences using the embedding strategy. Then, a linear SVM was used for feature selection to retain informative features. The Extreme Gradient Boosting (XGBoost) ensemble-based ML method was used for classification and prediction. Compared with traditional machine learning methods, the proposed model achieved high performance, with 0.776 accuracy, 0.739 sensitivity, 0.813 specificity, 0.83 AUC, and 0.554 MCC. On an independent dataset, the proposed method performed better than the latest existing methods for predicting O-glycosylation sites.
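The reported metrics can be cross-checked with a short calculation. The sketch below assumes the evaluation set is balanced (equal numbers of positive and negative sites), which the abstract does not state; `pos_fraction` and the function name are illustrative. Under that assumption, the reported sensitivity and specificity imply an accuracy of about 0.776 and an MCC of about 0.554, matching the reported figures.

```python
import math

def metrics_from_rates(sensitivity, specificity, pos_fraction=0.5):
    """Derive accuracy and MCC from sensitivity and specificity.

    Confusion-matrix cells are expressed as fractions of the whole test set.
    pos_fraction is the assumed share of positive samples (0.5 = balanced).
    """
    tp = sensitivity * pos_fraction
    fn = (1 - sensitivity) * pos_fraction
    tn = specificity * (1 - pos_fraction)
    fp = (1 - specificity) * (1 - pos_fraction)
    accuracy = tp + tn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc

accuracy, mcc = metrics_from_rates(0.739, 0.813)
print(round(accuracy, 3), round(mcc, 3))
```

With a different class ratio the same sensitivity and specificity would yield different accuracy and MCC values, so the agreement is only suggestive of a balanced test set.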
The fourth system, named GNNGLY, is proposed for glycan classification using graph neural networks. First, glycans are represented as molecular graphs, where atoms are nodes and bonds are edges. Graph Convolutional Networks (GCNs) are then used to make predictions at eight taxonomic classification levels and for the immunogenicity property. Results indicate that this approach outperforms traditional machine learning methods and performs competitively with existing tools for glycan classification.
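The atoms-as-nodes, bonds-as-edges representation can be sketched with one graph convolution layer in the widely used form H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W). This is a generic GCN layer, not GNNGLY itself: the three-atom fragment, the one-hot element features, and the identity weight matrix are assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def normalized_adjacency(n_nodes, edges):
    """Return D^(-1/2) (A + I) D^(-1/2) for an undirected graph."""
    a = [[1.0 if i == j else 0.0 for j in range(n_nodes)] for i in range(n_nodes)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]  # degrees including self-loops
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n_nodes)]
            for i in range(n_nodes)]

def matmul(x, y):
    """Plain matrix product of two nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]

def gcn_layer(a_hat, h, w):
    """One propagation step: aggregate neighbor features, transform, apply ReLU."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

def mean_readout(h):
    """Average node embeddings into one graph-level vector."""
    n = len(h)
    return [sum(col) / n for col in zip(*h)]

# Hypothetical fragment C-O-H: nodes are atoms, edges are bonds.
atoms = ["C", "O", "H"]
bonds = [(0, 1), (1, 2)]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot by element
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # identity, for clarity

a_hat = normalized_adjacency(len(atoms), bonds)
node_embeddings = gcn_layer(a_hat, features, weights)
graph_embedding = mean_readout(node_embeddings)
print(node_embeddings, graph_embedding)
```

In a classifier, several such layers with learned weights would be stacked, and the graph-level vector would feed an output layer for each prediction task, such as a taxonomic level or immunogenicity.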
The integration of AI techniques in glycan analysis has enormous potential to enhance our understanding of glycans and their role in biological processes and diseases, leading to significant advancements in the field.