Title
A Robust Audio-Visual Speech Recognition using Improved Features
Author
Saudi, Ali Salih Mahmoud.
Thesis Committee
Researcher / Ali Salih Mahmoud Saudi
Supervisor / Hazem Mahmoud Abbas
Supervisor / Mahmoud Ibrahim Khalil
Examiner / Mohsen Abd-Elrazek Rashwan
Publication Date
2019.
Number of Pages
160 p.
Language
English
Degree
Doctorate
Specialization
Systems and Control Engineering
Approval Date
1/1/2019
Place of Approval
Ain Shams University - Faculty of Engineering - Department of Computer and Systems Engineering
Abstract

This research investigates the enhancement of a speech recognition system that uses both audio and visual speech information in noisy environments, presenting contributions in two main system stages: the front-end and the back-end. The double use of Gabor filters is proposed as a feature extractor in the front-end stage of both modules to capture robust spectro-temporal features. The Gabor features simulate the underlying processing chains of the Primary Auditory Cortex (PAC) and the Primary Visual Cortex (PVC); we refer to the resulting audio and visual features as GAF and GVF, respectively. The performance of GAF and GVF is compared to that of traditional features such as the MFCC, PLP, and RASTA-PLP audio features and the DCT2 visual features. The experimental results show that a system utilizing GAF and GVF attains 98.89% and 69.23% recognition accuracy, respectively, significantly outperforming the traditional audio and visual features, especially in low Signal-to-Noise Ratio (SNR) scenarios.
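As a rough illustration of the spectro-temporal Gabor idea described above, the following sketch filters a log-mel spectrogram with a small bank of 2-D Gabor kernels and keeps the magnitude responses. This is a minimal NumPy sketch: the kernel size, Hann envelope, modulation frequencies, and the `conv2_same` helper are illustrative assumptions, not the thesis's actual filter-bank design.

```python
import numpy as np

def conv2_same(x, k):
    """'Same'-size 2-D convolution via zero-padded FFTs (helper assumed
    here; any same-mode 2-D convolution would do)."""
    ns, nt = x.shape
    ks, kt = k.shape
    shape = (ns + ks - 1, nt + kt - 1)
    full = np.fft.ifft2(np.fft.fft2(x, s=shape) * np.fft.fft2(k, s=shape))
    s0, t0 = (ks - 1) // 2, (kt - 1) // 2
    return full[s0:s0 + ns, t0:t0 + nt]

def gabor_kernel(omega_s, omega_t, size=(15, 15)):
    """2-D spectro-temporal Gabor kernel: a complex sinusoid with
    spectral modulation omega_s and temporal modulation omega_t
    under a separable Hann envelope."""
    ns, nt = size
    S, T = np.meshgrid(np.arange(ns) - ns // 2,
                       np.arange(nt) - nt // 2, indexing="ij")
    envelope = np.outer(np.hanning(ns), np.hanning(nt))
    carrier = np.exp(1j * (omega_s * S + omega_t * T))
    return envelope * carrier

def gabor_features(spectrogram, modulations):
    """Filter a log-mel spectrogram (freq x time) with each Gabor
    kernel and stack the magnitude responses per filter."""
    feats = [np.abs(conv2_same(spectrogram, gabor_kernel(ws, wt)))
             for (ws, wt) in modulations]
    return np.stack(feats)  # shape: (n_filters, freq, time)
```

The same machinery applies to the visual stream by treating the mouth-region image sequence (or its DCT plane) as the 2-D input.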
To improve the back-end stage, a complete framework of a synchronous Multi-Stream Hidden Markov Model (MSHMM) is used to solve the dynamic stream-weight estimation problem for Audio-Visual Speech Recognition (AVSR). To demonstrate the usefulness of dynamic weighting in the overall performance of the AVSR system, we empirically show the advantage of Late Integration (LI) over Early Integration (EI), especially when one of the modalities is corrupted. The results confirm that the proposed AVSR-LI model, which utilizes the dynamic weighting scheme, outperforms the AVSR-EI model by a large margin, improving the average recognition accuracy from 90.65% to 92.83%, a relative improvement of approximately 23.33%.
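The core of synchronous multi-stream decoding is a per-frame weighted combination of the audio and visual state log-likelihoods. The sketch below shows that combination plus one hypothetical reliability-to-weight mapping (a sigmoid of an estimated per-frame audio SNR); the thesis's actual reliability measures and weight estimator are not reproduced here.

```python
import numpy as np

def dynamic_stream_weight(snr_db, midpoint=0.0, slope=0.5):
    """Hypothetical mapping from a per-frame audio-reliability score
    (here an SNR estimate in dB) to a stream weight in (0, 1):
    higher estimated SNR -> more trust in the audio stream."""
    return 1.0 / (1.0 + np.exp(-slope * (np.asarray(snr_db) - midpoint)))

def fused_log_likelihoods(log_b_audio, log_b_video, lam):
    """Frame-wise weighted fusion of per-state stream log-likelihoods,
    as used in synchronous MSHMM decoding:
        log b_j(o_t) = lam_t * log b_j(o_t^a) + (1 - lam_t) * log b_j(o_t^v)
    log_b_audio, log_b_video: (T, n_states); lam: (T,)"""
    lam = np.asarray(lam)[:, None]
    return lam * log_b_audio + (1.0 - lam) * log_b_video
```

With `lam` fixed at a constant, this reduces to ordinary static stream weighting; the dynamic case re-estimates `lam` every frame from the reliability measure.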
Prompted by the great achievements of deep learning in solving AVSR problems, we propose a deep AVSR model based on a Long Short-Term Memory Bidirectional Recurrent Neural Network (LSTM-BRNN). The proposed deep AVSR model utilizes Gabor filters in both the audio and visual front-ends with the LI scheme; this model is termed the Gabor LSTM-BRNNav-LI model. The experimental results show that the deep Gabor (LSTM-BRNN)-based model achieves superior performance compared to the (GMM-HMM)-based models that utilize the same front-ends. The proposed Gabor LSTM-BRNNav-LI model outperforms the Gabor HMMav-LI model by a large margin, improving the average recognition accuracy from 92.83% to 94.15%, a relative improvement of approximately 18.39%. Furthermore, the use of GAF and GVF in both the audio and visual front-ends attains a significant performance improvement over the traditional audio and visual features.
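To make the bidirectional recurrence concrete, here is a minimal NumPy forward pass of a single BiLSTM layer: one LSTM scans the feature sequence left-to-right, another right-to-left, and their hidden states are concatenated per frame. Layer sizes, initialization, and the absence of training code are all simplifications; this is not the thesis's actual network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell with input, forget, cell, and output gates."""
    def __init__(self, n_in, n_hid, rng):
        scale = 1.0 / np.sqrt(n_in + n_hid)
        self.W = rng.uniform(-scale, scale, (4 * n_hid, n_in + n_hid))
        self.b = np.zeros(4 * n_hid)
        self.n_hid = n_hid

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def bilstm(cell_fw, cell_bw, xs):
    """Run one bidirectional LSTM layer over xs (T, n_in) and return
    per-frame concatenated forward/backward states (T, 2 * n_hid)."""
    T = len(xs)
    h = np.zeros(cell_fw.n_hid); c = np.zeros_like(h)
    fwd = []
    for t in range(T):                      # forward pass in time
        h, c = cell_fw.step(xs[t], h, c)
        fwd.append(h)
    h = np.zeros(cell_bw.n_hid); c = np.zeros_like(h)
    bwd = [None] * T
    for t in reversed(range(T)):            # backward pass in time
        h, c = cell_bw.step(xs[t], h, c)
        bwd[t] = h
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])
```

In an LI configuration, one such stack per modality produces per-frame stream scores that are then fused, rather than concatenating GAF and GVF at the input as EI would.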
To demonstrate the effect of dynamic weighting on improving AVSR performance in low-SNR scenarios, we propose a set of experimental comparisons between the LI and EI schemes. The results confirm that the proposed Gabor LSTM-BRNNav-LI model, which utilizes the dynamic weighting scheme, outperforms the Gabor LSTM-BRNNav-EI model by a large margin, improving the average recognition accuracy from 92.18% to 94.15%, a relative improvement of approximately 25.19%. All of these models were trained and tested using clean and noisy recordings from the CUAVE corpus.
Keywords:
Gabor Filters, Visual Feature Extraction, Audio-Visual Speech Recognition, Synchronous Multi-Stream Hidden Markov Model, Audio-Visual Integration, Stream Weight, Reliability Measures, Audio-Visual Databases, Bidirectional Recurrent Neural Network.