Title
Design and Implementation of a Stand-alone Voice
Recognition System /
Author
Abd El Tawab, Mohammed Ramadan Saady.
Committee
Researcher / Mohammed Ramadan Saady Abd El Tawab
Supervisor / Ashraf Shamseldin Yahia
Supervisor / Hatem M. El Borai
Examiner / El-Sayed A. El-Dahshan
Publication Date
2016.
Number of Pages
155 p.
Language
English
Degree
M.Sc.
Specialization
Physics and Astronomy
Approval Date
1/1/2016
Awarding Institution
Ain Shams University - Faculty of Science - Physics
Abstract

1.1 Introduction
It is a known fact that voice is a speaker-dependent feature that enables us to recognize friends over the phone [1]. Voice is a complicated signal, produced as a result of several transformations occurring at several different levels: semantic, linguistic, articulatory, and acoustic. Differences in these transformations appear as differences in the acoustic properties of the voice signal. Speaker-related differences are a result of a combination of anatomical differences inherent in the vocal tract and the learned speaking habits of different individuals. In speaker recognition, all these differences can be used to discriminate between speakers [2].
Answering the question "who is the speaker?" is the task of speaker recognition [3, 4]. Speaker recognition is the process of identifying a speaker based on his or her voice [5, 6, 7]. Any Speaker Recognition System (SRS) performs its task in two phases: an enrollment (training) phase and a testing phase. During the enrollment phase, a speaker model is created from the speaker's voice features. During the testing phase, a claim about the identity of an unknown speaker is evaluated using both the trained models and the characteristics of the given speech.
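The two phases above can be sketched with a deliberately minimal model, here a nearest-centroid classifier over toy feature vectors (the function names and 2-D "features" are illustrative only, not the method used in this thesis):

```python
import numpy as np

def enroll(samples_by_speaker):
    """Enrollment phase: build one model per speaker (here, simply
    the mean of that speaker's training feature vectors)."""
    return {spk: np.mean(feats, axis=0)
            for spk, feats in samples_by_speaker.items()}

def identify(models, features):
    """Testing phase: claim the identity whose model lies closest
    (Euclidean distance) to the unknown feature vector."""
    return min(models, key=lambda spk: np.linalg.norm(models[spk] - features))

# Toy 2-D "features" for two hypothetical speakers
models = enroll({
    "alice": np.array([[1.0, 1.0], [1.2, 0.8]]),
    "bob":   np.array([[5.0, 5.0], [4.8, 5.2]]),
})
print(identify(models, np.array([1.1, 0.9])))  # -> alice
```

A real SRS replaces the mean-vector model with a trained classifier and the 2-D toys with feature vectors extracted from speech.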
An SRS has several advantages over other biometric recognition systems. It is the only biometric system that processes acoustic information, so a wide range of qualities can be measured, from dialect and speaking style to pitch, spectral magnitudes, and formant frequencies, which makes it difficult to deceive with a fake voice. Most other biometric recognition systems are image based, so they can be deceived with a fake image [8]. An SRS does not impose restrictions on its users, whereas other biometric recognition systems impose some rules on theirs [9]: in a fingerprint recognition system, the user needs to place his hand on a hand rest for printing, which may expose the user to the transmission of germs and diseases from other users, since all users touch the same equipment [10]. In an iris or face recognition system, the user has to stand in a fixed position so that a photograph can be taken for iris or face features. In addition, these systems can be rendered useless if the epidermal tissue is damaged in some way.
An SRS does not require special hardware: it is designed for use with any standard computer with a good microphone attached. Other biometric recognition systems, by contrast, require proprietary hardware, such as the vendor's fingerprint sensor or iris-scanning equipment [8].
The performance of an SRS can be affected by several factors: misspoken or misread prompted phrases, emotional state, channel variations, sickness, and aging. Some of these factors are better addressed by means other than algorithms (e.g., better microphones). An SRS can be made somewhat robust against noise and channel variations [11, 12], ordinary human changes (e.g., time-of-day voice changes and minor head colds), and mimicry by humans and tape recorders [13]. To deal with the tape-recorder problem, some methods use a small set of key words, and each user is prompted to speak a given sequence of key words that is randomly chosen every time the system is used [14]. The system accepts the input utterance only when it decides that the registered speaker has uttered the prompted text, so impostors cannot predict the text they will be prompted to say. The system can also reject utterances whose text differs from the prompted text, even if uttered by the registered speaker [15].
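The randomized-prompt defense described in [14, 15] can be sketched as follows (a minimal illustration with hypothetical names; a real system would couple this with a speech recognizer and a speaker verifier):

```python
import random

KEYWORDS = ["zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine"]

def make_prompt(n_words=4, rng=random):
    """Pick a fresh random key-word sequence for each session, so a
    pre-recorded utterance is unlikely to match the prompt."""
    return rng.sample(KEYWORDS, n_words)

def accept(prompted, recognized_text, speaker_verified):
    """Accept only if the registered speaker uttered exactly the
    prompted text (both conditions of the scheme in [14, 15])."""
    return speaker_verified and recognized_text == prompted

prompt = make_prompt()
print(prompt)  # a different key-word sequence every run
```

Because the prompt changes every time, a tape recording of a previous session fails the text check even when the voice itself matches.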
Speaker recognition has been applied in various applications. Embedding recognition in a product allows a unique level of hands-free and intuitive user interaction. Speaker recognition makes it possible to verify the identity of persons accessing systems and to control access to services such as voice dialing, telephone banking, telephone shopping, database access services, information services, and voice mail.
1.2 Literature Review
A lot of research work has been done over the past few decades on developing and enhancing reliable SRSs. This section briefly reviews some of these works.
As the Fourier transform proved not especially useful for speaker recognition on real speech signals [16], some researchers have explored the use of wavelets to provide a richer feature space [17, 18, 19, 20, 21, 22, 23, 24]. Moreover, using wavelets for speaker recognition augments the original feature space with the wavelet coefficients and yields a smaller set of more robust features in the final classifier [25, 26]. Researchers' efforts therefore turned to wavelet techniques as a promising tool for parameterizing the speech signal.
Some researchers used wavelet techniques as a pre-processing stage for enhancing the capabilities of existing techniques. In [27], the author decomposed the speech signal with the wavelet transform into various frequency sub-bands and calculated the MFCCs for each sub-band to form a feature vector. In [28], the speech signal is decomposed by the wavelet transform, the Linear Predictive Cepstral Coefficients (LPCC) are calculated for the approximation channel, and the entropy is calculated for the detail channel. In [29], the author compared five feature extraction methods; the best method gave a high recognition rate when the feature set was formed by concatenating the MFCCs extracted from the original signal with those extracted from the wavelet-decomposed signal, which enhanced the performance of the MFCC-based method in the presence of noise or telephone degradation. In [30], the author showed that applying the DWT and WPT in the pre-processing stage for clean and noisy speech signals, respectively, increases the recognition rate.
Other researchers proposed intermediate methods based on both DWT and WPT ideas: in [31], the author proposed a method based on admissible wavelet packets, which gives the freedom to partition either the low-frequency band or the high-frequency band. In [32], the author presented an irregular decomposition method based on the WPT, in which not all low- and high-frequency bands are decomposed at each level, to improve the performance of a speaker identification system.
Many wavelet-based studies searched for a speech feature with the ability to discriminate between individuals. Previous studies showed that using an energy index as a feature in recognition tasks is effective. In [33], the author formed feature vectors based on a combination of three entropy types (sure, logarithmic energy, and norm). In [34], the author investigated a speaker identification system using adaptive wavelet sure entropy. In [35], the authors introduced a combination of a genetic algorithm and the wavelet packet transform for pathological assessment, computing energy features from a group of wavelet packet coefficients, and in [36], a robust speech recognition system for noisy environments was proposed using wavelet-based energy as a threshold for de-noising estimation. In [37], the author proposed a gender identification method based on the energy of speaker utterances: the speech signal is pre-processed, the processed signal is decomposed using the Wavelet Packet Transform (WPT) to form the feature vectors, and these are fed to a Feed Forward Back Propagation Neural Network (FFBPNN); the results showed that the proposed method is effective. In [32], a method to calculate the energy values of the wavelet coefficients in digital modulation recognition was proposed. In [38, 39], the author proposed a speaker identification system based on the energy index of sub-bands of the speech signal.
The previous studies showed that using the energy of a decomposed speech signal as a feature is efficient. In this study, the energy corresponding to each wavelet packet node is employed for text-independent speaker recognition on the English Language Speech Database for Speaker Recognition (ELSDSR).
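As background for the chapters that follow, the energy index of a wavelet sub-band is simply the sum of the squared coefficients in that band. A minimal sketch, using a single-level orthonormal Haar split rather than the db20 packet tree employed later in this thesis:

```python
import numpy as np

def haar_split(x):
    """One level of orthonormal Haar analysis: an average
    (approximation) and a difference (detail) channel, each half
    the input length."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def band_energies(x):
    """Energy index of each sub-band: sum of squared coefficients."""
    approx, detail = haar_split(x)
    return np.sum(approx**2), np.sum(detail**2)

# A signal whose first half is smooth and whose second half oscillates
x = np.array([1.0, 1.0, 1.0, 1.0, 1.0, -1.0, 1.0, -1.0])
print(band_energies(x))
```

Because the Haar filter pair is orthonormal, the two band energies sum to the total energy of the signal, so the energy index is a faithful partition of the signal's power across frequency bands.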
1.3 Motivation
The human brain can distinguish between two individuals based on stored data; this is called semi-automated recognition [40]. In contrast, fully automatic recognition is performed by a machine without human intervention. The automated recognition of a person's identity must depend on a strong verification code.
In the past, the automated recognition of a person's identity depended on passwords, strings of letters, numbers, or a mixture of both, and these spread so widely that we must keep a large number of them in our heads. Bill Gates said at the 2007 RSA Conference: "Passwords are not only weak; passwords have a huge problem... If you get more and more of them, the worse it is." Among these problems are passwords being forgotten or stolen. In e-commerce, millions are lost due to the development of fraud techniques and the weakness of password-based recognition. In 2006, nearly $2 billion was lost in e-commerce sales due to security concerns of U.S. adults, according to Gartner estimates. If every online purchase depended on some form of biometric authentication to manage the person's digital identity, only the account holder would be able to finalize a purchase [41]. So there is an ever-growing need for a recognition system based on a biometric feature, to reduce the crime rate and make our lives safer.
This is what motivated researchers to build a recognition system based on a biometric feature, able to distinguish between individuals with high efficiency. Because it is practically impossible for two persons to have the same biometric features, automated biometric recognition systems have emerged as the most promising option for determining individuals' identity, instead of systems based on passwords, smart cards, keys, and so forth [10]. Biometric features can be physical, such as fingerprints, hand geometry, and retinal pattern, or behavioral, such as handwriting and voice. The best-known commercialized form of voice recognition is speaker recognition [8]. It has spread widely because your voice cannot be forgotten or misplaced, unlike knowledge-based (e.g., password) or possession-based (e.g., key) security systems [6]. More than fifty years of progress and development in the speaker recognition area have produced several efforts in response to the increasing needs of our technologically oriented way of life [40].
Why all this interest in building an Automated Speaker Recognition System (ASRS) with high efficiency? If a complete ASRS were produced, a very natural human-machine interface would be achieved. By natural, one means something intuitive and easy to use: a method that does not require special tools or machines, only the natural capabilities that every human possesses. Such a system could be used by any person able to speak and would allow an even broader use of machines, specifically computers [42]. Because of the ubiquitous telephone network and the microphones bundled with computers, the cost of an ASRS might be for software only. As a result, an ASRS is the most economical method for solving the problems of unauthorized use of computer and communication systems and of multilevel access control. It would thus be a high-security system that can be applied in many organizations, such as banks, universities, and any private or public organization requiring a high degree of security, and our lives would become more secure.
1.4 Problem Statement
Even though many speaker recognition tasks have reached satisfactory performance levels in specific applications, and although a variety of commercial products were launched in the last decade, many problems remain an open research area, and definitive solutions have not been found yet [43]. As the main idea behind an ASRS is to extract some features from a speaker's voice to be used later as a reference database, the main problem in such a system is which, and how many, features must be used in classification.
There are several tools for feature extraction, such as the Fourier transform and the Short-Time Fourier Transform (STFT), but they may not be suitable for representing speech: these methods assume the signal is stationary within a given time frame and may therefore lack the ability to analyze localized events correctly. The wavelet transforms cope with some of these problems [44]. The Discrete Wavelet Transform (DWT) is one of the wavelet transforms used in the area of speaker recognition, and it could be a potential feature extraction tool for speaker recognition.
In this thesis, the Wavelet Packet Transform (WPT) is applied for extracting features, because the WPT can segment the frequency axis and has uniform translation in time; an FFBPNN is then used to train on and classify the feature vectors of the speakers.
1.5 Thesis Contribution
In this thesis, I have tried to enhance the efficiency of an existing ASRS [39]. I have used the English Language Speech Database for Speaker Recognition (ELSDSR), which consists of 23 speakers with different ages, speaking habits, and educational backgrounds.
In the feature extraction step of the training stage, I took 7 sentences of the same text from each speaker as samples, pre-processed them, filtered the processed samples using the WPT at level 7 with a Daubechies 20-tap wavelet (db20), and then calculated the energy corresponding to each WPT node to form a feature vector (1×128) for each sentence. Thus, for each speaker we have a feature vector matrix of size 7×128.
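The feature extraction step above can be sketched as follows. For brevity this sketch uses the orthonormal Haar filter pair instead of db20, but the structure is the same: a full wavelet packet tree of depth 7 yields 2^7 = 128 terminal nodes, and the energy of each node gives one entry of the 1×128 feature vector (the function names are illustrative):

```python
import numpy as np

def haar_split(x):
    """One Haar analysis step: approximation and detail channels."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2), (even - odd) / np.sqrt(2)

def wpt_node_energies(x, level):
    """Full wavelet packet tree of the given depth: both channels of
    every node are split again, giving 2**level terminal nodes; the
    feature is the energy (sum of squares) of each node."""
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(level):
        nodes = [c for n in nodes for c in haar_split(n)]
    return np.array([np.sum(n**2) for n in nodes])

# Any signal of at least 2**7 samples yields the 1x128 feature vector
signal = np.random.randn(4096)
features = wpt_node_energies(signal, level=7)
print(features.shape)  # (128,)
```

Stacking the vectors of the 7 training sentences per speaker then gives the 7×128 matrix described above; a db20 implementation would only change the filter pair inside `haar_split`, plus the boundary handling that longer filters require.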
In the matching step, a feature vector matrix of size 128×161 for all speakers and a suitable target matrix are fed to the FFBPNN as input and target, respectively. I ran the algorithm several times to reach the least error, guided by the confusion matrix. After reaching the best result, the FFBPNN is saved with its weight and bias values.
In the testing stage, the last two steps are repeated, but with changes in the feature extraction step: I took one sentence for each speaker, whose text completely differs from that used in the training stage. Using the WPT at level 7 with the Daubechies 20-tap wavelet (db20), the energy corresponding to each node is calculated to form a feature vector of size 1×128 for each speaker.
In the matching step, a feature vector matrix of size 128×23 for all speakers is fed to the FFBPNN as input, and running the algorithm yields an output roughly equal to the target matrix. Based on the confusion matrix, the proposed system achieved a recognition rate of 95.7%, and 100% when two testing sentences per speaker were used.
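The recognition rate reported above is read off the diagonal of the confusion matrix. A minimal sketch of that computation (the arrays of true and predicted speaker indices are illustrative, not the thesis data):

```python
import numpy as np

def recognition_rate(true_ids, predicted_ids, n_speakers):
    """Build the confusion matrix (rows: true speaker, columns:
    predicted speaker); the recognition rate is the fraction of
    trials that land on the diagonal."""
    cm = np.zeros((n_speakers, n_speakers), dtype=int)
    for t, p in zip(true_ids, predicted_ids):
        cm[t, p] += 1
    return cm, np.trace(cm) / cm.sum()

# 23 test trials with one misclassification: 22/23 = 95.7 %
true_ids = list(range(23))
pred_ids = list(range(23))
pred_ids[5] = 6  # one speaker confused with another
cm, rate = recognition_rate(true_ids, pred_ids, 23)
print(round(100 * rate, 1))  # 95.7
```

With 23 speakers and one test sentence each, a single misclassification reproduces the 95.7% figure, which is why adding a second test sentence per speaker can lift the rate to 100%.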
The experiments were carried out on a computer with an Intel® Core™ 2 Duo CPU at 2 GHz, 2 GB of RAM, the Windows 7 32-bit operating system, and MATLAB R2009b.
1.6 Thesis Outline
The first few chapters of this thesis provide some essential background and a summary of speaker recognition techniques, the wavelet transform, and neural networks:
• Chapter 2 reviews the field of speaker recognition: the classification and fundamentals of speaker recognition systems, the popular speaker recognition techniques, and some applications of speaker recognition.
• Chapter 3 reviews some of the pre-processing techniques that are performed before the feature extraction step.
• Chapter 4 reviews the Discrete Wavelet Transform (DWT): the historical development of wavelets, a brief summary of Fourier analysis, an overview of wavelet theory, some of the most popular wavelet families, and some applications of wavelets.
• Chapter 5 reviews the field of Artificial Neural Networks (ANN): a brief introduction to ANNs, the history and development of neural networks, different structures, learning algorithms and types of transfer functions, the backpropagation algorithm, and the importance of neural networks and their applications in real life.
• Chapter 6 presents the experimental results of this thesis.
• Chapter 7 introduces an application of the speaker recognition process using a speaker recognition circuit.
• Chapter 8 presents the conclusion of this thesis and suggestions for future work.