Analysis and Voice Recognition in Indonesian Language Using MFCC and SVM Method

Voice recognition is a biometric technology. The voice is a distinctive human trait that allows one individual to be distinguished from another, and it also carries information such as the gender, emotion, and identity of the speaker. This research records human voices pronouncing the digits 0 to 9, with and without noise. Features are extracted from the recordings using Mel Frequency Cepstral Coefficients (MFCC). The mean, standard deviation, maximum, minimum, and combinations of them are used to construct the feature vectors, which are then classified using a Support Vector Machine (SVM). Two classification models are built: one based on the speaker and the other based on the digit pronounced. Both models are validated with 10-fold cross-validation. The best average accuracy across the two classification models is 91.83%, achieved using Mean + Standard deviation + Min + Max as features.


INTRODUCTION
The human voice contains a lot of information, such as the gender, emotion, and identity of the speaker (Lindasalwa et al., 2010). The purpose of voice recognition is to identify the speaker or the words pronounced by the individual (Yee & Ahmad, 2008). Many techniques have been proposed to reduce the mismatch between testing and training environments. Most of these methods operate in the spectral domain (Lockwood & Boudy, 1992; Rosenberg, Lee, & Soong, 1994) or the cepstral domain.
Gracieth et al. (2014) implemented a support vector machine (SVM) for automated spoken digit recognition. The digits were limited to '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' in Portuguese. The features were extracted using Mel Frequency Cepstral Coefficients. A Discrete Cosine Transform (DCT) was used to produce a two-dimensional matrix that became the input of the SVM. The study reported excellent classification for every digit except '9'; digits '1' to '8' had the best accuracy. The mean and variance were chosen as the features. Fokoué and Ma (2013) demonstrated that the combination of MFCC and SVM is an effective tool for identifying the sex of the speaker. The RBF and polynomial kernels gave accurate results in cross-validation. MFCC, however, needs more computation time because of the complexity of its calculations. Putra and Resmawan (2011) classified gender based on speech in Bahasa. They also used MFCC for feature extraction, with DTW as the classification method. They collected speech from 27 men and eight women; each person spoke five words and repeated them seven times. For the evaluation, the authors used 7-fold cross-validation. The best accuracy was 93.254% and the worst was 59.664%.
ComTech Vol. 7 No. 2 June 2016: 131-139
This paper discusses voice recognition of the digits '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' in Indonesian. The human voice is converted into a digital signal, producing digital data that represent the level of the signal at each point in time. The digital sound is then processed using MFCC to extract voice features. After that, a Support Vector Machine (SVM) is used as the classification method to determine which features and combinations of features give the lowest error. The validation process uses 10-fold cross-validation. The paper is organized as follows: background research, the principles of voice recognition, the methodology, the results, and the conclusions.
After voice input is captured from the speaker with a microphone, the sound is analyzed. The system design involves manipulation of the audio signal. The operations applied to the input signal are pre-emphasis, framing, windowing, Mel cepstrum analysis, and recognition of the spoken words. The voice recognition algorithm includes two distinct phases, shown in Figure 1: the first is the training phase and the second is the testing phase.

Figure 1 Voice Recognition Algorithms. Training phase: each speaker provides samples of their voice so that a reference template model can be built. Testing phase: the input test voice is matched against the stored reference template model and a recognition decision is made.
Mel Frequency Cepstral Coefficients (MFCC) is one of the most popular feature extraction techniques used in voice recognition, and it works in the frequency domain. MFCC uses the Mel scale, which is based on the scale of the human ear. As frequency-domain features, MFCCs are considered much more accurate than time-domain features. The simplicity and ease of the procedures used to implement MFCC make it the most favored technique for speech recognition.
MFCC takes into account the sensitivity of human perception to frequency, which makes it well suited to voice recognition. Figure 2 shows the steps used in MFCC. In the pre-emphasis block, the voice signal is filtered with a high-pass filter. Pre-emphasis improves the voice signal and compensates for the part of the signal that was suppressed during voice production. The pre-emphasized signal is then segmented into frames, with an optional overlap of 1/3 to 1/2 of the frame size. This step is important for good results because the variation of amplitude is greater in larger signals than in smaller ones. Each frame is then multiplied by a Hamming window to keep continuity between the first and last points of the frame. Next, the signal is converted into the frequency domain using the Fast Fourier Transform. The output of the Fast Fourier Transform block is multiplied by triangular band-pass filters to obtain the log energy of each filter.
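The steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the implementation used in this research: the pre-emphasis coefficient 0.97, the frame size, the number of filters, and the number of coefficients are illustrative values; a naive DFT stands in for the FFT; the Mel mapping uses the common 2595 log10(1 + f/700) variant; and the final cosine transform uses a common normalization.

```python
import cmath
import math


def pre_emphasis(signal, alpha=0.97):
    """High-pass filter y[n] = x[n] - alpha * x[n-1]; alpha is an assumed value."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]


def frame_signal(signal, frame_size, overlap):
    """Split the signal into frames; consecutive frames share `overlap` samples."""
    step = frame_size - overlap
    return [signal[s:s + frame_size]
            for s in range(0, len(signal) - frame_size + 1, step)]


def hamming(frame):
    """Taper both ends of the frame with a Hamming window."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]


def power_spectrum(frame):
    """Naive DFT power for bins 0..N/2 (an FFT gives the same values faster)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))) ** 2 / n
            for k in range(n // 2 + 1)]


def hz_to_mel(f):
    """Assumed Mel mapping (a common variant, not the paper's c-based form)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_log_energies(power, sample_rate, n_filters):
    """Log energy collected by each triangular, Mel-spaced band-pass filter."""
    top = hz_to_mel(sample_rate / 2)
    # n_filters + 2 edge frequencies, equally spaced on the Mel scale
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = (sample_rate / 2) / (len(power) - 1)
    energies = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        total = 0.0
        for k, p in enumerate(power):
            f = k * bin_hz
            if lo < f <= mid:        # rising slope of the triangle
                total += p * (f - lo) / (mid - lo)
            elif mid < f < hi:       # falling slope of the triangle
                total += p * (hi - f) / (hi - mid)
        energies.append(math.log(total + 1e-10))  # small offset avoids log(0)
    return energies


def cosine_transform(log_energies, n_coeffs):
    """Discrete cosine transform of the log energies (coefficients 1..n_coeffs)."""
    big_k = len(log_energies)
    return [math.sqrt(2.0 / big_k)
            * sum(e * math.cos(n * (k + 0.5) * math.pi / big_k)
                  for k, e in enumerate(log_energies))
            for n in range(1, n_coeffs + 1)]


def mfcc(signal, sample_rate, frame_size=64, overlap=32, n_filters=12, n_coeffs=8):
    """Return one list of MFCCs per frame (overlap here is 1/2 of the frame)."""
    frames = frame_signal(pre_emphasis(signal), frame_size, overlap)
    return [cosine_transform(
                mel_log_energies(power_spectrum(hamming(f)), sample_rate, n_filters),
                n_coeffs)
            for f in frames]


# Synthetic 440 Hz tone sampled at 8 kHz, used only to exercise the pipeline.
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(256)]
features = mfcc(tone, sample_rate=8000)
print(len(features), "frames x", len(features[0]), "coefficients")
```

Each recording therefore becomes a matrix with one row per frame and one column per coefficient, which is the form from which summary statistics can later be taken.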
MFCC is defined as follows. $F_{mel}$ is a logarithmic scale of the normal frequency scale $f$, controlled by a constant $c$. Mel-cepstral features [2], represented by MFCCs, are calculated from the Fast Fourier Transform (FFT) power coefficients, which are filtered by a triangular band-pass filter bank. When $c$ is in the range of 250-350, the number of triangular filters that fall in the frequency range of 200-1200 Hz (the range in which audio information is dominant) is higher than for other values of $c$. It is therefore efficient to set $c$ in this range when calculating MFCCs. With the outputs of the filter bank denoted $S_k$ ($k = 1, 2, \ldots, K$), the MFCCs are calculated from these outputs.
Support Vector Machine (SVM) is a statistical machine learning technique that has been applied successfully in pattern recognition. The SVM classification method is based on the Structural Risk Minimization principle from computational learning theory.
Consider first data that can be separated linearly. The data are denoted $x_i \in \mathbb{R}^d$ and the class labels $y_i \in \{+1, -1\}$ for $i = 1, 2, \ldots, n$, where $n$ is the number of data. SVM looks for the best hyperplane separating the data according to their class by measuring the margin of the hyperplane and looking for the biggest margin. The margin is the distance between the hyperplane and the nearest data from each class. The subset of the data set at this nearest distance is called the support vectors. Classes $-1$ and $+1$ can be completely separated by a hyperplane in dimension $d$, defined by

$$w \cdot x + b = 0 \tag{1}$$

Data $x_i$ in class $-1$ (negative samples) satisfy the inequality $w \cdot x_i + b \le -1$ for $y_i = -1$, while data $x_i$ in class $+1$ (positive samples) satisfy $w \cdot x_i + b \ge +1$ for $y_i = +1$. Here $w$ is the normal of the plane and $b$ is the position of the plane relative to the origin. The margin is $1 / \lVert w \rVert$, so the maximum margin is obtained when $\lVert w \rVert$ is minimal. Finding the biggest margin can therefore be formulated as the constrained optimization problem

$$\min_{w} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) - 1 \ge 0, \quad i = 1, 2, \ldots, n.$$

One method for solving constrained optimization problems is the method of Lagrange multipliers. The problem can thus be formulated as

$$L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] \quad \text{subject to} \quad \alpha_i \ge 0.$$

This primal problem is then converted into the dual problem

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j \quad \text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

Solving the dual problem gives the values $\alpha_i \ge 0$. The value of $w$ is then obtained by

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i \tag{9}$$

Data for which $\alpha_i$ is greater than zero are called support vectors. Once the support vectors are known, the value of $b$ can be obtained from any support vector $x_s$ with label $y_s$, for which $y_s (w \cdot x_s + b) = 1$:

$$b = y_s - w \cdot x_s \tag{10}$$

With $w$ and $b$ known, the hyperplane equation (1) is obtained. The data can then be classified into class $+1$ or $-1$ by

$$f(x) = \operatorname{sign}(w \cdot x + b).$$

The SVM formulation for linearly separable data cannot be used for data that are not linearly separable. In that case, the best hyperplane can be found by transforming the data from the input space into a feature space through a mapping $\Phi$, so that the data can be separated linearly in the feature space. The dimension of the data in the feature space is higher than in the input space, which can make computation in the feature space very expensive. This problem can be solved with the kernel trick, in which a kernel function $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ replaces the dot product, so the transformation $\Phi$ does not need to be known explicitly. Kernel functions that are often used are the Linear Kernel, the Polynomial Kernel (dimension $D$), and the Radial Basis Function (RBF) Kernel.

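The equation of the Linear Kernel is

$$K(x_i, x_j) = x_i \cdot x_j$$

Next, the following is the equation of the Polynomial Kernel (dimension $D$):

$$K(x_i, x_j) = (x_i \cdot x_j + 1)^D$$

The equation of the Radial Basis Function (RBF) Kernel is

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

The variable $\gamma$ is a hyperparameter.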
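As a rough illustration of how the kernels enter the classifier, the sketch below writes the three kernels in commonly used forms and evaluates the dual-form decision function $f(x) = \operatorname{sign}(\sum_i \alpha_i y_i K(x_i, x) + b)$. The constant 1 in the polynomial kernel, the default `gamma`, the function names, and the two-point toy problem are assumptions made for the example and do not come from the experiments in this paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, degree=2):
    # Assumed form (u . v + 1)^D; other offsets and scalings are also in use.
    return (dot(u, v) + 1.0) ** degree


def rbf_kernel(u, v, gamma=0.5):
    # gamma is the hyperparameter that controls the width of the kernel.
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def weight_vector(support_vectors, labels, alphas):
    """w = sum_i alpha_i * y_i * x_i (only meaningful for the linear kernel)."""
    dim = len(support_vectors[0])
    return [sum(a * y * sv[d] for sv, y, a in zip(support_vectors, labels, alphas))
            for d in range(dim)]


def decide(x, support_vectors, labels, alphas, b, kernel=linear_kernel):
    """Dual-form decision: sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    score = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1


# Toy problem: one point per class. Solving the dual by hand gives
# alpha = 0.25 for both points and b = 0, hence w = (0.5, 0.5).
svs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.25, 0.25]
b = 0.0

print(weight_vector(svs, ys, alphas))
print(decide((2.0, 0.5), svs, ys, alphas, b))
print(decide((-3.0, 1.0), svs, ys, alphas, b))
```

In the toy problem both training points lie exactly on the margin, since $y_i (w \cdot x_i + b) = 1$ for each, and the margin $1 / \lVert w \rVert = \sqrt{2}$ equals their distance from the separating line.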
Cross-validation is a method for assessing the accuracy and validity of a statistical model. The available dataset is divided into two parts. The first part is used to build the model (Payam, Lei, & Huan, 2009). The model built from the first part is then used to predict the values in the second part. A valid model should show good prediction accuracy.
The procedure of cross-validation is as follows. First, the data are divided into three sets: training, testing, and validation.
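The 10-fold variant used in this research can be sketched as follows: the data are shuffled and split into k folds, each fold is held out once while a model is trained on the remaining folds, and the k accuracies are averaged. The function names and the nearest-mean classifier are hypothetical stand-ins chosen to keep the example self-contained; in the experiments the classifier is an SVM.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, labels, train_fn, predict_fn, k=10, seed=0):
    """Average accuracy over k rounds, holding out one fold per round."""
    accuracies = []
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i]
                      for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def train_nearest_mean(xs, ys):
    """Stand-in classifier: remember the mean value of each class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: sum(values) / len(values) for y, values in groups.items()}


def predict_nearest_mean(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


# Two well-separated synthetic classes, 30 samples each.
xs = [0.01 * i for i in range(30)] + [10 + 0.01 * i for i in range(30)]
ys = ["a"] * 30 + ["b"] * 30
print(cross_validate(xs, ys, train_nearest_mean, predict_nearest_mean, k=10))
```

Because every sample is held out exactly once, the averaged accuracy uses all of the data for testing without ever testing a model on samples it was trained on.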

METHODS
This research is conducted in several stages, shown in Figure 6. Data are collected by recording the voices of 6 participants. Each participant pronounces the digits '0' to '9'. The recording process is repeated 5 times: 3 recordings without noise and 2 recordings with noise. In total there are 6 × 10 × 5 = 300 voice recordings. The voices are recorded using various devices, such as smartphones, PCs, and laptops, of different types and specifications. Each recording is saved with a sampling frequency of 44.1 kHz and a bit depth of 16 bits in the Wave Audio (WAV) file format, and is named by the digit pronounced, the recording order, and the participant name.
The best accuracy for classification based on the speaker is 87.00%, achieved using Mean + Standard deviation + Min as features. The worst accuracy for classification based on the speaker is 52.67%, obtained using Max as the feature. The best accuracy for classification based on the digit pronounced is 97.00%, achieved using Mean + Standard deviation + Min + Max as features. The worst accuracy for classification based on the digit pronounced is 60.67%, obtained using Standard deviation as the feature. The best average accuracy from both classifications is 91.83%, achieved using Mean + Standard deviation + Min + Max as features. The worst average accuracy from both classifications is 60.17%, obtained using Max as the feature.
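One way to read feature combinations such as Mean + Standard deviation + Min + Max is as summary statistics taken over the frames of each MFCC coefficient, so that every recording yields a vector of fixed length regardless of its duration. The sketch below illustrates this under assumptions: the function name, the use of the population standard deviation, and the ordering of the statistics are illustrative choices, not details confirmed by this paper.

```python
import statistics

# Statistic name -> function applied to one coefficient across all frames.
# The population standard deviation is an assumed choice.
STATS = {
    "mean": statistics.mean,
    "std": statistics.pstdev,
    "min": min,
    "max": max,
}


def pooled_features(mfcc_frames, use=("mean", "std", "min", "max")):
    """Collapse a frames x coefficients matrix into one fixed-length vector."""
    columns = list(zip(*mfcc_frames))  # one tuple per MFCC coefficient
    return [STATS[name](column) for name in use for column in columns]


# Two frames with two coefficients each, used only as a tiny illustration.
frames = [[1.0, 2.0],
          [3.0, 6.0]]
print(pooled_features(frames))
print(pooled_features(frames, use=("mean", "std", "min")))
```

With four statistics the vector has four times as many entries as there are coefficients, and dropping a statistic from `use` gives the smaller combinations compared in the results.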

CONCLUSIONS
The experimental results show that the feature combination with the highest accuracy in classification based on the speaker is Mean + Standard Deviation + Min (87%). The feature combination with the highest accuracy in classification based on the digit spoken is Mean + Standard Deviation + Min + Max (97%). The best average result across both classifications is obtained by using the combination of Mean + Standard Deviation + Min + Max.

Figure 2 Block Diagram to Get the Coefficients MFCC

Figure 3 First Step of Cross Validation

Figure 4 Second Step of Cross Validation

Figure 5 Third Step of Cross Validation

Figure 6 Diagram of Research Methods

Table 2 Experiment Result