p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 14.

5px Helvetica}p.p2 {margin: 0.0px 0.0px 0.

0px 0.0px; font: 10.0px Helvetica}p.p3 {margin: 0.0px 0.0px 0.

0px 0.0px; font: 12.0px Helvetica}p.

p4 {margin: 0.0px 0.0px 0.0px 0.0px; font: 9.0px Helvetica}span.

s1 {font: 24.0px Helvetica}span.s2 {color: #002486}Overview of Three Dierent Structures of Artificial Neural Networks forSpeech Recognition s1736880AbstractAutomatic speech recognition (ASR) is the translation,through some methodologies, of humanspeech into text by machines and plays an importantrole nowadays. In this research reviewwe examine three dierent artificial neural networkarchitectures that are used in speech recognitionfield and we investigate their performancein dierent cases. We analyze the state-of-artdeep neural networks (DNNs), that have evolvedinto complex structures and they achieve significantresults in a variety of speech benchmarks.Afterward, we explain convolutional neural networks(CNNs) and we explore their potential inthis field. Finally, we present the recent researchin highway deep neural networks (HDNNs) thatseem to be more flexible for resource constrainedplatforms. Overall, we critically try to comparethese methods and show their strengths and limitations.

Each method has its benefits and applicationsand from them we try to draw someconclusions and give some potential future directions.I. IntroductionMachine Learning (ML) is a field of computer sciencethat gives the computers the ability to learn throughdierent algorithms and techniques without being programmed.

ASR is closely related with ML because it usesmethodologies and procedures of ML 1, 2, 3. ASR hasbeen around for decades but it was not until recently thatthere was a tremendous development because of the advancesin both machine learning methods and computerhardware. New ML techniques made speech recognitionaccurate enough to be useful outside of carefully controlledenvironments, so it could easily be deployed in many electronicdevices nowadays (i.e. computers, smart-phones)and be used in many applications such as identifying andauthenticating a user via of his/her voice.Speech is the most important mode of communicationbetween human beings and that is why from the early partof the previous century, eorts have been made in orderto make machines do what only humans could perceive.

Research has been conducted through the past five decadesand the main reason was the desire of making tasks automatedusing machines 2. Many motivations using dierenttheories such as probabilistic modeling and reasoning,pattern recognition and artificial neural networks aectedthe researchers and helped to advance ASR.The first single advance in the history of ASR occurredin the middle of 70’s with the introduction of theexpectation-maximization (EM) 4 algorithm for traininghidden Markov models (HMMs). The EM technique gavethe possibility to develop the first speech recognition systemsusing Gaussian mixture models (GMMs). Despite allthe advantages of the GMMs, they are not able to modeleciently data that lie on or near a nonlinear surface in thedata space (i.e. sphere).

This problem could be solved byartificial neural networks because they can capture thesenon-linearities in the data but the computer hardware ofthat era did not allow us to build complex neural networks.As a result, in the beginning most speech recognition systemswere based on HMMs. Later the neural network andhidden Markov model (NN/HMM) hybrid architecture 5was used for ASR systems. After 2000s and over the lastyears the improvement of computer hardware and the inventionof new machine learning algorithms made possible thetraining for DNNs. DNNs with many hidden layers havebeen shown to achieve comparable and sometimes muchbetter performance than GMMs in many dierent databases(with speech data) and in a range of applications 6.

Afterthe huge success of DNNs, researchers try other artificialneural architectures such as recurrent neural networks withlong short-term memory units (LSTM-RNNs) 7, deepbelief networks and CNNs, and it seems that each one ofthem has its benefits and weaknesses.In this literature review we present three types of artificialneural networks (DNNs, CNNs, and HDNNs). Weanalyze each method, we explain how they are used fortraining and what are their advantages and disadvantages.Finally we compare these methods in the context of ASR,identifying where each one of them is more suitable andwhat are their limitations. Finally, we draw some conclusionsfrom these comparisons and we carefully suggestsome probable future directions.II. Methods A. Deep Neural NetworksDNNs are feed-forward artificial neural networks withmore than one layer of hidden units.

Each hiddenlayer has a number of units (or neurons) each of whichInformatics Research Review (s1736880)takes all outputs of the lower layer as input and passes themthrough a linearity. After that we apply a non linear activationfunction (i.e. sigmoid function, hyperbolic tangentfunction, some kind of rectified linear unit function (ReLU8, 9), or exponential linear unit function (ELU 10)) forthe final transformation of our initial inputs. Sometimes, fora multi-class classification problem, the posterior probabilityof each class can be estimated using an output softmaxlayer. For the training process of DNNs we usually use theback propagation technique 11.

For large training sets, itis typically more convenient to compute derivatives on amini-batch of the training set rather than the whole trainingset (this is called stochastic gradient descent). As cost functionwe often use the cross-entropy (CE) in order to have acomparison meter between the output of the network andthe actual output but the choice of the cost function actuallydepends on the case.The diculty to optimize DNNs with many hiddenlayers along with overfitting problem force us to use pretrainingmethods. One such a popular method is to usethe restricted Boltzmann machines (RBMs) 12. If weuse a stack of RBMs then we can construct a deep beliefnetwork (DBN) (you should not be confused with dynamicBayesian network). The purpose of this is to add an initialstage of generative pretraining.

The pretraining is veryimportant for DNNs because it reduces overfitting and italso reduces the time required for discriminative fine-tuningwith propagation.DNNs in the context of ASR play a major role. Manyarchitectures have been used by dierent research groups inorder to gain better and better accuracy in acoustic models.

You can see some methodologies in the article 6 that itpresents some significant results and shows that DNNs ingeneral achieve higher speech recognition accuracy thanGMMs on a variety of speech recognition benchmarks suchas TIMIT and some other large vocabulary environments.The main reason is that they take advantage from the factthat they can handle the non-linearities in the data and sothey can learn much better models comparing to GMMs.However, we have to mention that they use many model parametersin order to achieve a good enough speech accuracyand this is sometimes a drawback. Furthermore, they arecomplex enough and need many computational resources.

Finally, they have been criticized because they do not preservesome specific structure (we can use dierent structuresuntil we achieve a significant speech accuracy), theyare dicult to be interpreted (because they have not somespecific structure) and they possess limited adaptability (weuse dierent approaches for dierent cases). Besides allof these disadvantages they remain the state-of-the-art forspeech recognition the last few years and they have givenus the most reliable and consistent results overall. B. Convolutional Neural NetworksConvolutional neural networks (CNNs) can be regardedas DNNs with the main dierence that instead of usingfully connected hidden layers (as it happens in DNNs; fullconnection with all the possible combinations among thehidden layers) they use a special network structure, whichconsists of convolution and pooling layers 13, 14, 15. Basicrule is that the data have to be organized as a numberof feature maps in order to be passed in each convolutionallayer. One significant problem we have when we want totransform our speech data in feature maps concerns frequencybecause we are not able to use the conventionalmel-frequency cepstral coecient (MFCC) technique 16.The reason is that this technique does not preserve the localityof our data (in the case of CNNs), although we wantto preserve locality in both frequency and time.

Hence, asolution is the use of mel-frequency spectral coecients(MFSC features) 15.Our purpose with MFSC technique is to form the inputfeature maps without loosing the property of locality in ourdata. Then we can apply the convolution and pooling layerswith their respective operations to generate the activationsof the units in those layers.

We should mention that eachinput feature map is connected to many feature maps andthe feature maps share the weights. Thus, firstly, we usethe convolution operation to construct our convolutionallayers and afterwards, we apply the pooling layer in orderto reduce the resolution of the feature maps. This processcontinues depending on how deep we want to be our network(maybe we could achieve higher speech accuracy withmore layers on this structure or maybe not). You can seethe whole process and the usage of convolution and poolinglayers in the paper 15. Moreover, as it happens for DNNswith RBMs, there is a respective procedure CRBM 17 forCNNs that allow us pretraining our data in order to gainin speech accuracy and reduce the overfitting eect. In thepaper 15, the authors also examine the case of a CNNwith limited weight sharing for ASR (LWS model) and theypropose to pretrain it modifying the CRBM model.CNNs have three major properties: locality, weightsharing, and pooling. Each one of them has the potentialto improve speech recognition performance.

These propertiescan reduce the overfitting problem and they can addrobustness against non-white noise. In addition, they canreduce the number of network weights to be learned. Bothlocality and weight sharing are significant factors for theproperty of pooling which is very helpful in handling smallfrequency shifts that are common in speech signals. Theseshifts may occur from dierences in vocal tract lengthsamong dierent speakers 15. In general, CNNs seem tohave a relative better performance in ASR taking advantagefrom their special network structure. C. Highway Deep Neural NetworksHDNNs are depth-gated feed-forward neural networks18. They are distinguished from the conventionalDNNs for two main reasons.

Firstly they use much lessmodel parameters and secondly they use two types of gatefunctions to facilitate the information flow through the hiddenlayers.Informatics Research Review (s1736880)HDNNs are multi-layer networks with many hiddenlayers. In each layer we have the transformation of theinitial input or of the previous hidden layer with the correspondingparameter of the current layer (they are combinedin a linear way) followed by a non-linear activation function(i.e.

sigmoid function). The output layer is parameterizedwith the parameter and we usually use the softmax functionas the output function in order to obtain the posterior probabilityof each class given our initial inputs. Afterwards,given the target outputs, the network is usually trained bygradient descent to minimize a loss function such as crossentropy(CE function).

So, we can see that the architectureand the process are the same as in DNNs that we describedin subsection of DNNs.The dierence from the standard DNNs is that highwaydeep neural networks (HDNNs) were proposed to enablevery deep networks to be trained by augmenting the hiddenlayers with gate functions 19. This augmentation happensthrough the transform and carry gate functions. The firstscales the original hidden activations and the latter scalesthe input before passing it directly to the next hidden layer18.Three main methods are presented for training, thesequence training, the adaptation technique and the teacherstudenttraining in the papers 18, 20, 21. Combining thesemethodologies with the two gates it is demonstrated howimportant role the carry and the transform gate play in thetraining.

The main reason is that the gates are responsibleto control the flow of the information among the hiddenlayers. They allow us to achieve comparable speech recognitionaccuracy to the classic DNNs but with much lessmodel parameters because we have the ability to handle thewhole network through the parameters of the gate functions(which are much less comparing to the parameters of thewhole network). This outcome is crucial for platforms suchas mobile devices (i.e. voice recognition on mobiles) due tothe fact that we have not many disposal resources in thesedevices. D. Comparison of the MethodsThese methods, that we presented, have their benefits andlimitations.

In general, DNNs behave very well and inmany cases they have enough better performance comparedto GMMs on a range of applications. The main reason isthat they take advantage from the fact that they can handlemuch better the non linearities in the data space. On theother hand, their biggest drawback compared with GMMsis that it is much harder to make good use of large clustermachines to train them on massive data 6.As far as the CNNs are concerned, they can handlefrequency shifts which are dicult to be handled withinother models such as GMMs and DNNs.

Furthermore, it isalso dicult to learn such an operation as max-pooling instandard artificial neural networks. Moreover, CNNs canhandle the temporal variability in the speech features aswell 15. On the other hand, the fine-tuning of the poolingsize (carefully selection of pooling size) is very importantbecause otherwise we may cause phonetic confusion, especiallyat segment boundaries. Despite the fact that CNNsseem to have better accuracy than DNNs with less parameters,computationally are more expensive because of thecomplexity of the convolution operation.HDNNs are considered to be more compact than regularDNNs due to the fact that they can achieve similarrecognition accuracy with many fewer model parameters.Furthermore, they are thinner than DNNs and this is becausethrough the gate functions we can control the behaviorof the whole network using only the parameters of thegates (which are much less comparing to the parameters ofthe whole network). Moreover, with HDNNs we can updateour whole model by simply updating the gate functionsusing adaptation data and in this way we can gain considerablyin speech recognition accuracy. Although they areconsidered useful for resource constrained platforms, theirfinal model parameters are still large enough.

So, we cannotconclude much for their general performance because theyare a recent proposal and it is needed more research to seetheir overall benefits and limitations. However, the mainidea is to use them in order to have comparable ASR accuracywith DNNs and simultaneously to reduce the modelparameters.III. ConclusionsOverall, we can say that DNNs are the state-of-thearttoday because they behave very well on a rangeof speech recognition benchmarks.

However, other architecturesof artificial neural networks such as CNNs haveachieved comparable performance in the context of ASR.Besides that, research continues to be conducted in thisfield in order to find new methods, learning techniquesand architectures that will allow us to train our data setsmore eciently. This means less parameters, less computationalpower, less complex models, more structured models.Ideally we would like to have a whole general model thatcovers a lot of cases and not many dierent models thatapplied in dierent circumstances.

On the other hand, thisis probably dicult, so just distinct methodologies and specifictechniques for dierent cases may be our temporary oreven more our unique solution. In this direction, HDNNs orother architectures may be used to deal with specific cases.Many future directions have been suggested the lastfew years for research in order to advance ASR. Someprobable suggestions are the use of unsupervised learningor reinforcement learning for acoustic models. Anotherpotential direction is to search for new architectures orspecial structures in artificial neural networks or inventingnew learning techniques and


I'm Erica!

Would you like to get a custom essay? How about receiving a customized one?

Check it out