Informatics Research Review
 s1736880
Abstract
Speech recognition is the translation, through some methodologies, of human speech into text by computers. In this research review we examine three different methods that are used in the speech recognition field and we investigate the accuracy they achieve on different data sets. We analyze state-of-the-art deep neural networks (DNNs), which have evolved into complex architectures and achieve significant results in many cases. Afterward, we explain convolutional neural networks (CNNs) and explore their potential in this field. Finally, we present the recent research on highway deep neural networks (HDNNs), which seem to be more flexible for resource-constrained platforms. Overall, we critically compare these methods and show their strengths and limitations. We conclude that each method has its advantages as well as its weaknesses, and that they are suited to different purposes.
I. Introduction
Machine Learning (ML) is a field of computer science that gives computers the ability to learn, through different algorithms and techniques, without being explicitly programmed. Automatic speech recognition (ASR) is closely related to ML because it uses ML methodologies and procedures [1, 2, 3]. ASR has been around for decades, but only recently has it seen tremendous development, owing to advances in both machine learning methods and computer hardware. New ML techniques made speech recognition accurate enough to be useful outside of carefully controlled environments, so that nowadays it can easily be deployed in many electronic devices (e.g. computers and smartphones).
Speech is the most important mode of communication between human beings, which is why, since the early part of the previous century, efforts have been made to make computers do what only humans could perceive. Research has been conducted throughout the past five decades, driven mainly by the desire to automate tasks using machines [2]. Ideas from different theories, such as probabilistic modeling and reasoning, pattern recognition, and artificial neural networks, influenced researchers and helped to advance ASR.
The first major advance in the history of ASR occurred in the mid-1970s with the introduction of the expectation-maximization (EM) algorithm [4] for training hidden Markov models (HMMs). The EM technique made it possible to develop the first speech recognition systems using Gaussian mixture models (GMMs). Despite all their advantages, GMMs are statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space. This problem could be solved by artificial neural networks, but the computer hardware of that era did not allow complex neural networks to be built. As a result, most speech recognition systems were based on HMMs and later used the neural network / hidden Markov model (NN/HMM) hybrid architecture, first investigated in the early 1990s [5]. After the 2000s, improvements in computer hardware and the invention of new machine learning algorithms made it possible to train DNNs. DNNs with many hidden layers have been shown to outperform GMMs on a variety of speech recognition benchmarks [6]. Other, more complex neural architectures, such as recurrent neural networks with long short-term memory units (LSTM-RNNs) [7] and CNNs, seem to have their own benefits and applications.
In this literature review we present three types of artificial neural networks (DNNs, CNNs, and HDNNs). For each method, we explain how it is trained and what its advantages and disadvantages are. We then compare these methods, identifying where each is most suitable and what their limitations are in the context of ASR. Finally, we draw some conclusions from these comparisons and carefully suggest some probable future directions.
II. Methods
 A. Deep Neural Networks
DNNs are feed-forward artificial neural networks with more than one layer of hidden units. Each hidden layer has a number of units (or neurons), each of which takes all outputs of the lower layer as input, multiplies them by a weight vector, sums the result and passes it through a non-linear activation function (e.g. the sigmoid function, the hyperbolic tangent function, some kind of rectified linear unit function (ReLU [8, 9]), or the exponential linear unit function (ELU [10])). For a multi-class classification problem, the posterior probability of each class can be estimated using an output softmax layer. DNNs can be discriminatively trained by back-propagating derivatives of a cost function that measures the discrepancy between the target outputs and the actual outputs. For large training sets, it is typically more efficient to compute derivatives on a mini-batch of the training cases rather than the whole training set (this is called stochastic gradient descent). The cross-entropy (CE) is often used as the cost function, but this depends on the case.
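To make this training setup concrete, we sketch below a small feed-forward acoustic model with a softmax output, trained by mini-batch stochastic gradient descent on a cross-entropy loss. The sketch assumes PyTorch as the framework, and the feature dimension, layer sizes and number of output states are illustrative choices rather than values taken from the papers reviewed here.

import torch
import torch.nn as nn

class DNNAcousticModel(nn.Module):
    """Feed-forward DNN: each hidden layer applies an affine transform followed
    by a non-linearity; the output layer gives scores over the target classes."""
    def __init__(self, feat_dim=440, num_states=2000, hidden_dim=1024, num_layers=5):
        super().__init__()
        dims = [feat_dim] + [hidden_dim] * num_layers
        self.hidden = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1]) for i in range(num_layers)])
        self.output = nn.Linear(hidden_dim, num_states)

    def forward(self, x):
        for layer in self.hidden:
            x = torch.relu(layer(x))       # ReLU activation (sigmoid, tanh or ELU also possible)
        return self.output(x)              # unnormalised scores; softmax is applied in the loss

# Mini-batch training with stochastic gradient descent and the cross-entropy cost.
model = DNNAcousticModel()
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()          # combines log-softmax and cross-entropy

features = torch.randn(256, 440)           # one mini-batch of acoustic feature vectors (dummy data)
targets = torch.randint(0, 2000, (256,))   # frame-level state labels (dummy data)

optimiser.zero_grad()
loss = criterion(model(features), targets)
loss.backward()                            # back-propagate derivatives of the cost
optimiser.step()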
The difficulty of optimizing DNNs with many hidden layers, along with the problem of overfitting, forces us to use pretraining methods. One popular method is the restricted Boltzmann machine (RBM), as the authors describe in the overview paper [6]. A stack of RBMs can be used to construct a deep belief network (DBN) (not to be confused with a dynamic Bayesian network). The purpose of this is to add an initial stage of generative pretraining. Pretraining is very important for deep neural networks because it reduces overfitting and also reduces the time required for discriminative fine-tuning with back-propagation.
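To make the generative pretraining step more tangible, the sketch below implements a single contrastive-divergence (CD-1) update for one binary RBM; stacking several such layers and training them greedily yields a DBN. This is a minimal sketch with assumed dimensions and learning rate, not the exact recipe of [6].

import torch

def cd1_update(v0, W, b_vis, b_hid, lr=0.01):
    # One CD-1 step for a binary RBM; v0 has shape (batch, n_visible),
    # W has shape (n_visible, n_hidden).
    p_h0 = torch.sigmoid(v0 @ W + b_hid)          # positive phase: hidden probabilities
    h0 = torch.bernoulli(p_h0)                    # sample binary hidden states
    p_v1 = torch.sigmoid(h0 @ W.t() + b_vis)      # reconstruction of the visible units
    p_h1 = torch.sigmoid(p_v1 @ W + b_hid)        # hidden probabilities for the reconstruction
    batch = v0.size(0)
    W += lr * (v0.t() @ p_h0 - p_v1.t() @ p_h1) / batch   # approximate log-likelihood gradient
    b_vis += lr * (v0 - p_v1).mean(dim=0)
    b_hid += lr * (p_h0 - p_h1).mean(dim=0)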
DNNs play a major role in the context of ASR. Many architectures have been used by different research groups in order to gain better and better accuracy in acoustic models. Several such methodologies are presented in [6], which reports significant results and shows that DNNs outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. The main reason is that DNNs can learn much better models of data that lie on or near a non-linear manifold. However, we have to mention that they use many model parameters in order to achieve good enough recognition accuracy, which is sometimes a drawback. Furthermore, they are quite complex and need many computational resources. Finally, they have been criticized as lacking structure, being resistant to interpretation and possessing limited adaptability.
 B. Convolutional Neural Networks
Convolutional neural networks (CNNs) can be regarded as DNNs, with the main difference that instead of fully connected hidden layers (as in DNNs) they use a special network structure consisting of convolution and pooling layers [11, 12, 13]. A basic requirement is that the data have to be organized as a number of feature maps (CNNs were first used for image recognition) in order to be fed into the CNN. The first problem concerns frequency, because we are not able to use conventional mel-frequency cepstral coefficients (MFCCs). The reason is that the discrete cosine transform projects the spectral energies onto a new basis that may not preserve locality, whereas we want to preserve locality along both the frequency and time axes. Hence, a solution is the use of mel-frequency spectral coefficients (MFSC features) [13]. The use of the convolution process and the pooling layers is illustrated in the same paper [13]. The main purpose is to learn the weights that are shared among the convolutional layers. Moreover, just as RBMs are used to pretrain DNNs, there is a corresponding procedure for CNNs, the convolutional RBM (CRBM) [14], that allows pretraining. In [13] the authors also examine a CNN with limited weight sharing (LWS model) for ASR and propose to pretrain it by modifying the CRBM model.
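The sketch below illustrates such a convolutional acoustic model over MFSC feature maps: a convolution layer followed by max-pooling along the frequency axis, with fully connected layers on top. It assumes PyTorch, full weight sharing (not the LWS variant), and illustrative sizes (40 mel bands, an 11-frame context window, 2000 output states); the configurations in [13] differ in their details.

import torch
import torch.nn as nn

class CNNAcousticModel(nn.Module):
    # Convolution + max-pooling over the frequency axis of MFSC feature maps,
    # followed by fully connected layers.
    def __init__(self, num_states=2000, mel_bands=40, context=11):
        super().__init__()
        # Input: (batch, 1, mel_bands, context) -- one channel, frequency x time.
        self.conv = nn.Conv2d(1, 64, kernel_size=(8, 3), padding=(0, 1))
        self.pool = nn.MaxPool2d(kernel_size=(3, 1))     # pool along frequency only
        conv_freq = mel_bands - 8 + 1                    # frequency size after convolution
        flat = 64 * (conv_freq // 3) * context
        self.fc = nn.Sequential(nn.Linear(flat, 1024), nn.ReLU(),
                                nn.Linear(1024, num_states))

    def forward(self, x):
        x = torch.relu(self.conv(x))                     # local filters shared across frequency
        x = self.pool(x)                                 # tolerates small frequency shifts
        return self.fc(x.flatten(1))                     # per-frame scores over the states

# Example: a mini-batch of 32 context windows of MFSC features (dummy data).
model = CNNAcousticModel()
scores = model(torch.randn(32, 1, 40, 11))               # -> shape (32, 2000)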
The CNN has three key properties: locality, weight sharing, and pooling. Each of them has the potential to improve speech recognition performance. Locality matters because it adds robustness against non-white noise, where some frequency bands are cleaner than others, and it also reduces the number of network weights to be learned. Weight sharing is also important because it can improve model robustness and reduce overfitting, and it too reduces the number of weights that need to be learned. Both locality and weight sharing are needed for the property of pooling. Pooling is very helpful in handling the small frequency shifts that are common in speech signals; these shifts may result from differences in vocal tract length among different speakers. In general, CNNs seem to perform relatively better in ASR by taking advantage of their special network structure.
 C. Highway Deep Neural Networks
HDNNs are depth-gated feed-forward neural networks [15]. They are distinguished from conventional DNNs in two main ways: they use far fewer model parameters, and they use two types of gate functions to facilitate the flow of information through the different layers.
HDNNs are multi-layer networks with many hidden layers. In each layer, the input (or the output of the previous hidden layer) is transformed with the parameters of the current layer and passed through a non-linear activation function (e.g. the sigmoid function). The output layer has its own parameters, and we usually use the softmax function as the output function in order to obtain the posterior probability of each class given the input feature. Then, given the target labels, the network is usually trained by gradient descent to minimize a loss function such as the cross-entropy. This architecture is the same as in DNNs.
The difference from standard DNNs is that HDNNs were proposed to enable very deep networks to be trained by augmenting the hidden layers with gate functions [16]. These are the transform gate, which scales the original hidden activations, and the carry gate, which scales the input before passing it directly to the next hidden layer, as the authors describe in [15].
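A single highway layer can be sketched as follows. This is a minimal PyTorch illustration of the gating described above, with separately parameterised transform and carry gates and an arbitrary hidden dimension; it is not the exact parameterisation used in [15].

import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    # One highway layer: the transform gate scales the new hidden activation,
    # while the carry gate scales the layer input and passes it straight through.
    def __init__(self, dim):
        super().__init__()
        self.hidden = nn.Linear(dim, dim)      # ordinary hidden transformation
        self.gate_t = nn.Linear(dim, dim)      # transform gate parameters
        self.gate_c = nn.Linear(dim, dim)      # carry gate parameters

    def forward(self, x):
        h = torch.sigmoid(self.hidden(x))      # candidate hidden activation
        t = torch.sigmoid(self.gate_t(x))      # transform gate, values in (0, 1)
        c = torch.sigmoid(self.gate_c(x))      # carry gate, values in (0, 1)
        return t * h + c * x                   # gated mix of new and carried information

# Stacking such layers gives the body of a highway network; a softmax output
# layer and cross-entropy training are added exactly as for a standard DNN.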
In the same paper [15], three main training methods are presented: sequence training, an adaptation technique, and teacher-student training. Combining these methodologies with the two gates, the authors demonstrate how important a role the carry and transform gates play in training. They make it possible to achieve speech recognition accuracy comparable to that of classic DNNs, but with far fewer model parameters. This result is crucial for resource-constrained platforms such as mobile devices.
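As one concrete example of the teacher-student idea, a small student network can be trained to match the output posteriors of a larger teacher network. The sketch below shows a generic distillation loss of this kind, with an assumed temperature parameter; [15] should be consulted for the exact criterion used with HDNNs.

import torch.nn.functional as F

def teacher_student_loss(student_logits, teacher_logits, temperature=1.0):
    # KL divergence between the teacher's soft posteriors and the student's posteriors.
    t = temperature
    teacher_post = F.softmax(teacher_logits / t, dim=-1)       # soft targets from the teacher
    student_logp = F.log_softmax(student_logits / t, dim=-1)   # student log-posteriors
    return F.kl_div(student_logp, teacher_post, reduction='batchmean') * (t * t)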
 D. Comparison of the Methods
The methods we have presented each have their benefits and weaknesses. In general, DNNs outperform GMMs on a variety of speech recognition benchmarks. The main reason is that they can learn much better models of data that lie on or near a non-linear manifold. On the other hand, the biggest drawback of DNNs compared with GMMs is that it is much harder to make good use of large clusters of machines to train them on massive data sets.
CNNs can handle frequency shifts that are difficult to handle with other models such as GMMs and DNNs. Furthermore, it is difficult to learn an operation such as max-pooling in a standard ANN. CNNs can also handle the temporal variability in the speech features. On the other hand, since the CNN is required to compute an output for each frame for decoding, the pooling or shift size may affect the fine resolution seen by higher layers of the CNN, and a large pooling size may affect the localization of state labels. This may cause phonetic confusion, especially at segment boundaries. Hence, a suitable pooling size must be chosen.
HDNNs are considered to be more compact than regular DNNs because they can achieve similar recognition accuracy with many fewer model parameters. Furthermore, they are more controllable than DNNs, because through the gate functions we can control the behavior of the whole network using a very small number of model parameters (the parameters of the gates). They are also more adaptable: the authors of [15] show that by simply updating the gate functions using adaptation data, considerable gains in speech recognition accuracy can be obtained. Although HDNNs obtain a comparable word error rate (WER) with much smaller acoustic models, the number of model parameters is still relatively large compared to the amount of data typically used for speaker adaptation. So this adaptation technique may be more applicable to domain adaptation, where the expected amount of adaptation data is larger.
III. Conclusion
In this review we presented three neural architectures used in ASR: DNNs, CNNs, and HDNNs. DNNs outperform GMMs on a variety of benchmarks but require many parameters and considerable computational resources; CNNs exploit locality, weight sharing, and pooling to gain robustness against noise and frequency shifts; and HDNNs achieve comparable accuracy with far fewer parameters, which makes them attractive for resource-constrained platforms. Each method therefore has its own strengths and weaknesses, and the most suitable choice depends on the purpose and the available resources.
References
[1] Li Deng and Xiao Li. Machine learning paradigms for speech recognition: An overview. IEEE Transactions on Audio, Speech, and Language Processing, 2013. doi: 10.1109/TASL.2013.2244083.
[2] Jayashree Padmanabhan and Melvin Jose Johnson Premkumar. Machine learning in automatic speech recognition: A survey. IETE Technical Review, 2015. doi: 10.1080/02564602.2015.1010611.
[3] Inge Gavat and Diana Militaru. New trends in machine learning for speech recognition. Acta Electrotehnica, 2016.
[4] Bing-Hwang Juang, S. Levinson, and M. Sondhi. Maximum likelihood estimation for multivariate mixture observations of Markov chains. IEEE Transactions on Information Theory, pages 307–309, 1986. doi: 10.1109/TIT.1986.1057145.
[5] Steve Renals, Nelson Morgan, Herve Bourlard, Michael Cohen, and Horacio Franco. Connectionist probability estimators in HMM speech recognition. IEEE Transactions on Speech and Audio Processing, 1994. doi: 10.1109/89.260359.
[6] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012. doi: 10.1109/MSP.2012.2205597.
[7] Hasim Sak, Andrew Senior, and Francoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. arXiv:1402.1128, 2014.
[8] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, pages 807–814, 2010.
[9] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, 2013.
[10] Djork-Arne Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[11] Honglak Lee, Yan Largman, Peter Pham, and Andrew Y. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Proc. Adv. Neural Inf. Process. Syst. 22, pages 1096–1104, 2009.
[12] Darren Hau and Ke Chen. Exploring hierarchical speech representations with a deep convolutional neural network. In Proceedings of UKCI, 2011.
[13] Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn, and Dong Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014. doi: 10.1109/TASLP.2014.2339736.
[14] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. 26th Annu. Int. Conf. Mach. Learn., pages 609–616, 2009.
[15] Liang Lu and Steve Renals. Small-footprint highway deep neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2017. doi: 10.1109/TASLP.2017.2698723.
[16] Rupesh Kumar Srivastava, Klaus Greff, and Jurgen Schmidhuber. Training very deep networks. In Proc. NIPS, 2015.
