
Relative Density Nets: A New Way to
Combine Backpropagation with HMM's

Andrew D. Brown
Department of Computer Science
University of Toronto
Toronto, Canada M5S 3G4
andy@cs.utoronto.ca

Geoffrey E. Hinton
Gatsby Unit, UCL
London, UK WC1N 3AR
hinton@gatsby.ucl.ac.uk
Abstract

Logistic units in the first hidden layer of a feedforward neural network compute the relative probability of a data point under two Gaussians. This leads us to consider substituting other density models. We present an architecture for performing discriminative learning of Hidden Markov Models using a network of many small HMM's. Experiments on speech data show it to be superior to the standard method of discriminatively training HMM's.
1 Introduction
A standard way of performing classification using a generative model is to divide the training cases into their respective classes and then train a set of class conditional models. This unsupervised approach to classification is appealing for two reasons. It is possible to reduce overfitting, because the model learns the class-conditional input densities P(x|c) rather than the input-conditional class probabilities P(c|x). Also, provided that the model density is a good match to the underlying data density, the decision provided by a probabilistic model is Bayes optimal. The problem with this unsupervised approach to using probabilistic models for classification is that, for reasons of computational efficiency and analytical convenience, very simple generative models are typically used and the optimality of the procedure no longer holds. For this reason it is usually advantageous to train a classifier discriminatively.

In this paper we will look specifically at the problem of learning HMM's for classifying speech sequences. It is an application area where the assumption that the HMM is the correct generative model for the data is inaccurate and discriminative methods of training have been successful. The first section will give an overview of current methods of discriminatively training HMM classifiers. We will then introduce a new type of multi-layer backpropagation network which takes better advantage of the HMM's for discrimination. Finally, we present some simulations comparing the two methods.
Figure 1: An Alphanet with one HMM per class. Each computes a score for the sequence and this feeds into a softmax output layer.
2 Alphanets and Discriminative Learning
The unsupervised way of using an HMM for classifying a collection of sequences is to use the Baum-Welch algorithm [1] to fit one HMM per class. Then new sequences are classified by computing the probability of a sequence under each model and assigning it to the one with the highest probability. Speech recognition is one of the commonest applications of HMM's, but unfortunately an HMM is a poor model of the speech production process. For this reason speech researchers have looked at the possibility of improving the performance of an HMM classifier by using information from negative examples, that is, examples drawn from classes other than the one which the HMM was meant to model. One way of doing this is to compute the mutual information between the class label and the data under the HMM density, and maximize that objective function [2].
It was later shown that this procedure could be viewed as a type of neural network (see Figure 1) in which the inputs to the network are the log-probability scores L(x_{1:T}|H) of the sequence under hidden Markov model H [3]. In such a model there is one HMM per class, and the output is a softmax non-linearity:

P(k | x_{1:T}, H_1, \ldots, H_K) = y_k = \frac{\exp(L(x_{1:T}|H_k))}{\sum_{j=1}^{K} \exp(L(x_{1:T}|H_j))}    (1)

Training this model by maximizing the log probability of correct classification leads to a classifier which will perform better than an equivalent HMM model trained solely in an unsupervised manner. Such an architecture has been termed an "Alphanet" because it may be implemented as a recurrent neural network which mimics the forward pass of the forward-backward algorithm.¹

¹ The results of the forward pass are the probabilities of the hidden states conditioned on the past observations, or "alphas" in standard HMM terminology.
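To make the Alphanet output concrete, here is a minimal sketch (ours, not the authors' code) that assumes the per-class log-likelihood scores L(x_{1:T}|H_k) have already been computed, however that is done, and simply applies the softmax of equation (1):

    import numpy as np

    def alphanet_posteriors(log_likes):
        # log_likes[k] is the HMM log-likelihood score L(x_{1:T} | H_k) for class k
        a = log_likes - np.max(log_likes)     # shift for numerical stability
        e = np.exp(a)
        return e / e.sum()                    # equation (1): softmax over class scores

    # hypothetical scores for three class HMMs
    print(alphanet_posteriors(np.array([-112.3, -108.9, -120.4])))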
3 Backpropagation Networks as Density Comparators
A multi-layer feedforward network is usually thought of as a flexible non-linear regression model, but if it uses the logistic function non-linearity in the hidden layer, there is an interesting interpretation of the operation performed by each hidden unit. Given a mixture of two Gaussians where we know the component priors P(G) and the component densities P(x|G), the posterior probability that Gaussian G_0 generated an observation x is a logistic function whose argument is the negative log-odds of the two classes [4]. This can clearly be seen by rearranging the expression for the posterior:

P(G_0 | x) = \frac{P(x|G_0)P(G_0)}{P(x|G_0)P(G_0) + P(x|G_1)P(G_1)}
           = \frac{1}{1 + \exp\left\{ -\log\frac{P(x|G_0)}{P(x|G_1)} - \log\frac{P(G_0)}{P(G_1)} \right\}}    (2)
If the class conditional densities in question are multivariate Gaussians,

P(x | G_k) = |2\pi\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}    (3)

with equal covariance matrices, \Sigma, then the posterior class probability may be written in this familiar form:

P(G_0 | x) = \frac{1}{1 + \exp\{-(x^T w + b)\}}    (4)

where

w = \Sigma^{-1}(\mu_0 - \mu_1)    (5)

b = \tfrac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0) + \log\frac{P(G_0)}{P(G_1)}    (6)

Thus, the multi-layer perceptron can be viewed as computing pairwise posteriors between Gaussians in the input space, and then combining these in the output layer to compute a decision.
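As a quick numerical check of this equivalence (our own sketch, not part of the paper; all numbers below are arbitrary), the posterior computed directly from the two Gaussian densities agrees with the logistic form of equations (4)-(6):

    import numpy as np

    def gauss_pdf(x, mu, Sigma):
        # multivariate Gaussian density, equation (3)
        d = x - mu
        return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / np.sqrt(np.linalg.det(2 * np.pi * Sigma))

    mu0, mu1 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
    Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])          # shared covariance
    p0, p1 = 0.4, 0.6                                   # component priors
    x = np.array([0.2, -0.7])

    # direct posterior, equation (2)
    post = p0 * gauss_pdf(x, mu0, Sigma) / (p0 * gauss_pdf(x, mu0, Sigma) + p1 * gauss_pdf(x, mu1, Sigma))

    # logistic form, equations (4)-(6)
    w = np.linalg.solve(Sigma, mu0 - mu1)
    b = 0.5 * (mu1 + mu0) @ np.linalg.solve(Sigma, mu1 - mu0) + np.log(p0 / p1)
    print(post, 1.0 / (1.0 + np.exp(-(x @ w + b))))     # the two values coincide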
4 A New Kind of Discriminative Net
This view of a feedforward network suggests variations in which other kinds of density models are used in place of Gaussians in the input space. In particular, instead of performing pairwise comparisons between Gaussians, the units in the first hidden layer can perform pairwise comparisons between the densities of an input sequence under M different HMM's. For a given sequence the log-probability under each HMM is computed and the difference in log-probability is used as input to the logistic hidden unit.² This is equivalent to computing the posterior responsibilities of a mixture of two HMM's with equal prior probabilities. In order to maximally leverage the information captured by the HMM's we use \binom{M}{2} hidden units so that all possible pairs are included. The output of a hidden unit h is given by

h_{(mn)} = \sigma\left( L(x_{1:T}|H_m) - L(x_{1:T}|H_n) \right)    (7)

where we have used (mn) as an index over the set, \binom{M}{2}, of all unordered pairs of the HMM's. The results of this hidden layer computation are then combined using a fully connected layer of free weights, W, and finally passed through a softmax function to make the final decision.
a_k = \sum_{(mn) \in \binom{M}{2}} w_{(mn)k} \, h_{(mn)}    (8)

P(k | x_{1:T}, H_1, \ldots, H_M) = p_k = \frac{\exp(a_k)}{\sum_{k'=1}^{K} \exp(a_{k'})}    (9)

² We take the time-averaged log-probability so that the scale of the inputs is independent of the length of the sequence.
Figure 2: A multi-layer density net with HMM's in the input layer. The hidden layer units perform all pairwise comparisons between the HMM's.
where we have used \sigma(\cdot) as shorthand for the logistic function, and p_k is the value of the kth output unit. The resulting architecture is shown in figure 2. Because each unit in the hidden layer takes as input the difference in log-probability of two HMM's, this can be thought of as a fixed layer of weights connecting each hidden unit to a pair of HMM's with weights of ±1.
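A minimal sketch of this forward pass (equations (7)-(9); our illustration, assuming the per-HMM log-likelihoods have already been computed and time-averaged, and that W has one row per unordered pair):

    import numpy as np
    from itertools import combinations

    def rdn_forward(log_likes, W):
        # log_likes[m] is the time-averaged L(x_{1:T} | H_m); W has shape (M choose 2, K)
        pairs = list(combinations(range(len(log_likes)), 2))
        # equation (7): logistic of the log-likelihood difference for each unordered pair
        h = np.array([1.0 / (1.0 + np.exp(-(log_likes[m] - log_likes[n]))) for m, n in pairs])
        a = h @ W                                   # equation (8)
        a -= a.max()                                # stability shift
        p = np.exp(a) / np.exp(a).sum()             # equation (9): softmax over the K classes
        return pairs, h, p

    # hypothetical example with M = 4 HMM's and K = 3 classes
    rng = np.random.default_rng(0)
    pairs, h, p = rdn_forward(rng.normal(size=4), rng.normal(size=(6, 3)))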
In contrast to the Alphanet, which allocates one HMM to model each class, this network does not require a one-to-one alignment between models and classes, and it gets maximum discriminative benefit from the HMM's by comparing all pairs. Another benefit of this architecture is that it allows us to use more HMM's than there are classes. The unsupervised approach to training HMM classifiers is problematic because it depends on the assumption that a single HMM is a good model of the data and, in the case of speech, this is a poor assumption. Training the classifier discriminatively alleviates this drawback, and the multi-layer classifier goes even further in this direction by allowing many HMM's to be used to learn the decision boundaries between the classes. The intuition here is that many small HMM's can be a far more efficient way to characterize sequences than one big HMM. When many small HMM's cooperate to generate sequences, the mutual information between different parts of generated sequences scales linearly with the number of HMM's and only logarithmically with the number of hidden nodes in each HMM [5].
5 Derivative Updates for a Relative Density Network
The learning algorithm for an RDN is just the backpropagation algorithm applied to the network architecture as defined in equations 7, 8 and 9. The output layer is a distribution over class memberships of the data point x_{1:T}, and this is parameterized as a softmax function. We minimize the cross-entropy loss function:

\ell = -\sum_{k=1}^{K} t_k \log p_k    (10)
where p_k is the value of the kth output unit and t_k is an indicator variable which is equal to 1 if k is the true class. Taking derivatives of this expression with respect to the inputs of the output units yields

\frac{\partial \ell}{\partial a_k} = p_k - t_k    (11)

\frac{\partial \ell}{\partial w_{(mn),k}} = \frac{\partial \ell}{\partial a_k} \frac{\partial a_k}{\partial w_{(mn),k}} = (p_k - t_k) \, h_{(mn)}    (12)
The derivative of the output of the (mn)th hidden unit with respect to the output of the ith HMM, L_i, is

\frac{\partial h_{(mn)}}{\partial L_i} = \sigma(L_m - L_n)\left(1 - \sigma(L_m - L_n)\right)(\delta_{im} - \delta_{in})    (13)

where (\delta_{im} - \delta_{in}) is an indicator which equals +1 if i = m, -1 if i = n, and zero otherwise. This derivative can be chained with the derivatives backpropagated from the output to the hidden layer.
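Putting equations (11)-(13) together, here is a sketch of the backward pass (again ours, matching the hypothetical rdn_forward above) that returns the gradient for the free weights W and the error signal delivered to each HMM's log-likelihood score:

    import numpy as np

    def rdn_backward(pairs, h, p, t, W, M):
        # pairs: list of (m, n) index pairs; h: hidden activations; p: class posteriors;
        # t: one-hot target vector; W: (len(pairs), K) weights; M: number of HMM's
        d_a = p - t                                  # equation (11)
        d_W = np.outer(h, d_a)                       # equation (12)
        d_h = W @ d_a                                # error backpropagated to the hidden layer
        d_L = np.zeros(M)
        for (m, n), dh, hu in zip(pairs, d_h, h):
            g = dh * hu * (1.0 - hu)                 # sigma'(L_m - L_n) = h(1 - h)
            d_L[m] += g                              # equation (13): +1 for HMM m
            d_L[n] -= g                              #                -1 for HMM n
        return d_W, d_L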
For the final step of the backpropagation procedure we need the derivative of the log-likelihood of each HMM with respect to its parameters. In the experiments we use HMM's with a single, axis-aligned, Gaussian output density per state. We use the following notation for the parameters:

A: a_{ij} is the transition probability from state i to state j
\pi: \pi_i is the initial state prior
\mu_i: mean vector for state i
v_i: vector of variances for state i
H: the set of HMM parameters \{A, \pi, \mu, v\}
We also use the variable s_t to represent the state of the HMM at time t. We make use of the property of all latent variable density models that the derivative of the log-likelihood is equal to the expected derivative of the joint log-likelihood under the posterior distribution. For an HMM this means that:

\frac{\partial L(x_{1:T}|H)}{\partial H} = \sum_{s_{1:T}} P(s_{1:T}|x_{1:T}, H) \, \frac{\partial}{\partial H} \log P(x_{1:T}, s_{1:T}|H)    (14)
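This identity is the usual latent-variable gradient trick; for completeness (our own one-line derivation, not in the original), it follows from differentiating the log of a sum over hidden state sequences:

\frac{\partial}{\partial H} \log P(x_{1:T}|H)
  = \frac{1}{P(x_{1:T}|H)} \sum_{s_{1:T}} \frac{\partial}{\partial H} P(x_{1:T}, s_{1:T}|H)
  = \sum_{s_{1:T}} \frac{P(x_{1:T}, s_{1:T}|H)}{P(x_{1:T}|H)} \frac{\partial}{\partial H} \log P(x_{1:T}, s_{1:T}|H)
  = \sum_{s_{1:T}} P(s_{1:T}|x_{1:T}, H) \frac{\partial}{\partial H} \log P(x_{1:T}, s_{1:T}|H).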
The expected joint log-likelihood of an HMM is:

\langle \log P(x_{1:T}, s_{1:T}|H) \rangle = \sum_i \langle \delta_{s_1,i} \rangle \log \pi_i + \sum_{t=2}^{T} \sum_{i,j} \langle \delta_{s_t,j} \delta_{s_{t-1},i} \rangle \log a_{ij}
    + \sum_{t=1}^{T} \sum_i \langle \delta_{s_t,i} \rangle \left[ -\sum_{d=1}^{D} \tfrac{1}{2} \log v_{i,d} - \sum_{d=1}^{D} \tfrac{1}{2} (x_{t,d} - \mu_{i,d})^2 / v_{i,d} \right] + \text{const}    (15)

where \langle \cdot \rangle denotes expectations under the posterior distribution, and \langle \delta_{s_t,i} \rangle and \langle \delta_{s_t,j} \delta_{s_{t-1},i} \rangle are the expected state occupancies and transitions under this distribution. All the necessary expectations are computed by the forward-backward algorithm. We could take derivatives with respect to this functional directly, but that would require doing constrained gradient descent on the probabilities and the variances. Instead, we reparameterize the model using a softmax basis for the probability vectors and an exponential basis for the variance parameters. This choice of basis allows us to do unconstrained optimization in the new basis. The new parameters are defined as follows:
a_{ij} = \frac{\exp(\theta^{(a)}_{ij})}{\sum_{j'} \exp(\theta^{(a)}_{ij'})}, \qquad \pi_i = \frac{\exp(\theta^{(\pi)}_i)}{\sum_{i'} \exp(\theta^{(\pi)}_{i'})}, \qquad v_{i,d} = \exp(\theta^{(v)}_{i,d})
This results in the following derivatives:

\frac{\partial L(x_{1:T}|H)}{\partial \theta^{(a)}_{ij}} = \sum_{t=2}^{T} \left( \langle \delta_{s_t,j} \delta_{s_{t-1},i} \rangle - \langle \delta_{s_{t-1},i} \rangle a_{ij} \right)    (16)

\frac{\partial L(x_{1:T}|H)}{\partial \theta^{(\pi)}_i} = \langle \delta_{s_1,i} \rangle - \pi_i    (17)

\frac{\partial L(x_{1:T}|H)}{\partial \mu_{i,d}} = \sum_{t=1}^{T} \langle \delta_{s_t,i} \rangle (x_{t,d} - \mu_{i,d}) / v_{i,d}    (18)

\frac{\partial L(x_{1:T}|H)}{\partial \theta^{(v)}_{i,d}} = \frac{1}{2} \sum_{t=1}^{T} \langle \delta_{s_t,i} \rangle \left[ (x_{t,d} - \mu_{i,d})^2 / v_{i,d} - 1 \right]    (19)

When chained with the error signal backpropagated from the output, these derivatives give us the direction in which to move the parameters of each HMM in order to increase the log probability of the correct classification of the sequence.
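Given the expected occupancies and transitions from a forward-backward pass, equations (16)-(19) reduce to a few array operations. A sketch (ours; the forward-backward computation itself is assumed, with the convention that xi[t, i, j] is the expected transition from state i at time t to state j at time t+1):

    import numpy as np

    def hmm_param_grads(x, gamma, xi, A, pi, mu, v):
        # x: (T, D) observations; gamma: (T, S) expected occupancies <delta_{s_t,i}>;
        # xi: (T-1, S, S) expected transitions; A: (S, S); pi: (S,); mu, v: (S, D)
        d_theta_a = xi.sum(axis=0) - gamma[:-1].sum(axis=0)[:, None] * A      # equation (16)
        d_theta_pi = gamma[0] - pi                                            # equation (17)
        diff = x[:, None, :] - mu[None, :, :]                                 # (T, S, D)
        d_mu = np.einsum('ts,tsd->sd', gamma, diff / v[None, :, :])           # equation (18)
        d_theta_v = 0.5 * np.einsum('ts,tsd->sd', gamma, diff ** 2 / v[None, :, :] - 1.0)  # equation (19)
        return d_theta_a, d_theta_pi, d_mu, d_theta_v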
6 Experiments
To evaluate the relative merits of the RDN, we compared it against an Alphanet on a speaker identification task. The data was taken from the CSLU 'Speaker Recognition' corpus. It consisted of 12 speakers uttering phrases consisting of 6 different sequences of connected digits recorded multiple times (48) over the course of 12 recording sessions. The data was pre-emphasized and Fourier transformed in 32ms frames at a frame rate of 10ms. It was then filtered using 24 bandpass, mel-frequency scaled filters. The log magnitude filter response was then used as the feature vector for the HMM's. This pre-processing reduced the data dimensionality while retaining its spectral structure.
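For concreteness, here is a rough sketch of the described front end (ours; the sampling rate, the pre-emphasis coefficient, and the use of librosa are assumptions, since the paper does not specify them):

    import numpy as np
    import librosa

    def log_mel_features(wave, sr=8000):
        # pre-emphasis, then 32 ms analysis frames every 10 ms, 24 mel-scaled filters, log magnitude
        emphasized = np.append(wave[0], wave[1:] - 0.97 * wave[:-1])
        mel = librosa.feature.melspectrogram(y=emphasized, sr=sr,
                                             n_fft=int(0.032 * sr), hop_length=int(0.010 * sr),
                                             n_mels=24, power=1.0)
        return np.log(mel + 1e-8).T      # one 24-dimensional feature vector per 10 ms frame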
While mel-cepstral coefficients are typically recommended for use with axis-aligned Gaussians, they destroy the spectral structure of the data, and we would like to allow for the possibility that, of the many HMM's, some will specialize on particular sub-bands of the frequency domain. They can do this by treating the variance as a measure of the importance of a particular frequency band, using large variances for unimportant bands and small ones for bands to which they pay particular attention.
We compared the RDN with an Alphanet and three other models which were implemented as controls. The first of these was a network with a similar architecture to the RDN (as shown in figure 2), except that instead of fixed connections of ±1, the hidden units have a set of adaptable weights to all M of the HMM's. We refer to this network as a comparative density net (CDN). A second control experiment used an architecture similar to a CDN without the hidden layer, i.e. there is a single layer of adaptable weights directly connecting the HMM's with the softmax output units. We label this architecture a CDN-1. The CDN-1 differs from the Alphanet in that each softmax output unit has adaptable connections to the HMM's and we can vary the number of HMM's, whereas the Alphanet has just one HMM per class directly connected to each softmax output unit. Finally, we implemented a version of a network similar to an Alphanet, but using a mixture of Gaussians as the input density model. The point of this comparison was to see if the HMM actually achieves a benefit from modelling the temporal aspects of the speaker recognition task.

In each experiment an RDN constructed out of a set of M 4-state HMM's was compared to the four other networks, all matched to have the same number of free parameters, except for the MoGnet. In the case of the MoGnet, we used the same number of Gaussian mixture models as HMM's in the Alphanet, each with the same number of hidden states. Thus, it has fewer parameters, because it is lacking the transition probabilities of the HMM. We ran the experiment four times with