Expectation Maximization for Hidden Markov Models:
Derived from First Principles

Jesse Hoey

December 15, 2000

Abstract
Expectation maximization is a popular technique for deriving MAP estimates of model parameters. A successful application is learning the parameters of hidden Markov models (HMMs). This note derives the Baum-Welch learning algorithm for HMMs from first principles.
1 Introduction
A hidden Markov model is a Bayesian network as shown (unrolled) in Figure 1. A set of hidden states $X$ generates a set of observed variables $Z$ at each of a set of times $t = 0,\ldots,T$. We refer to the set of states at all times as $X = \{X_0,\ldots,X_T\}$ and $Z = \{Z_0,\ldots,Z_T\}$. We assume for now that both the observations and the hidden variables are discrete valued, with $N_z$ and $N_x$ values each. At time $t$, $Z_t$ is assumed conditionally independent of all other variables at all other times given its parent $X_t$, and the usual Markovian independence assumption holds between the $X_t$s. The parameters of the model are threefold. First, the transition probabilities, $\Theta_{Xij} = P(x_{t,i}|x_{t-1,j})$, give the probability that the hidden state at time $t$ will be $X_t = i$, given that the state at the previous time
Figure 1: Hidden Markov model as a dynami Bayesian network.
step $t-1$ was $X_{t-1} = j$. Second, the starting probabilities, $\Pi_i = P(X_{0,i})$, are the probabilities that the system starts in state $X_0 = i$. Third, the emission probabilities, $\Theta_{Zij} = P(Z_{t,i}|X_{t,j})$, are the probability that observation $Z_t = i$ will be made while in state $X_t = j$. We refer to the set of all parameters as $\Theta = \{\Theta_X, \Theta_Z, \Pi\}$.
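In code, the three parameter sets are just two column-stochastic matrices and a vector. A minimal numpy sketch, with hypothetical sizes and uniform placeholder values:

```python
import numpy as np

# Hypothetical sizes: Nx hidden-state values, Nz observation values.
Nx, Nz = 3, 4
Theta_X = np.full((Nx, Nx), 1 / Nx)  # Theta_X[i, j] = P(X_t = i | X_{t-1} = j)
Theta_Z = np.full((Nz, Nx), 1 / Nz)  # Theta_Z[i, j] = P(Z_t = i | X_t = j)
Pi = np.full(Nx, 1 / Nx)             # Pi[i] = P(X_0 = i)

# Each column of Theta_X and Theta_Z, and Pi itself, sums to one.
assert np.allclose(Theta_X.sum(axis=0), 1.0)
assert np.allclose(Theta_Z.sum(axis=0), 1.0)
assert np.isclose(Pi.sum(), 1.0)
```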
Hidden Markov models have been extensively studied in the context of speech recognition [5], along with many other applications in areas such as vision [2, 4]. The focus of this note is learning the maximum a-posteriori estimates of the parameters, $\Theta$, given by the values which maximize $P(Z, \Theta)$. The standard technique is an application of the expectation-maximization (EM) algorithm of [1], which results in an efficient recursive estimation technique known as the Baum-Welch or forward-backward training procedure. This note will provide a simple derivation of this particular application of the EM algorithm. The first part (Section 2) borrows heavily from the derivation in [3], as concisely described by Thomas Minka (see his excellent set of tutorial notes at http://www-white.media.mit.edu/tpminka/papers/tutorial.html; the EM tutorial is at ftp://vismod.www.media.mit.edu/pub/tpminka/papers/minka-em-tut.ps.gz), and gives a general derivation of the expectation and maximization steps which can be applied to any model. This is followed in Section 3 by a derivation of the forward-backward equations assuming multinomial distributions and Dirichlet priors.

Figure 1: Hidden Markov model as a dynamic Bayesian network.
2 EM derivation
2.1 General EM
We wish to find the values of $\Theta$ which maximize

    f(\Theta) = P(Z, \Theta) = \int_X P(Z, X, \Theta),

where the integral is over the values of the hidden variables $X$ in the HMM. We lower bound the value of $f(\Theta)$ with a function $q(X)$:

    f(\Theta) = \int_X q(X)\,\frac{P(Z, X, \Theta)}{q(X)} \;\ge\; \prod_X \left(\frac{P(Z, X, \Theta)}{q(X)}\right)^{q(X)},    (1)

where the inequality is given by Jensen's inequality. Taking logarithms we get

    G(\Theta, q) = \int_X q(X) \log P(Z, X, \Theta) - \int_X q(X) \log q(X).    (2)

Then, at the current guess for $\Theta$, $\Theta'$, we choose $q$ to maximize $G$, so that $G$ touches $f$ at $\Theta'$. We use the constraint that $\int q(X) = 1$ and get

    q(X) = \frac{P(Z, X, \Theta)}{\int_X P(Z, X, \Theta)} = \frac{P(Z, X, \Theta)}{P(Z, \Theta)} = P(X|Z, \Theta)
The idea is then to:
1. Choose $q(X)$ to maximize the bound at the current guess, $\Theta'$. This just means getting $P(X|Z, \Theta')$ and is the "E" step of EM.

2. Maximize the bound over $\Theta$. This means maximizing (2) with the $q(X)$ derived in the "E" step. That is, we maximize

    \int_X P(X|Z, \Theta') \log P(Z, X, \Theta).    (3)
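Before specializing to HMMs, the two steps are worth seeing on a model simpler than an HMM. The sketch below uses a hypothetical two-component Bernoulli mixture (not from the note): the E step computes the posterior $q(X) = P(X|Z,\Theta')$ over the hidden component, the M step re-maximizes, and the lower bound guarantees the log-likelihood never decreases:

```python
import numpy as np

rng = np.random.default_rng(0)
# Observed coin flips from a hypothetical 2-component Bernoulli mixture.
comp = rng.random(500) < 0.3
z = (rng.random(500) < np.where(comp, 0.9, 0.2)).astype(float)

pi, p = 0.5, np.array([0.6, 0.4])  # initial guess Theta'

def loglik(pi, p):
    mix = pi * p[0]**z * (1 - p[0])**(1 - z) + (1 - pi) * p[1]**z * (1 - p[1])**(1 - z)
    return np.log(mix).sum()

prev = loglik(pi, p)
for _ in range(50):
    # E step: q = P(X = 0 | Z, Theta'), the posterior over the hidden component.
    w0 = pi * p[0]**z * (1 - p[0])**(1 - z)
    w1 = (1 - pi) * p[1]**z * (1 - p[1])**(1 - z)
    q = w0 / (w0 + w1)
    # M step: maximize the expected complete log-likelihood over Theta.
    pi = q.mean()
    p = np.clip([(q * z).sum() / q.sum(),
                 ((1 - q) * z).sum() / (1 - q).sum()], 1e-6, 1 - 1e-6)
    assert loglik(pi, p) >= prev - 1e-9  # monotone ascent from the lower bound
    prev = loglik(pi, p)
```

The assert inside the loop is the practical signature of EM: because each M step maximizes a bound that touches the likelihood at the current guess, the likelihood can only go up.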
2.2 Multinomial-Dirichlet EM
This section will specialize the above procedure for probability distributions in the multinomial family, with Dirichlet conjugate priors. We show the derivation for $\Theta_X$ in what follows; the other parameters can be similarly derived. Maximization of equation 3 can be performed by noting that there is a constraint on $\Theta_X$,

    \sum_i \Theta_{Xij} = 1,

which means that we want to solve

    \frac{\partial}{\partial \Theta_{Xij}} \left[ \int_X P(X|Z, \Theta') \log P(Z, X, \Theta) - \lambda \left( \sum_i \Theta_{Xij} - 1 \right) \right] = 0,

which is

    \int_X P(X|Z, \Theta') \, \frac{\partial}{\partial \Theta_{Xij}} \left[ \log P(Z, X|\Theta) P(\Theta) \right] = \lambda.    (4)

Assuming that the likelihood function of the completed data is given by a multinomial distribution,

    P(Z, X|\Theta) = \prod_{ij} \Theta_{Xij}^{N_{Xij}} \, \Theta_{Zij}^{N_{Zij}} \, \Pi_i^{N_i},

where $N_{Xij}$ is the number of times $X_t = i$ when $X_{t-1} = j$ in the (completed) data. Assuming further that the priors are given by a Dirichlet distribution,

    P(\Theta) = P(\Theta_X) P(\Theta_Z) P(\Pi) = Dir(\Theta_X|\alpha_X) \, Dir(\Theta_Z|\alpha_Z) \, Dir(\Pi|\alpha_\Pi),

then equation 4 becomes

    \int_X P(X|Z, \Theta') \, \frac{\partial}{\partial \Theta_{Xij}} \left[ \sum_{ij} (N_{Xij} + \alpha_{Xij}) \log \Theta_{Xij} + \sum_{ij} (N_{Zij} + \alpha_{Zij}) \log \Theta_{Zij} + \sum_i (N_i + \alpha_i) \log \Pi_i \right] = \lambda.
And therefore,

    \int_X P(X|Z, \Theta') \, \frac{N_{Xij} + \alpha_{Xij}}{\Theta_{Xij}} = \lambda,

so that

    \Theta_{Xij} = \frac{\int_X P(X|Z, \Theta') N_{Xij} + \alpha_{Xij}}{\lambda}
                 = \frac{\int_X P(X|Z, \Theta') N_{Xij} + \alpha_{Xij}}{\sum_i \left[ \int_X P(X|Z, \Theta') N_{Xij} + \alpha_{Xij} \right]}
                 = \frac{E_{P(X|Z,\Theta')}(N_{Xij}) + \alpha_{Xij}}{\sum_i \left[ E_{P(X|Z,\Theta')}(N_{Xij}) + \alpha_{Xij} \right]}.
So we see that the parameters can be updated by simply taking the expected counts, which form the sufficient statistics for the multinomial distribution.
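As a sketch, the update is literally "expected counts plus pseudo-counts, normalized". The expected counts below are hypothetical, as if produced by an E step, and the note's $+\alpha$ convention is kept as-is:

```python
import numpy as np

# Hypothetical expected transition counts E[N_Xij] from an E step
# (rows: current state i, columns: previous state j).
expected_N = np.array([[4.2, 0.5],
                       [1.8, 3.5]])
alpha_X = np.ones_like(expected_N)  # Dirichlet pseudo-counts alpha_Xij

# Theta_Xij = (E[N_Xij] + alpha_Xij) / sum_i (E[N_Xij] + alpha_Xij)
numer = expected_N + alpha_X
Theta_X = numer / numer.sum(axis=0, keepdims=True)

assert np.allclose(Theta_X.sum(axis=0), 1.0)  # each column is a distribution over i
```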
3 Baum-Welch Derivation
Our goal is then to find the expectations $E_{P(X|Z,\Theta')}(N_{Xij})$, which is

    E_{P(X|Z,\Theta')}(N_{Xij}) = \int_X P(X|Z, \Theta') \, N_{Xij}    (5)
    = \int_{X_0 \ldots X_T} P(X_0 \ldots X_T | Z_0 \ldots Z_T, \Theta') \sum_t \delta(X_{t,i}) \, \delta(X_{t-1,j}),    (6)

where

    \delta(X_{t,i}) = \begin{cases} 1 & \text{if } X_t = i \\ 0 & \text{otherwise} \end{cases}
The sum over $t$ can be taken outside the integrations in (6), and the integrations over $X_0 \ldots X_{t-2}, X_{t+1} \ldots X_T$ can be immediately performed, each giving factors of 1. Further, the integrations over $X_{t-1}$ and $X_t$ can be performed, since the $\delta$-functions simply pick out a particular value of these variables: $X_{t-1,j}$ and $X_{t,i}$. Thus, we are left with:

    E_{P(X|Z,\Theta')}(N_{Xij}) = \sum_{t=1}^{T} P(X_{t,i}, X_{t-1,j} | Z, \Theta'),    (7)
the expected number of times $X_t = i$ and $X_{t-1} = j$ given the data, $Z = Z_0 \ldots Z_T$, and the current model parameters, $\Theta'$. Factoring the term in the sum by splitting the data $Z$ up into two sets $Z_0 \ldots Z_{t-1}$ and $Z_t \ldots Z_T$ gives the Baum-Welch equations for updating the parameters of the HMM using expectation maximization (we leave out the $\Theta'$ upon which every term is conditioned):
    P(X_{t,i}, X_{t-1,j} | Z)
    = P(Z_t \ldots Z_T | X_{t,i}, X_{t-1,j}, Z_0 \ldots Z_{t-1}) \, P(X_{t,i}, X_{t-1,j}, Z_0 \ldots Z_{t-1})
    = P(Z_t \ldots Z_T | X_{t,i} X_{t-1,j}) \, P(X_{t,i} | X_{t-1,j} Z_0 \ldots Z_{t-1}) \, P(X_{t-1,j} | Z_0 \ldots Z_{t-1})
    = P(Z_t | X_{t,i} X_{t-1,j} Z_{t+1} \ldots Z_T) \, P(Z_{t+1} \ldots Z_T | X_{t,i} X_{t-1,j}) \, P(X_{t,i} | X_{t-1,j}) \, \frac{P(X_{t-1,j} Z_0 \ldots Z_{t-1})}{P(Z_0 \ldots Z_{t-1})}
    = P(Z_t | X_{t,i}) \, P(Z_{t+1} \ldots Z_T | X_{t,i}) \, P(X_{t,i} | X_{t-1,j}) \, P(X_{t-1,j} Z_0 \ldots Z_{t-1})
    = \Theta_{Z_t i} \, \beta_{t,i} \, \Theta_{Xij} \, \alpha_{t-1,j},

which is exactly the (unnormalized) equation (37) from [5]. We use $\Theta_{Z_t i}$, where the $Z_t$ represents the value of the observation at time $t$. We have left to evaluate the $\alpha$ and $\beta$ terms (the forward and backward variables, respectively). The alpha term is simply the joint probability of $X_t = j$ and all the observations up to and including time $t$. It can be expanded recursively by summing over the previous states, $X_{t-1}$, as follows:

    \alpha_{t,j} = P(X_{t,j} Z_0 \ldots Z_t)
    = P(Z_t | X_{t,j} Z_0 \ldots Z_{t-1}) \, P(X_{t,j} Z_0 \ldots Z_{t-1})
    = P(Z_t | X_{t,j}) \sum_k P(X_{t,j} X_{t-1,k} Z_0 \ldots Z_{t-1})
    = P(Z_t | X_{t,j}) \sum_k P(X_{t,j} | X_{t-1,k}) \, P(X_{t-1,k} Z_0 \ldots Z_{t-1})
    = \Theta_{Z_t j} \sum_k \Theta_{Xjk} \, \alpha_{t-1,k}
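The $\alpha$ recursion maps directly to a vectorized loop. A minimal sketch with hypothetical 2-state, 2-symbol parameters, using the layout Theta_X[i, j] = P(X_t = i | X_{t-1} = j) and Theta_Z[i, j] = P(Z_t = i | X_t = j), so each step is an elementwise product with a matrix-vector multiply:

```python
import numpy as np

def forward(z, Pi, Theta_X, Theta_Z):
    """alpha[t, j] = P(X_t = j, Z_0..Z_t): alpha_0 = Theta_Z[z_0] * Pi, and
    alpha_{t,j} = Theta_Z[z_t, j] * sum_k Theta_X[j, k] * alpha_{t-1,k}."""
    T, Nx = len(z), len(Pi)
    alpha = np.zeros((T, Nx))
    alpha[0] = Theta_Z[z[0]] * Pi
    for t in range(1, T):
        alpha[t] = Theta_Z[z[t]] * (Theta_X @ alpha[t - 1])
    return alpha

# Hypothetical 2-state, 2-symbol model.
Pi = np.array([0.6, 0.4])
Theta_X = np.array([[0.7, 0.4],   # Theta_X[i, j] = P(X_t = i | X_{t-1} = j)
                    [0.3, 0.6]])
Theta_Z = np.array([[0.9, 0.2],   # Theta_Z[i, j] = P(Z_t = i | X_t = j)
                    [0.1, 0.8]])
alpha = forward([0, 1, 0], Pi, Theta_X, Theta_Z)
# alpha[-1].sum() is the likelihood P(Z_0..Z_T).
```

(In practice the $\alpha$ vectors are rescaled at each step to avoid underflow on long sequences; that is omitted here for clarity.)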
Similarly, the beta term is the probability of all observations posterior to time $t$, given the state at time $t$, $X_{t,i}$. It can be expanded recursively by summing over all next states, $X_{t+1}$, as follows:

    \beta_{t,i} = P(Z_{t+1} \ldots Z_T | X_{t,i})
    = \sum_k P(Z_{t+1} | X_{t+1,k}) \, P(Z_{t+2} \ldots Z_T | X_{t+1,k}) \, P(X_{t+1,k} | X_{t,i})
    = \sum_k \Theta_{Z_{t+1} k} \, \beta_{t+1,k} \, \Theta_{Xki}
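The $\beta$ recursion is the mirror image, run backwards from a uniform $\beta_T$. A sketch under the same hypothetical parameter layout (Theta_X[i, j] = P(X_t = i | X_{t-1} = j), Theta_Z[i, j] = P(Z_t = i | X_t = j)):

```python
import numpy as np

def backward(z, Theta_X, Theta_Z):
    """beta[t, i] = P(Z_{t+1}..Z_T | X_t = i): beta_T is uniform (ones), and
    beta_{t,i} = sum_k Theta_Z[z_{t+1}, k] * beta_{t+1,k} * Theta_X[k, i]."""
    T, Nx = len(z), Theta_X.shape[0]
    beta = np.ones((T, Nx))
    for t in range(T - 2, -1, -1):
        beta[t] = Theta_X.T @ (Theta_Z[z[t + 1]] * beta[t + 1])
    return beta

# Hypothetical 2-state, 2-symbol model.
Theta_X = np.array([[0.7, 0.4],
                    [0.3, 0.6]])
Theta_Z = np.array([[0.9, 0.2],
                    [0.1, 0.8]])
beta = backward([0, 1, 0], Theta_X, Theta_Z)
```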
The terms $\alpha_0$ and $\beta_T$ must be evaluated separately. The $\beta_{T,i}$ is initialized evenly, and the $\alpha_{0,i} = \Theta_{Z_0 i} \Pi_i$.
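Equation (7) together with the two recursions gives one complete EM iteration. A compact, self-contained sketch with a hypothetical discrete model (Theta_X[i, j] = P(X_t = i | X_{t-1} = j), Theta_Z[i, j] = P(Z_t = i | X_t = j)): the normalized $\xi_t(i,j) = \Theta_{Z_t i}\,\beta_{t,i}\,\Theta_{Xij}\,\alpha_{t-1,j}/P(Z)$ is summed over $t$ to get $E[N_{Xij}]$, then pseudo-counts are added and columns normalized, as in Section 2.2:

```python
import numpy as np

def baum_welch_step(z, Pi, Theta_X, Theta_Z, prior=0.0):
    """One EM iteration for a discrete HMM on a single observation sequence z."""
    T, Nx = len(z), len(Pi)
    # Forward: alpha[t, j] = P(X_t = j, Z_0..Z_t), alpha_0 = Theta_Z[z_0] * Pi.
    alpha = np.zeros((T, Nx))
    alpha[0] = Theta_Z[z[0]] * Pi
    for t in range(1, T):
        alpha[t] = Theta_Z[z[t]] * (Theta_X @ alpha[t - 1])
    # Backward: beta[t, i] = P(Z_{t+1}..Z_T | X_t = i), beta_T uniform.
    beta = np.ones((T, Nx))
    for t in range(T - 2, -1, -1):
        beta[t] = Theta_X.T @ (Theta_Z[z[t + 1]] * beta[t + 1])
    lik = alpha[-1].sum()  # P(Z_0..Z_T)
    # E step: E[N_Xij] = sum_t xi_t(i, j), eq. (7), normalized by P(Z).
    N_X = np.zeros((Nx, Nx))
    for t in range(1, T):
        N_X += (Theta_Z[z[t]] * beta[t])[:, None] * Theta_X * alpha[t - 1][None, :] / lik
    gamma = alpha * beta / lik  # gamma[t, i] = P(X_t = i | Z)
    N_Z = np.zeros_like(Theta_Z)
    for t in range(T):
        N_Z[z[t]] += gamma[t]
    # M step: expected counts plus Dirichlet pseudo-counts, columns normalized.
    Theta_X = (N_X + prior) / (N_X + prior).sum(axis=0, keepdims=True)
    Theta_Z = (N_Z + prior) / (N_Z + prior).sum(axis=0, keepdims=True)
    Pi = (gamma[0] + prior) / (gamma[0] + prior).sum()
    return Pi, Theta_X, Theta_Z, lik
```

With `prior=0` this is the maximum-likelihood Baum-Welch update; a positive `prior` adds the note's $+\alpha$ pseudo-counts.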
4 Baum-Welch for POMDPs
Consider further that we not only have evidence conditioned on (or generated by) the hidden states, $X$, but that there is evidence which the hidden states are conditioned on, as shown in Figure 2.

Figure 2: POMDP as a dynamic Bayesian network.

This type of model is called a partially observable Markov decision process (POMDP) if the evidence $e^+_t$ are actions, $a$, taken from a set of possible actions, $A$. Now the transition probabilities $\Theta_{Xij}$ are replaced by conditional probabilities for the hidden states given the actions and the previous states, $\Theta_{Xijk} = P(X_{t,i} | e^+_{t,k} X_{t-1,j})$. Now we are interested in examining the probability distributions conditioned on both sets of evidence, $e^+$ and $e^-$. We refer to the set of both evidences as simply $e = e^+_0 \ldots e^+_T, e^-_0 \ldots e^-_T$, and $e_t = e^+_t, e^-_t$.
    P(X_{t,i}, X_{t-1,j} | e)
    = P(e_t \ldots e_T | X_{t,i}, X_{t-1,j}, e_0 \ldots e_{t-1}) \, P(X_{t,i}, X_{t-1,j} | e_0 \ldots e_{t-1})
    = P(e_t \ldots e_T | X_{t,i} X_{t-1,j}) \, P(X_{t,i} | X_{t-1,j} e_0 \ldots e_{t-1}) \, P(X_{t-1,j} | e_0 \ldots e_{t-1})
    = P(e_t | X_{t,i} X_{t-1,j} e_{t+1} \ldots e_T) \, P(e_{t+1} \ldots e_T | X_{t,i} X_{t-1,j}) \, P(X_{t,i} | X_{t-1,j}) \, \frac{P(X_{t-1,j} e_0 \ldots e_{t-1})}{P(e_0 \ldots e_{t-1})}
    = P(e^-_t e^+_t | X_{t,i} X_{t-1,j}) \, P(e_{t+1} \ldots e_T | X_{t,i}) \, P(X_{t,i} | X_{t-1,j}) \, P(X_{t-1,j} e_0 \ldots e_{t-1})
    = P(e^-_t | X_{t,i} e^+_t) \, P(e^+_t | X_{t,i} X_{t-1,j}) \left[ \sum_k \Theta_{Xijk} \, P(e^+_{t,k} | X_{t-1,j}) \right] \beta_{t,i} \, \alpha_{t-1,j}
    = P(e^-_t | X_{t,i}) \, P(X_{t,i} | e^+_t X_{t-1,j}) \, P(e^+_t | X_{t-1,j}) \left[ \sum_k \Theta_{Xijk} \, P(e^+_{t,k} | X_{t-1,j}) \right]^{-1} \left[ \sum_k \Theta_{Xijk} \, P(e^+_{t,k} | X_{t-1,j}) \right] \beta_{t,i} \, \alpha_{t-1,j}
    = \Theta^{e^-_t}_{i} \, \Theta_{Xij e^+_t} \, P(e^+_t | X_{t-1,j}) \, \beta_{t,i} \, \alpha_{t-1,j},

where $P(X_{t,i} | X_{t-1,j}) = \sum_k \Theta_{Xijk} \, P(e^+_{t,k} | X_{t-1,j})$ marginalizes the action under the policy, and $\Theta_{Xij e^+_t}$ denotes $\Theta_{Xijk}$ evaluated at the action taken at time $t$.

The new term which shows up is the $P(e^+_t | X_{t-1,j})$, which is referred to as the policy in a POMDP. With a deterministic policy, this term simply fixes the value of $e^+_t$ as the action taken. However, in a more general case, the policy is probabilistic, and the 'actions' $e^+$ are not fixed. We can evaluate the alpha and beta terms much as before, but they now include the policy term as well. The alpha term is:
    \alpha_{t,j} = P(X_{t,j} e_0 \ldots e_t)
    = P(e_t | X_{t,j} e_0 \ldots e_{t-1}) \, P(X_{t,j} e_0 \ldots e_{t-1})
    = P(e^-_t | X_{t,j}) \sum_k P(X_{t,j} | X_{t-1,k} e^+_t) \, P(e^+_t | X_{t-1,k}) \, P(X_{t-1,k} e_0 \ldots e_{t-1})
    = \Theta^{e^-_t}_{j} \sum_k \Theta_{Xjk e^+_t} \, P(e^+_t | X_{t-1,k}) \, \alpha_{t-1,k}
The beta term gives:

    \beta_{t,i} = P(e_{t+1} \ldots e_T | X_{t,i})
    = \sum_k P(e^-_{t+1} | X_{t+1,k}) \, P(e_{t+2} \ldots e_T | X_{t+1,k}) \, P(X_{t+1,k} | e^+_{t+1} X_{t,i}) \, P(e^+_{t+1} | X_{t,i})
    = \sum_k \Theta^{e^-_{t+1}}_{k} \, \beta_{t+1,k} \, \Theta_{Xki e^+_{t+1}} \, P(e^+_{t+1} | X_{t,i})
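Under a deterministic policy the forward pass changes only in that the action at time $t$ selects which transition matrix applies (the policy term is 1 for the taken action). A sketch with a hypothetical 2-state, 2-action, 2-observation model, using the layout Theta_X[i, j, k] = P(X_t = i | e+_t = k, X_{t-1} = j):

```python
import numpy as np

def forward_pomdp(obs, acts, Pi, Theta_X, Theta_Z):
    """alpha[t, j] = P(X_t = j, e_0..e_t) with a deterministic policy:
    acts[t] picks the transition matrix Theta_X[:, :, acts[t]], so the
    policy term P(e+_t | X_{t-1}) is 1 for the action actually taken."""
    T, Nx = len(obs), len(Pi)
    alpha = np.zeros((T, Nx))
    alpha[0] = Theta_Z[obs[0]] * Pi
    for t in range(1, T):
        A = Theta_X[:, :, acts[t]]  # P(X_t = i | e+_t = acts[t], X_{t-1} = j)
        alpha[t] = Theta_Z[obs[t]] * (A @ alpha[t - 1])
    return alpha

# Hypothetical model: action 0 tends to hold the state, action 1 to flip it.
Pi = np.array([0.5, 0.5])
Theta_X = np.zeros((2, 2, 2))
Theta_X[:, :, 0] = [[0.9, 0.2], [0.1, 0.8]]
Theta_X[:, :, 1] = [[0.2, 0.9], [0.8, 0.1]]
Theta_Z = np.array([[0.8, 0.3],
                    [0.2, 0.7]])
alpha = forward_pomdp([0, 1, 0], [0, 1, 0], Pi, Theta_X, Theta_Z)
```

A probabilistic policy would instead mix the per-action matrices with weights $P(e^+_{t,k} | X_{t-1,j})$, exactly the policy term in the recursions above.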
References

[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data using the EM algorithm. Journal of the Royal Statistical Society, 39(B), 1977.

[2] Jesse Hoey and James J. Little. Representation and recognition of complex human motion. In Proc. IEEE CVPR, Hilton Head, SC, June 2000.

[3] Stephen P. Luttrell. Partitioned mixture distribution: an adaptive Bayesian network for low-level vision processing. IEEE Proc. on Vision, Image and Signal Processing, 141(4):251-260, 1994.

[4] Carlos Morimoto, Yaser Yacoob, and Larry Davis. Recognition of head gestures using hidden Markov models. In Proceedings of ICPR, pages 461-465, Austria, 1996.

[5] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-296, February 1989.