Expectation Maximization for Hidden Markov Models: Derived from First Principles

Jesse Hoey

December 15, 2000

Abstract

Expectation maximization is a popular technique for deriving MAP estimates of model parameters. A successful application is learning the parameters of hidden Markov models (HMMs). This note derives the Baum-Welch learning algorithm for HMMs from first principles.

1 Introduction

A hidden Markov model is a Bayesian network as shown (unrolled) in Figure 1. A set of hidden states $X_t$ generates a set of observed variables $Z_t$ at each of a set of times $t = 0, \ldots, T$. We refer to the set of states at all times as $X = \{X_0, \ldots, X_T\}$ and $Z = \{Z_0, \ldots, Z_T\}$. We assume for now that both the observations and the hidden variables are discrete valued, with $N_Z$ and $N_X$ values each. At time $t$, $Z_t$ is assumed conditionally independent of all other variables at all other times given its parent $X_t$, and the usual Markovian independence assumption holds between the $X_t$s. The parameters of the model are threefold. First, the transition probabilities, $\Theta_{x_{ij}} = P(x_{t,i} \mid x_{t-1,j})$, give the probability that the hidden state at time $t$ will be $X_t = i$, given that the state at the previous time step $t-1$ was $X_{t-1} = j$. Second, the starting probabilities, $\Pi_i = P(X_{0,i})$, are the probabilities that the system starts in state $X_0 = i$. Third, the emission probabilities, $\Theta_{z_{ij}} = P(Z_{t,i} \mid X_{t,j})$, give the probability that observation $Z_t = i$ will be made while in state $X_t = j$. We refer to the set of all parameters as $\Theta = \{\Theta_x, \Theta_z, \Pi\}$.

Figure 1: Hidden Markov model as a dynamic Bayesian network (hidden states $x_0, \ldots, x_t, x_{t+1}, \ldots, x_T$, each emitting an observation $z_0, \ldots, z_t, z_{t+1}, \ldots, z_T$).

Hidden Markov models have been extensively studied in the context of speech recognition [5], along with many other applications in areas such as vision [2, 4]. The focus of this note is learning the maximum a-posteriori estimates of the parameters, $\Theta$, given by the values which maximize $P(Z, \Theta)$.
The standard technique is an application of the expectation-maximization (EM) algorithm of [1], which results in an efficient recursive estimation technique known as the Baum-Welch or forward-backward training procedure. This note will provide a simple derivation of this particular application of the EM algorithm. The first part (Section 2) borrows heavily from the derivation in [3], as concisely described by Thomas Minka in his excellent set of tutorial notes (http://www-white.media.mit.edu/~tpminka/papers/tutorial.html; the EM tutorial is at ftp://vismod.www.media.mit.edu/pub/tpminka/papers/minka-em-tut.ps.gz), and gives a general derivation of the expectation and maximization steps which can be applied to any model. This is followed in Section 3 by a derivation of the forward-backward equations assuming multinomial distributions and Dirichlet priors.

2 EM derivation

2.1 General EM

We wish to find the values of $\Theta$ which maximize

\[ f(\Theta) = P(Z, \Theta) = \int_X P(Z, X, \Theta), \]

where the integral is over the values of the hidden variables $X$ in the HMM. We lower bound the value of $f(\Theta)$ with a function $q(X)$:

\[ f(\Theta) = \int_X q(X) \frac{P(Z, X, \Theta)}{q(X)} \geq \prod_X \left( \frac{P(Z, X, \Theta)}{q(X)} \right)^{q(X)}, \tag{1} \]

where the inequality is given by Jensen's inequality. Taking logarithms, we get

\[ G(\Theta, q) = \int_X q(X) \log P(Z, X, \Theta) - \int_X q(X) \log q(X). \tag{2} \]

Then, at the current guess for $\Theta$, $\Theta'$, we choose $q$ to maximize $G$, so that $G$ touches $f$ at $\Theta'$. We use the constraint that $\int_X q(X) = 1$ and get

\[ q(X) = \frac{P(Z, X, \Theta)}{\int_X P(Z, X, \Theta)} = \frac{P(Z, X, \Theta)}{P(Z, \Theta)} = P(X \mid Z, \Theta). \]

The idea is then to:

1. Choose $q(X)$ to maximize the bound at the current guess, $\Theta'$. This just means getting $P(X \mid Z, \Theta')$ and is the "E" step of EM.

2. Maximize the bound over $\Theta$. This means maximizing (2) with the $q(X)$ derived in the "E" step. That is, we maximize

\[ \int_X P(X \mid Z, \Theta') \log P(Z, X, \Theta). \tag{3} \]

2.2 Multinomial-Dirichlet EM

This section will specialize the above procedure for probability distributions in the multinomial family, with Dirichlet conjugate priors.
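Before specializing, the bound above can be checked numerically. The following sketch (the joint probabilities are made-up values, purely for illustration) treats a binary hidden variable and verifies that $G(\Theta, q) \leq \log f(\Theta)$ for an arbitrary $q$, with equality when $q(X) = P(X \mid Z, \Theta)$, as the E step claims:

```python
import numpy as np

# Toy joint P(Z, X, theta) for a fixed Z and theta, over a binary hidden
# variable X; the numbers are invented for illustration only.
p = np.array([0.12, 0.03])
log_f = np.log(p.sum())  # log P(Z, theta) = log of the sum over X

def G(q):
    """The lower bound of equation (2): E_q[log P(Z,X,theta)] + H(q)."""
    return np.sum(q * np.log(p)) - np.sum(q * np.log(q))

uniform = np.array([0.5, 0.5])      # an arbitrary choice of q(X)
posterior = p / p.sum()             # the E-step choice q(X) = P(X|Z,theta)

assert G(uniform) < log_f               # strict lower bound for suboptimal q
assert np.isclose(G(posterior), log_f)  # bound is tight at the posterior
```

Writing $G$ as $\sum_X q \log(p/q)$ makes the equality case transparent: at the posterior, $p/q$ is the constant $\sum_X p$, so the bound collapses to $\log f(\Theta)$ exactly.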
We show the derivation for $\Theta_x$ in what follows. The other parameters can be similarly derived. Maximization of equation (3) can be performed by noting that there is a constraint on $\Theta_x$,

\[ \sum_i \Theta_{x_{ij}} = 1, \]

which means that we want to solve

\[ \frac{\partial}{\partial \Theta_{x_{ij}}} \left[ \int_X P(X \mid Z, \Theta') \log P(Z, X, \Theta) - \lambda \left( \sum_i \Theta_{x_{ij}} - 1 \right) \right] = 0, \]

which is

\[ \int_X P(X \mid Z, \Theta') \frac{\partial}{\partial \Theta_{x_{ij}}} \left[ \log P(Z, X \mid \Theta) P(\Theta) \right] = \lambda. \tag{4} \]

Assuming that the likelihood function of the completed data is given by a multinomial distribution,

\[ P(Z, X \mid \Theta) = \prod_{ij} \Theta_{x_{ij}}^{N_{x_{ij}}} \Theta_{z_{ij}}^{N_{z_{ij}}} \prod_i \Pi_i^{N_i}, \]

where $N_{x_{ij}}$ is the number of times $X_t = i$ when $X_{t-1} = j$ in the (completed) data, and assuming further that the priors are given by a Dirichlet distribution,

\[ P(\Theta) = P(\Theta_x) P(\Theta_z) P(\Pi) = \mathrm{Dir}(\Theta_x \mid \alpha_x) \mathrm{Dir}(\Theta_z \mid \alpha_z) \mathrm{Dir}(\Pi \mid \alpha), \]

then equation (4) becomes

\[ \int_X P(X \mid Z, \Theta') \frac{\partial}{\partial \Theta_{x_{ij}}} \left[ \sum_{ij} (N_{x_{ij}} + \alpha_{x_{ij}}) \log \Theta_{x_{ij}} + \sum_{ij} (N_{z_{ij}} + \alpha_{z_{ij}}) \log \Theta_{z_{ij}} + \sum_i (N_i + \alpha_i) \log \Pi_i \right] = \lambda. \]

And therefore,

\[ \int_X P(X \mid Z, \Theta') \frac{N_{x_{ij}} + \alpha_{x_{ij}}}{\Theta_{x_{ij}}} = \lambda, \]

and solving for $\lambda$ using the constraint,

\[ \Theta_{x_{ij}} = \frac{\int_X P(X \mid Z, \Theta') N_{x_{ij}} + \alpha_{x_{ij}}}{\sum_i \left[ \int_X P(X \mid Z, \Theta') N_{x_{ij}} + \alpha_{x_{ij}} \right]} = \frac{E_{P(X \mid Z, \Theta')}(N_{x_{ij}}) + \alpha_{x_{ij}}}{\sum_i \left[ E_{P(X \mid Z, \Theta')}(N_{x_{ij}}) + \alpha_{x_{ij}} \right]}. \]

So we see that the parameters can be updated by simply taking the expected counts, which form the sufficient statistics for the multinomial distribution.

3 Baum-Welch Derivation

Our goal is then to find the expectations $E_{P(X \mid Z, \Theta')}(N_{x_{ij}})$, which is

\[ E_{P(X \mid Z, \Theta')}(N_{x_{ij}}) = \int_X P(X \mid Z, \Theta') N_{x_{ij}} \tag{5} \]
\[ = \int_X P(X_0 \ldots X_T \mid Z_0 \ldots Z_T, \Theta') \sum_t \delta(X_{t,i}) \delta(X_{t-1,j}), \tag{6} \]

where

\[ \delta(X_{t,i}) = \begin{cases} 1 & \text{if } X_t = i \\ 0 & \text{otherwise.} \end{cases} \]

The sum over $t$ can be taken outside the integrations in (6), and the integrations over $X_0 \ldots X_{t-2}, X_{t+1} \ldots X_T$ can be immediately performed, each giving factors of 1. Further, the integrations over $X_{t-1}$ and $X_t$ can be performed, since the $\delta$-functions simply pick out a particular value of these variables: $X_{t-1,j}$ and $X_{t,i}$.
Thus, we are left with:

\[ E_{P(X \mid Z, \Theta')}(N_{x_{ij}}) = \sum_{t=1}^{T} P(X_{t,i}, X_{t-1,j} \mid Z, \Theta'), \tag{7} \]

the expected number of times $X_t = i$ and $X_{t-1} = j$ given the data, $Z = Z_0 \ldots Z_T$, and the current model parameters, $\Theta'$. Factoring the term in the sum by splitting the data $Z$ up into two sets $Z_0 \ldots Z_{t-1}$ and $Z_t \ldots Z_T$ gives the Baum-Welch equations for updating the parameters of the HMM using expectation maximization (we leave out the $\Theta'$ upon which every term is conditioned):

\begin{align*}
P(X_{t,i}, X_{t-1,j} \mid Z) &\propto P(Z_t \ldots Z_T \mid X_{t,i}, X_{t-1,j}, Z_0 \ldots Z_{t-1}) P(X_{t,i}, X_{t-1,j}, Z_0 \ldots Z_{t-1}) \\
&= P(Z_t \ldots Z_T \mid X_{t,i} X_{t-1,j}) P(X_{t,i} \mid X_{t-1,j} Z_0 \ldots Z_{t-1}) P(X_{t-1,j} Z_0 \ldots Z_{t-1}) \\
&= P(Z_t \mid X_{t,i} X_{t-1,j} Z_{t+1} \ldots Z_T) P(Z_{t+1} \ldots Z_T \mid X_{t,i} X_{t-1,j}) P(X_{t-1,j} Z_0 \ldots Z_{t-1}) P(X_{t,i} \mid X_{t-1,j}) \\
&= P(Z_t \mid X_{t,i}) P(Z_{t+1} \ldots Z_T \mid X_{t,i}) P(X_{t,i} \mid X_{t-1,j}) P(X_{t-1,j} Z_0 \ldots Z_{t-1}) \\
&= \Theta_{z_{i z_t}} \beta_{t,i} \Theta_{x_{ij}} \alpha_{t-1,j},
\end{align*}

which is exactly the (unnormalized) equation (37) from [5]. We use $\Theta_{z_{i z_t}}$, where $z_t$ represents the value of the observation $Z_t$. We have left to evaluate the $\alpha$ and $\beta$ terms (the forward and backward variables, respectively). The alpha term is simply the joint probability of $X_t = j$ and all the observations up to and including time $t$. It can be expanded recursively by summing over the previous states, $X_{t-1}$, as follows:

\begin{align*}
\alpha_{t,j} &= P(X_{t,j} Z_0 \ldots Z_t) \\
&= P(Z_t \mid X_{t,j} Z_0 \ldots Z_{t-1}) P(X_{t,j} Z_0 \ldots Z_{t-1}) \\
&= P(Z_t \mid X_{t,j}) \sum_k P(X_{t,j} X_{t-1,k} Z_0 \ldots Z_{t-1}) \\
&= P(Z_t \mid X_{t,j}) \sum_k P(X_{t,j} \mid X_{t-1,k}) P(X_{t-1,k} Z_0 \ldots Z_{t-1}) \\
&= \Theta_{z_{j z_t}} \sum_k \Theta_{x_{jk}} \alpha_{t-1,k}.
\end{align*}

Similarly, the beta term is the probability of all observations posterior to time $t$, given the state at time $t$, $X_t$. It can be expanded recursively by summing over all next states, $X_{t+1}$, as follows:

\begin{align*}
\beta_{t,i} &= P(Z_{t+1} \ldots Z_T \mid X_{t,i}) \\
&= \sum_k P(Z_{t+1} \mid X_{t+1,k}) P(Z_{t+2} \ldots Z_T \mid X_{t+1,k}) P(X_{t+1,k} \mid X_{t,i}) \\
&= \sum_k \Theta_{z_{k z_{t+1}}} \beta_{t+1,k} \Theta_{x_{ki}}.
\end{align*}

The terms $\alpha_0$ and $\beta_T$ must be evaluated separately. The $\beta_{T,i}$ are initialized evenly, and $\alpha_{0,i} = \Theta_{z_{i z_0}} \Pi_i$.
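The recursions above, together with the multinomial-Dirichlet M step of Section 2.2, can be sketched numerically. This is an illustrative implementation under the note's conventions (transition matrix indexed as $[i, j]$ = new state, old state, so columns sum to one); the function names and array layout are our own choices, not standard library API:

```python
import numpy as np

def forward_backward(z, theta_x, theta_z, pi):
    """Forward and backward variables for a discrete HMM.

    z[t] is the observation index at time t; theta_x[i, j] = P(X_t=i | X_{t-1}=j);
    theta_z[i, j] = P(Z_t=i | X_t=j); pi[i] = P(X_0=i).
    Returns alpha[t, j] = P(X_t=j, Z_0..Z_t) and beta[t, i] = P(Z_{t+1}..Z_T | X_t=i).
    """
    T, n = len(z), len(pi)
    alpha, beta = np.zeros((T, n)), np.ones((T, n))  # beta_T initialized evenly
    alpha[0] = theta_z[z[0]] * pi                    # alpha_{0,i} = Theta_z[z_0,i] Pi_i
    for t in range(1, T):
        alpha[t] = theta_z[z[t]] * (theta_x @ alpha[t - 1])
    for t in range(T - 2, -1, -1):
        beta[t] = theta_x.T @ (theta_z[z[t + 1]] * beta[t + 1])
    return alpha, beta

def em_transition_update(z, theta_x, theta_z, pi, alpha_x):
    """One EM update of theta_x: the expected counts of equation (7) plus
    Dirichlet pseudo-counts alpha_x, normalized over the first index."""
    alpha, beta = forward_backward(z, theta_x, theta_z, pi)
    counts = np.zeros_like(theta_x)
    for t in range(1, len(z)):
        # joint[i, j] ~ Theta_z[z_t, i] beta_{t,i} Theta_x[i, j] alpha_{t-1,j}
        joint = (theta_z[z[t]] * beta[t])[:, None] * theta_x * alpha[t - 1][None, :]
        counts += joint / joint.sum()                # P(X_t=i, X_{t-1}=j | Z)
    new_theta = counts + alpha_x
    return new_theta / new_theta.sum(axis=0, keepdims=True)
```

A useful sanity check on the recursions is that $\sum_i \alpha_{t,i} \beta_{t,i} = P(Z)$ for every $t$, and that each column of the updated transition matrix sums to one.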
4 Baum-Welch for POMDPs

Consider further that we not only have evidence conditioned on (or generated by) the hidden states, $X$, but that there is evidence which the hidden states are conditioned on, as shown in Figure 2.

Figure 2: POMDP as a dynamic Bayesian network (hidden states $x_0, \ldots, x_T$, with conditioning evidence $e^+_0, \ldots, e^+_T$ and generated evidence $e^-_0, \ldots, e^-_T$).

This type of model is called a partially observable Markov decision process (POMDP) if the evidence $e^+$ are actions, $a$, taken from a set of possible actions, $A$. Now the transition probabilities are replaced by conditional probabilities for the hidden states given the actions and the previous states, $\Theta_{x_{ijk}} = P(X_{t,i} \mid e^+_{t,k} X_{t-1,j})$. Now we are interested in examining the probability distributions conditioned on both sets of evidence, $e^+$ and $e^-$. We refer to the set of both evidences as simply $e = e_0 \ldots e_T$, where $e_t = \{e^+_t, e^-_t\}$. Factoring as in the HMM case, but using the network structure of Figure 2 (the action $e^+_t$ is a parent of $X_t$, and $e^-_t$ is generated by $X_t$):

\begin{align*}
P(X_{t,i}, X_{t-1,j} \mid e) &\propto P(e_t \ldots e_T \mid X_{t,i}, X_{t-1,j}, e_0 \ldots e_{t-1}) P(X_{t,i}, X_{t-1,j}, e_0 \ldots e_{t-1}) \\
&= P(e^-_t \mid X_{t,i}) P(e_{t+1} \ldots e_T \mid X_{t,i}) P(X_{t,i} \mid e^+_t X_{t-1,j}) P(e^+_t \mid X_{t-1,j}) P(X_{t-1,j} e_0 \ldots e_{t-1}) \\
&= P(e^-_t \mid X_{t,i}) \, \beta_{t,i} \, \Theta_{x_{ij e^+_t}} \, P(e^+_t \mid X_{t-1,j}) \, \alpha_{t-1,j}.
\end{align*}

The new term which shows up is the $P(e^+_t \mid X_{t-1,j})$, which is referred to as the policy in a POMDP. With a deterministic policy, this term simply fixes the value of $e^+_t$ as the action taken. However, in a more general case, the policy is probabilistic, and the 'actions' $e^+_t$ are not fixed. We can evaluate the alpha and beta terms much as before, but they now include the policy term as well.
The alpha term is:

\begin{align*}
\alpha_{t,j} &= P(X_{t,j} e_0 \ldots e_t) \\
&= P(e_t \mid X_{t,j} e_0 \ldots e_{t-1}) P(X_{t,j} e_0 \ldots e_{t-1}) \\
&= P(e^-_t \mid X_{t,j}) \sum_k P(X_{t,j} \mid X_{t-1,k} e^+_t) P(e^+_t \mid X_{t-1,k}) P(X_{t-1,k} e_0 \ldots e_{t-1}) \\
&= P(e^-_t \mid X_{t,j}) \sum_k \Theta_{x_{jk e^+_t}} P(e^+_t \mid X_{t-1,k}) \alpha_{t-1,k}.
\end{align*}

The beta term gives:

\begin{align*}
\beta_{t,i} &= P(e_{t+1} \ldots e_T \mid X_{t,i}) \\
&= \sum_k P(e^-_{t+1} \mid X_{t+1,k}) P(e_{t+2} \ldots e_T \mid X_{t+1,k}) P(X_{t+1,k} \mid e^+_{t+1} X_{t,i}) P(e^+_{t+1} \mid X_{t,i}) \\
&= \sum_k P(e^-_{t+1} \mid X_{t+1,k}) \, \beta_{t+1,k} \, \Theta_{x_{ki e^+_{t+1}}} \, P(e^+_{t+1} \mid X_{t,i}).
\end{align*}

References

[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1977.

[2] Jesse Hoey and James J. Little. Representation and recognition of complex human motion. In Proc. IEEE CVPR, Hilton Head, SC, June 2000.

[3] Stephen P. Luttrell. Partitioned mixture distribution: an adaptive Bayesian network for low-level vision processing. IEE Proc. on Vision, Image and Signal Processing, 141(4):251-260, 1994.

[4] Carlos Morimoto, Yaser Yacoob, and Larry Davis. Recognition of head gestures using hidden Markov models. In Proceedings of ICPR, pages 461-465, Austria, 1996.

[5] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, February 1989.
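To make the action-conditioned recursion concrete, here is a sketch of the POMDP forward pass with a probabilistic policy passed in as a table. The variable names and array layout are our own; for simplicity the sketch assumes $X_0$ has no action parent, so the first action is ignored. With a single action taken with probability one, it reduces to the plain HMM forward recursion:

```python
import numpy as np

def pomdp_forward(e_minus, e_plus, theta_x, theta_e, policy, pi):
    """Forward recursion with action-conditioned transitions (a sketch).

    e_minus[t], e_plus[t] : indices of the generated evidence and the action
    theta_x[i, j, k]      : P(X_t=i | e+_t=k, X_{t-1}=j)
    theta_e[m, i]         : P(e-_t=m | X_t=i)
    policy[k, j]          : P(e+_t=k | X_{t-1}=j), the policy term
    pi[i]                 : P(X_0=i)
    Returns alpha[t, j] = P(X_t=j, e_0 .. e_t).
    """
    T, n = len(e_minus), len(pi)
    alpha = np.zeros((T, n))
    alpha[0] = theta_e[e_minus[0]] * pi  # assumes X_0 has no action parent
    for t in range(1, T):
        a = e_plus[t]
        # trans[i, j] = P(X_t=i | e+_t=a, X_{t-1}=j) * P(e+_t=a | X_{t-1}=j)
        trans = theta_x[:, :, a] * policy[a][None, :]
        alpha[t] = theta_e[e_minus[t]] * (trans @ alpha[t - 1])
    return alpha
```

As a design check, passing `theta_x` with one action slice and a policy that always selects it recovers exactly the HMM forward variables of Section 3.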
