Expectation-Maximization algorithm on ELBO
EM (Expectation-Maximization): used when we do not have a complete data set of observations from X — some variables are latent — so we ask: given the current parameters, how likely are the unobserved variables that would explain what we did see?
Again, this is not quite approximate inference; rather, the goal is to learn the parameters with the help of an (approximate) posterior.
Many algorithms take the shape of a two-step process per iteration, and they share the downside of mutual reliance between the two updates.
Recall that the ELBO introduces the distribution q for us to estimate on top of the set of model parameters theta.
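To fix notation (this is the standard decomposition, restated here for reference): for any distribution q over the latent variable z,

$$ \log p(x; \theta) = \underbrace{\mathbb{E}_{q}\!\left[\log \frac{p(x, z; \theta)}{q(z)}\right]}_{\mathcal{L}(q, \theta)\ \text{(ELBO)}} + \mathrm{KL}\!\left(q(z)\,\|\,p(z \mid x; \theta)\right), $$

so the ELBO lower-bounds the log-likelihood, with equality exactly when q matches the posterior.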
The EM procedure takes turns making updates to q and theta. In summary:
- E step: set q to the posterior p(z | x; theta) under the current theta.
- M step: update theta to maximize the ELBO with q held fixed.

These two steps are alternated until convergence.
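The alternation can be sketched concretely on a toy model — a 1D two-component Gaussian mixture, chosen here purely as an illustration (the data, initialization, and iteration count are all hypothetical, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1D mixture of two Gaussians with means -2 and 3.
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# theta = (mixing weights pi, means mu, variances var), crudely initialized.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def log_likelihood(x, pi, mu, var):
    # log p(x; theta), summed over the sample
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

ll_history = []
for _ in range(50):
    # E step: set q(z_i) to the exact posterior p(z_i | x_i; theta),
    # which closes the ELBO gap at the current theta.
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    q = dens / dens.sum(axis=1, keepdims=True)   # responsibilities, shape (n, 2)

    # M step: maximize the ELBO over theta with q fixed — closed form
    # here precisely because the Gaussian is in the exponential family.
    n_k = q.sum(axis=0)
    pi = n_k / len(x)
    mu = (q * x[:, None]).sum(axis=0) / n_k
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / n_k

    ll_history.append(log_likelihood(x, pi, mu, var))
```

Tracking `ll_history` makes the monotone-improvement property visible: the log-likelihood never decreases across iterations.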
Can be thought of as a coordinate ascent procedure to maximize L.
More specifically, at every step we are guaranteed not to decrease L, and eventually we converge to a point with zero gradient — just like gradient ascent, which can be seen as a special case of this where the M step takes only a small step instead of fully maximizing.
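The improvement guarantee follows in two lines from the standard decomposition $\log p(x; \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p(z \mid x; \theta))$: the E step at iteration t sets $q_t = p(z \mid x; \theta_t)$, which zeroes the KL term, so

$$ \mathcal{L}(q_t, \theta_t) = \log p(x; \theta_t). $$

The M step then picks $\theta_{t+1}$ to maximize $\mathcal{L}(q_t, \cdot)$, and since the ELBO always lower-bounds the log-likelihood,

$$ \log p(x; \theta_{t+1}) \;\geq\; \mathcal{L}(q_t, \theta_{t+1}) \;\geq\; \mathcal{L}(q_t, \theta_t) \;=\; \log p(x; \theta_t). $$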
Pros: the M step can sometimes be solved exactly (say, within the exponential family) — instead of making incremental progress on theta for a given q, we obtain the optimal theta in closed form at reasonable computational cost.
Cons: can be very slow. Convergence is delicate — we are only guaranteed a stationary point, not a global (or even local) maximum. There is also an implicit convexity assumption (though in practice its effect is weakened).
Note that in computations similar to this one (MAP estimation and variational inference included), we are implicitly programming some mutual reliance between the eventual
$$ \{\theta, q\} $$
values, about which we ought to be careful (a chicken-and-egg problem).