Expectation-Maximization algorithm on ELBO
EM (Expectation-Maximization): used when we do not have a complete data set of observations from X — some variables are latent — so we ask: given the current parameters, how likely are the unobserved variables that would explain what we did see?
Again, this is not quite approximate inference; rather, the goal is to learn the parameters with the help of an (approximate) posterior.
Many algorithms take the shape of a two-step process per iteration, and they share the downside of mutual reliance between the two updates.
Recall that the ELBO introduces the distribution q for us to estimate on top of the set of model parameters theta.
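To fix notation (this is the standard decomposition, restated here for reference): for any distribution q over the latent variable z,

$$ \log p(x; \theta) = \underbrace{\mathbb{E}_{q}\!\left[\log \frac{p(x, z; \theta)}{q(z)}\right]}_{\mathcal{L}(q, \theta)\ \text{(ELBO)}} + \mathrm{KL}\!\left(q(z)\,\|\,p(z \mid x; \theta)\right), $$

so the ELBO lower-bounds the log-likelihood, with equality exactly when q matches the posterior.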
The EM procedure takes turns making updates to q and theta. In summary:
- E step: set q to the posterior p(z | x; theta) under the current theta.
- M step: update theta to maximize the ELBO with q held fixed.

These two steps are alternated until convergence.
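The alternation can be sketched concretely on a toy model — a 1D two-component Gaussian mixture, chosen here purely as an illustration (the data, initialization, and iteration count are all hypothetical, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1D mixture of two Gaussians with means -2 and 3.
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

# theta = (mixing weights pi, means mu, variances var), crudely initialized.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def log_likelihood(x, pi, mu, var):
    # log p(x; theta), summed over the sample
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

ll_history = []
for _ in range(50):
    # E step: set q(z_i) to the exact posterior p(z_i | x_i; theta),
    # which closes the ELBO gap at the current theta.
    dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    q = dens / dens.sum(axis=1, keepdims=True)   # responsibilities, shape (n, 2)

    # M step: maximize the ELBO over theta with q fixed — closed form
    # here precisely because the Gaussian is in the exponential family.
    n_k = q.sum(axis=0)
    pi = n_k / len(x)
    mu = (q * x[:, None]).sum(axis=0) / n_k
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / n_k

    ll_history.append(log_likelihood(x, pi, mu, var))
```

Tracking `ll_history` makes the monotone-improvement property visible: the log-likelihood never decreases across iterations.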
Can be thought of as a coordinate ascent procedure to maximize L.
More specifically, at every step we are guaranteed not to decrease L, and eventually we converge to a point with zero gradient — just like gradient ascent, which can be seen as a special case of this where the M step takes only a small step instead of fully maximizing.
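The improvement guarantee follows in two lines from the standard decomposition $\log p(x; \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p(z \mid x; \theta))$: the E step at iteration t sets $q_t = p(z \mid x; \theta_t)$, which zeroes the KL term, so

$$ \mathcal{L}(q_t, \theta_t) = \log p(x; \theta_t). $$

The M step then picks $\theta_{t+1}$ to maximize $\mathcal{L}(q_t, \cdot)$, and since the ELBO always lower-bounds the log-likelihood,

$$ \log p(x; \theta_{t+1}) \;\geq\; \mathcal{L}(q_t, \theta_{t+1}) \;\geq\; \mathcal{L}(q_t, \theta_t) \;=\; \log p(x; \theta_t). $$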
Pros: the M step can sometimes be solved exactly (say, within the exponential family) — instead of making incremental progress on theta for a given q, we obtain the optimal theta in closed form at reasonable computational cost.
Cons: can be very slow. Convergence is delicate — we are only guaranteed a stationary point, not a global (or even local) maximum. There is also an implicit convexity assumption (though in practice its effect is weakened).
Note that in computations similar to this one (MAP estimation and variational inference included), we are implicitly programming some mutual reliance between the eventual
$$ \{\theta, q\} $$
values, about which we ought to be careful (a chicken-and-egg problem).