Note: This post is an exposition of the mathematics behind the variational autoencoder. A PDF version (with equation numbers and better formatting) can be seen here. I also have two earlier posts that are relevant to the variational autoencoder: one on the implementation of the variational autoencoder, and one on the reparameterization trick
The variational autoencoder (VA)1 is a nonlinear latent variable model with an efficient gradient-based training procedure based on variational principles. In latent variable models, we assume that the observed are generated from some latent (unobserved) ; these latent variables capture some “interesting” structure in the observed data that is not immediately visible from the observations themselves. For example, a latent variable model called independent components analysis can be used to separate the individual speech signals from a recording of people talking simultaneously. More formally, we can think of a latent variable model as a probability distribution describing the generative process (of how is generated from ) along with a prior on latent variables . This corresponds to the following simple graphical model
Learning in a latent variable model
Our purpose in such a model is to learn the generative process, i.e., (we assume is known). A good would assign high probabilities to observed ; hence, we can learn a good by maximizing the probability of observed data, i.e., . Assuming that is parameterized by , we need to solve the following optimization problem
where . This is a difficult optimization problem because it involves a possibly intractable integral over .
Posterior inference in a latent variable model
For the moment, let us set aside this learning problem and focus on a different one: posterior inference of . As we will see shortly, this problem is closely related to the learning problem and in fact leads to a method for solving it. Given and , we would like to infer the posterior distribution . This is usually rather difficult because it involves an integral over , . For most latent variable models, this integral cannot be evaluated, and needs to be approximated. For example, we can use Markov chain Monte Carlo techniques to sample from the posterior. However, here we will look at an alternative technique based on variational inference. Variational inference converts the posterior inference problem into the optimization problem of finding an approximate probability distribution that is as close as possible to . This can be formalized as solving the following optimization problem
where parameterizes the approximation and denotes the Kullback-Leibler divergence between and and is given by . However, this optimization problem is no easier than our original problem because it still requires us to evaluate . Let us see if we can get around that. Plugging in the definition of KL, we can write,
where we defined
Since is independent of , minimizing is equivalent to maximizing . Note that optimizing is much easier since it only involves which does not involve any intractable integrals. Hence, we can do variational inference of the posterior in a latent variable model by solving the following optimization problem
Back to the learning problem
The above derivation also suggests a way for learning the generative model . We see that is in fact a lower bound on the log probability of observed data
where we used the fact that KL is never negative. Now, assume instead of doing posterior inference, we fix and learn the generative model . Then is now a function of , . Since is a lower bound on , we can maximize as a proxy for maximizing . In fact, if , the KL term will be zero and , i.e., maximizing will be equivalent to maximizing . This suggests maximizing with respect to both and to learn and at the same time.
A brief aside on expectation maximization (EM)
EM can be seen as one particular strategy for solving the above maximization problem. In EM, the E step consists of calculating the optimal based on the current (which is the posterior ). In the M step, we plug the optimal into and maximize it with respect to . In other words, EM can be seen as a coordinate ascent procedure that maximizes with respect to and in alternation.
Solving the maximization problem
One can use various techniques to solve the above maximization problem. Here, we will focus on stochastic gradient ascent since the variational autoencoder uses this technique. In gradient-based approaches, we evaluate the gradient of our objective with respect to model parameters and take a small step in the direction of the gradient. Therefore, we need to estimate the gradient of . Assuming we have a set of samples from , we can form the following Monte Carlo estimate of
and . The derivative with respect to is easy to estimate since appears only inside the sum.
It is the derivative with respect to that is harder to estimate. We cannot simply push the gradient operator into the sum since the samples used to estimate are from which depends on . This can be seen by noting that , where . The standard estimator for the gradient of such expectations is in practice too high variance to be useful (See Appendix for further details). One key contribution of the variational autoencoder is a much more efficient estimate for that relies on what is called the reparameterization trick.
We would like to estimate the gradient of an expectation of the form . The problem is that the gradient with respect to is difficult to estimate because appears in the distribution with respect to which the expectation is taken. If we can somehow rewrite this expectation in such a way that appears only inside the expectation, we can simply push the gradient operator into the expectation. Assume that we can obtain samples from by sampling from a noise distribution and pushing them through a differentiable transformation
Then we can rewrite the expectation as follows
Assuming we have a set of samples from , we can form a Monte Carlo estimate of
Now appears only inside the sum, and the derivative of with respect to can be estimated in the same way we did for . This in essence is the reparameterization trick and reduces the variance in the estimates of dramatically, making it feasible to train large latent variable models. We can find an appropriate noise distribution and a differentiable transformation for many choices of approximate posterior (see the original paper1 for several strategies). We will see an example for the multivariate Gaussian distribution below when we talk about the variational autoencoder.
Variational Autoencoder (VA)
The above discussion of latent variable models is general, and the variational approach outlined above can be applied to any latent variable model. We can think of the variational autoencoder as a latent variable model that uses neural networks (specifically multilayer perceptrons) to model the approximate posterior and the generative model . More specifically, we assume that the approximate posterior is a multivariate Gaussian with a diagonal covariance matrix. The parameters of this Gaussian distribution are calculated by a multilayer perceptron (MLP) that takes as input. We denote this MLP with two nonlinear functions and that map from to the mean and standard deviation vectors respectively.
For the generative model , we assume is fixed to a unit multivariate Gaussian, i.e., . The form of depends on the nature of the data being modeled. For example, for real , we can use a multivariate Gaussian, or for binary , we can use a Bernoulli distribution. Here, let us assume that is real and is Gaussian. Again, we assume the parameters of are calculated by a MLP. However, note that this time the input to the MLP are , not . Denoting this MLP with two nonlinear functions and that map from to the mean and standard deviation vectors respectively, we have
Looking at the network architecture of this model, we can see why it is called an autoencoder.
The input is mapped probabilistically to a code by the encoder , which in turn is mapped probabilistically back to the input space by the decoder .
In order to learn and , the variational autoencoder uses the variational approach outlined above. We sample from and use these to obtain a Monte Carlo estimate of the variational lower bound . Then we take the derivative of this bound with respect to the parameters and use these in a stochastic gradient ascent procedure to learn and . As we discussed above, in order to reduce the variance of our gradient estimates, we apply the reparameterization trick. We would like to reparameterize the multivariate Gaussian distribution using a noise distribution and a differentiable transformation . We assume are sampled from a multivariate unit Gaussian, i.e., . Then if we let , will have the desired distribution ( denotes elementwise multiplication). Therefore, we can rewrite the variational lower bound using this reparameterization for as follows
There is one more simplification we can make. Writing explicitly as , we see
Since both and are Gaussian, the KL term has a closed form expression. Plugging that in, we get the following expression for the variational lower bound.
Here we assumed that has dimensions and used and to denote the th dimension of the mean and standard deviation vectors for . So far we have looked at a single data point . Now, assume that we have a dataset with datapoints and draw a random sample of datapoints. The variational lower bound estimate for the minibatch is simply the average of values for each
where is given above. In order to learn and , we can take the derivative of the above expression and use these in a stochastic gradient-ascent procedure.
An example application
As an example application, we train a variational autoencoder on the
handwritten digits dataset MNIST. We use multilayer perceptrons with one
hidden layer with 500 units for both the encoder () and
decoder () networks. Because we are interested in
visualizing the latent space, we set the number of latent dimensions to 2.
We use 1 sample per datapoint () and 100 datapoints per
minibatch (). We use stochastic gradient-ascent with a learning
rate of 0.001 and train for 500 epochs (i.e., 500 passes over the whole
dataset). The implementation can be seen online at
We provide two implementations, one in pure
theano and one in
lasagne. We plot the latent space by varying the latent along each of
the two dimensions and sampling from the learned generative model. We see that
the model is able to capture some interesting structure in the set of handwritten
digits (compare to Fig4b in the original paper1).
One common approach to estimating the gradient of such expectations is to make use of the identity
This estimator is known by various names in the literature from the REINFORCE algorithm to score function estimator or likelihood ratio trick. Given samples from , we can form an unbiased Monte Carlo estimate of the gradient of with respect to using this estimator. However, in practice it exhibits too high variance to be useful.