Variational autoencoder

From Simple English Wikipedia, the free encyclopedia

In machine learning, a variational autoencoder (VAE) is a generative model, meaning that it can generate things that it has not seen before. It combines artificial neural networks with variational inference. At a high level, it optimizes a mathematical quantity called the Evidence Lower Bound (ELBO). It was introduced by Diederik P. Kingma and Max Welling.[1]

People often confuse variational autoencoders with simple autoencoders. A simple autoencoder is a neural network shaped like an hourglass: it has an input layer, a small middle layer, and an output layer. Because the middle layer is very small, the network must compress the information that passes through it into only a few numbers. This small middle part is called the encoding.
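The hourglass shape can be sketched in a few lines of Python. This is only an illustration, not a real trained network: the sizes (8 inputs, a 2-number encoding) and the names `W_in` and `W_out` are made up for this example, and the weights are random instead of learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up hourglass shape: 8 inputs, a 2-number encoding, 8 outputs.
# The weights are random here; a real autoencoder would learn them.
W_in = rng.normal(size=(2, 8))   # input layer -> small middle layer
W_out = rng.normal(size=(8, 2))  # small middle layer -> output layer

def autoencode(x):
    encoding = W_in @ x                # squeeze the input into a few numbers
    reconstruction = W_out @ encoding  # try to rebuild the input from them
    return encoding, reconstruction

x = rng.normal(size=8)
encoding, reconstruction = autoencode(x)
```

The bottleneck is visible in the shapes: the 8 input numbers must pass through only 2 numbers before the output is rebuilt.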

A variational autoencoder looks similar, but it is different. Instead of one small middle part, it has two separate encoder layers, like two funnels that are joined back together later by a sampling step. Both encoder layers look at the input, and both create small encodings. One encoding is interpreted as the mean, and the other as the variance. These words come from probability theory and statistics, where they are used to define probability distributions. During training, a special loss function called the Kullback-Leibler divergence pushes these values towards a predefined probability distribution, usually a normal (bell-shaped) distribution with a mean of zero and a variance of one.
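The Kullback-Leibler divergence between the encoder's normal distribution and the target standard normal distribution (mean zero, variance one) has a simple closed form. The sketch below assumes one mean number and one variance number per encoding dimension; the function name is made up for this example.

```python
import math

def kl_to_standard_normal(mean, var):
    # Closed-form KL divergence from N(mean, var) to N(0, 1),
    # summed over the dimensions of the encoding.
    return sum(0.5 * (v + m * m - 1.0 - math.log(v))
               for m, v in zip(mean, var))

# The penalty is exactly zero when the encoder already outputs
# the target distribution (mean 0, variance 1):
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # → 0.0
```

Any mean away from zero, or any variance away from one, makes this number grow, which is how training pushes the encoder towards the chosen distribution.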

To define a probability distribution like the normal distribution, you need two sets of numbers: the mean and the variance, which together are called the parameters of the normal distribution. For this reason, the variational autoencoder has two separate encoder layers (for simplicity, one can also think of them as entirely separate neural networks). A third neural network is also trained. This third part is called the decoder, or the generative model. It turns small encodings into a large output. In other words, it generates data instead of compressing it.
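The two encoder layers and the decoder can be sketched as three small linear maps. This is only an illustration with made-up sizes and random, untrained weights; real VAEs use deeper networks, and the exponential on the variance output is one common choice (not stated above) for keeping the variance positive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sizes: 8-dimensional inputs, 2-dimensional encodings.
W_mean = rng.normal(size=(2, 8))  # first encoder layer: outputs the mean
W_var = rng.normal(size=(2, 8))   # second encoder layer: outputs the variance
W_dec = rng.normal(size=(8, 2))   # decoder: small encoding -> large output

def encode(x):
    # Two separate "funnels" look at the same input.
    # The exponential keeps the variance positive.
    return W_mean @ x, np.exp(W_var @ x)

def decode(z):
    # The generative model: small encoding in, full-size data out.
    return W_dec @ z

x = rng.normal(size=8)
mean, var = encode(x)
output = decode(mean)
```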

The decoder does not use the mean and variance directly. First, a random number is chosen. This random number is multiplied by the square root of the variance and then added to the mean. The result is a sample from the probability distribution that the two encoders have learned. The sample is then passed into the decoder. This step is done in this way because it separates the randomness from the learning. It is called the reparameterization trick, because it separates the parameters from the randomness that is added to create samples. Because the randomness is kept in one clear place, gradients can still flow through the mean and the variance during backpropagation, so the loss function works correctly. This makes training stable, and the decoder can produce good-quality outputs. The network can learn useful encodings while still being able to generate new data.
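The shift-and-scale step above can be written in a few lines; the function name is made up for this example.

```python
import math
import random

def sample_encoding(mean, var, rng=random):
    # Draw the randomness separately: eps comes from a standard normal.
    # Then scale it by the square root of the variance and add the mean.
    # In a real framework, gradients flow through mean and var, not eps.
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0)
            for m, v in zip(mean, var)]
```

Setting the variance to zero makes the randomness disappear and the sample equal the mean, which shows how the parameters and the noise are kept separate.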

The loss function is completed by the conditional data distribution, also called the noise distribution. This part of the loss measures how likely it is that the decoder reproduces the original input from a sampled encoding. Together with the Kullback-Leibler divergence, which compares the learned distribution to the chosen normal distribution, this forms the full loss function of the variational autoencoder. By optimizing this loss, the model learns both to reconstruct input data accurately and to generate new data that follows the same overall structure as the training data.
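Putting the two parts together gives the full loss. The sketch below assumes a Gaussian noise distribution, a common concrete choice (not stated above) that turns the reconstruction term into a squared error; the function name is made up for this example.

```python
import math

def vae_loss(x, reconstruction, mean, var):
    # Reconstruction term: how far the decoder's output is from the input.
    # Squared error corresponds to assuming Gaussian noise.
    reconstruction_error = sum((a - b) ** 2 for a, b in zip(x, reconstruction))
    # KL term: how far the learned distribution is from N(0, 1).
    kl = sum(0.5 * (v + m * m - 1.0 - math.log(v))
             for m, v in zip(mean, var))
    return reconstruction_error + kl
```

The loss is zero only when the input is rebuilt perfectly and the encoder outputs exactly the target distribution; training trades these two goals off against each other.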

References

  1. Kingma, Diederik P.; Welling, Max (2022-12-10). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [cs, stat].