The fundamental difference between VAEs and other types of autoencoders is that while most autoencoders learn discrete latent space models, VAEs learn continuous latent variable models. Rather than a single encoding vector for the latent space, VAEs model two different vectors: a vector of means, “μ,” and a vector of standard deviations, “σ.” Because these vectors capture latent attributes as a probability distribution—that is, they learn a stochastic encoding rather than a deterministic encoding—VAEs allow for interpolation and random sampling, greatly expanding their capabilities and use cases. This means that VAEs are generative AI models.
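As a rough sketch of what this looks like in practice (written in PyTorch, with hypothetical layer sizes and class names), the encoder of a VAE outputs two vectors rather than one:

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Maps an input to the parameters of a Gaussian latent distribution."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=16):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # Instead of a single encoding vector, the encoder produces two vectors:
        # a mean (mu) and a log-variance (from which sigma is recovered).
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.logvar_head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu_head(h), self.logvar_head(h)
```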
In simpler terms, VAEs learn to encode the important features of the data they’re trained on in a flexible, approximate way that allows them to generate new samples resembling the original training data. The loss function used to minimize reconstruction error is regularized by the KL divergence between the distribution of latent variables learned by the encoder (the approximate posterior) and a chosen prior distribution over the latent space, typically a standard Gaussian. This regularized loss function enables VAEs to generate new samples that resemble the data they were trained on while avoiding overfitting, which in this case would mean generating new samples that are near-identical copies of the original data.
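A minimal sketch of this regularized loss, assuming a standard-normal prior and a decoder that outputs values in [0, 1] (the function name and beta weight are illustrative, not part of the source):

```python
import torch
import torch.nn.functional as F

def vae_loss(x_reconstructed, x, mu, logvar, beta=1.0):
    # Reconstruction term: how well the decoder reproduces the input.
    recon = F.binary_cross_entropy(x_reconstructed, x, reduction="sum")
    # KL term: closed-form divergence between the learned Gaussian
    # N(mu, sigma^2) and a standard-normal prior N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # beta weights the regularizer; beta=1 gives the standard VAE objective.
    return recon + beta * kl
```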
To generate a new sample, the VAE samples a random latent vector (ε) from the unit Gaussian (in other words, it selects a random starting point from within the standard normal distribution), scales it by the standard deviation of the latent distribution (σ) and shifts it by the mean of the latent distribution (μ). This process, called the reparameterization trick,5 avoids directly sampling from the variational distribution: because random sampling has no derivative, it cannot be backpropagated through, but by isolating all the randomness in ε, the trick allows gradients to flow through μ and σ during training.
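The reparameterization trick reduces to a few lines of code. A minimal sketch (assuming the encoder outputs a log-variance, as in the earlier snippet):

```python
import torch

def reparameterize(mu, logvar):
    # Recover sigma from the log-variance (exponentiating keeps it positive).
    sigma = torch.exp(0.5 * logvar)
    # Draw epsilon from a unit Gaussian; all randomness lives here,
    # so gradients can flow through mu and sigma during backpropagation.
    epsilon = torch.randn_like(sigma)
    # Shift and scale: z = mu + sigma * epsilon
    return mu + sigma * epsilon
```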
When a VAE is being used for generative tasks, the encoder can often be discarded after training. More advanced evolutions of VAEs, like conditional VAEs, give a user more control over generated samples by providing a conditioning input, such as a class label, that guides the output of the decoder.
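To illustrate why the encoder can be discarded, generation only requires sampling latent vectors from the prior and passing them through the trained decoder. A hedged sketch, with hypothetical names and dimensions:

```python
import torch

@torch.no_grad()
def generate(decoder, num_samples=8, latent_dim=16):
    # After training, new data is produced without the encoder:
    # sample latent vectors from the standard-normal prior and decode them.
    z = torch.randn(num_samples, latent_dim)
    return decoder(z)
```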