An explanation of the Variational Auto-Encoder (VAE).
Generative models are one of the two main model families in deep learning (the other being discriminative models): they can generate data by sampling from an approximation of the data distribution. This requires the model to capture, or at least be able to simulate, the distribution of the given data. One straightforward approach is to model the density of the data with a neural network \(p_{\theta}(x)\); this class of models is called likelihood-based generative models. The objective is to maximize the likelihood function indexed by the set of parameters \(\theta\):
\[\begin{align*} \max_{\theta} \sum_{i} \log p_{\theta}(x^{(i)}) \end{align*}\]The problem, from here, is to choose an architecture that can not only compute the likelihood \(p_{\theta}(x)\) efficiently for training but is also easy to sample from. There are multiple ways to achieve this, such as autoregressive models.
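As a minimal sketch of what likelihood-based training looks like in practice (the toy diagonal-Gaussian model and the synthetic data below are illustrative assumptions, not part of the original text), one can maximize the log-likelihood directly with gradient descent:

```python
# Minimal sketch: maximum-likelihood training of a toy density model p_theta(x).
# The diagonal-Gaussian model and the synthetic 2-D data are illustrative placeholders.
import torch

x = torch.randn(1024, 2) * 2.0 + 1.0             # toy dataset
mu = torch.zeros(2, requires_grad=True)          # learnable mean
log_sigma = torch.zeros(2, requires_grad=True)   # learnable log-std (keeps sigma > 0)

opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)
for step in range(2000):
    p_theta = torch.distributions.Normal(mu, log_sigma.exp())
    nll = -p_theta.log_prob(x).sum(dim=1).mean() # average negative log-likelihood
    opt.zero_grad()
    nll.backward()
    opt.step()
```

Minimizing the negative log-likelihood here is exactly the maximization objective above, just restricted to a density that is trivially tractable.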
VAE
Technically, in the case of a discrete latent variable \(z\), \(p_{\theta}(x)\) can be written as \(\sum_{z}p_{\theta}(z)p_{\theta}(x|z)\). Unfortunately, in many real-world problems \(z\) is continuous, and the challenge becomes computing \(\int_{z}p_{\theta}(z)p_{\theta}(x|z)dz\). In that case the likelihood \(p_{\theta}(x)\) cannot be computed exactly and must be approximated. In the next section, we explore one such technique, called variational inference.
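To make the discrete case above concrete, here is a small worked example (added for illustration, not part of the original derivation): if \(z\) takes one of \(K\) values and each conditional is Gaussian, the marginal likelihood is a finite mixture that can be evaluated exactly: \[\begin{align*} p_{\theta}(x) = \sum_{k=1}^{K} p_{\theta}(z=k)\, p_{\theta}(x|z=k) = \sum_{k=1}^{K} \pi_{k}\, \mathcal{N}(x; \mu_{k}, \Sigma_{k}) \end{align*}\]No such finite sum exists when \(z\) is continuous and \(p_{\theta}(x|z)\) is parameterized by a neural network, which is why an approximation is needed.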
Applying Jensen's inequality to the log-likelihood above, we obtain the evidence lower bound: \[\begin{align*} \log{p_{\theta}(x^{(i)})} &= \log \int_{z} p_{\theta}(x^{(i)}, z) dz \\ &= \log \int_{z} p_{\theta}(x^{(i)}, z) \frac{q_{\phi}(z|x^{(i)})}{q_{\phi}(z|x^{(i)})} dz \\ &= \log \mathbb{E}_{q_{\phi}(z|x^{(i)})} \frac{p_{\theta}(x^{(i)}, z)}{q_{\phi}(z|x^{(i)})} \\ &\geq \mathbb{E}_{q_{\phi}(z|x^{(i)})} \log \left[ \frac{p_{\theta}(x^{(i)}, z)}{q_{\phi}(z|x^{(i)})} \right] \textnormal{(Jensen's inequality)}\\ &= \mathbb{E}_{q_{\phi}(z|x^{(i)})} \left[\log p_{\theta}(x^{(i)}, z) - \log q_{\phi}(z | x^{(i)}) \right]\\ &= \mathcal{L}(\theta, \phi, x^{(i)}) \end{align*}\]
Here \(q_{\phi}(z|x)\) is called the recognition model and \(\mathcal{L}(\theta, \phi, x^{(i)})\) is called the evidence lower bound (ELBO) of the log-likelihood. To calculate the ELBO, we need to evaluate the expectation over the recognition distribution. The naive Monte Carlo estimator, \(\mathbb{E}_{q_{\phi}}[f(z)] \approx \frac{1}{K} \sum_{k=1}^{K} f(z^{(k)})\) with \(z^{(k)} \sim q_{\phi}(z|x)\), leads to gradient estimates with very high variance, which makes optimization impractical. In the next section, the reparameterization trick is introduced to solve this problem.
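To see what this estimator looks like in code, here is a rough sketch (the choices of \(f\) and \(q\) are arbitrary and only serve as an illustration): with a small number of samples \(K\), repeated runs give noticeably different values, and gradient estimators built on top of such samples inherit this noisiness.

```python
# Sketch of the naive Monte Carlo estimate E_q[f(z)] ~= (1/K) * sum_k f(z^(k)).
# f and q are toy choices used only to illustrate the estimator's variance.
import torch

def mc_estimate(f, q, K):
    z = q.sample((K,))        # z^(1), ..., z^(K) ~ q(z)
    return f(z).mean()        # (1/K) * sum_k f(z^(k))

q = torch.distributions.Normal(0.0, 1.0)
f = lambda z: z ** 2          # true expectation under q is 1.0

print([round(mc_estimate(f, q, K=10).item(), 3) for _ in range(5)])
# prints a different value each run, scattered around 1.0
```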
Given a random variable \(z \sim q_{\phi}(z|x)\), it is often possible to express \(z\) as a deterministic transformation \(z = g_{\phi}(\epsilon, x)\), where \(\epsilon\) is an auxiliary noise variable with an independent marginal distribution \(p(\epsilon)\) and \(g_{\phi}\) is a deterministic function parameterized by \(\phi\).
For example, assume that \(q_{\phi}(z|x) = \mathcal{N}(z; \mu, \sigma^{2})\). Then \(z\) can be expressed as \(z = \mu + \sigma \epsilon\) where \(\epsilon \sim \mathcal{N}(0, 1)\); the same construction applies coordinate-wise to a diagonal Gaussian with \(\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\).
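A minimal sketch of this trick in code, assuming a diagonal Gaussian \(q_{\phi}(z|x)\); the tensors `mu` and `log_var` stand in for the outputs of an encoder network (an assumption made for illustration):

```python
# Sketch of the reparameterization trick for a diagonal Gaussian q_phi(z|x).
# `mu` and `log_var` stand in for the outputs of an encoder network.
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    std = torch.exp(0.5 * log_var)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)      # eps ~ N(0, I), independent of phi
    return mu + std * eps            # z = mu + sigma * eps, differentiable in mu and sigma

mu = torch.zeros(4, 20, requires_grad=True)       # toy batch of 4, latent dimension 20
log_var = torch.zeros(4, 20, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()                   # gradients flow back to mu and log_var
```

Because `eps` carries all the randomness, the sample `z` is a differentiable function of `mu` and `log_var`, which is what lets gradients of the ELBO reach \(\phi\).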
Using the above trick, the ELBO can be rewritten as:
\[\begin{align*} \mathcal{L}(\theta, \phi, x^{(i)}) &= \mathbb{E}_{q_{\phi}(z|x^{(i)})} \left[\log p_{\theta}(x^{(i)}, z) - \log q_{\phi}(z | x^{(i)}) \right]\\ &\approx \frac{1}{L} \sum_{l=1}^{L} \left[ \log p_{\theta}(x^{(i)}, g_{\phi}(\epsilon^{(l)}, x^{(i)})) - \log q_{\phi}(g_{\phi}(\epsilon^{(l)}, x^{(i)}) | x^{(i)})\right] \end{align*}\]where \(\epsilon^{(l)} \sim p(\epsilon)\). Since \(p(\epsilon)\) does not depend on \(\phi\), sampling is easy and the resulting gradient estimator has low variance. From here, optimizing the ELBO is as routine as optimizing any other loss function with gradient descent. In the next section, another perspective on this loss function is investigated, which illustrates the connection to auto-encoders.
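Putting the pieces together, here is a sketch of the single-sample (\(L=1\)) estimator above, assuming a diagonal Gaussian posterior, a standard normal prior \(p_{\theta}(z)\), and a hypothetical callable `decoder_log_prob(x, z)` that returns \(\log p_{\theta}(x|z)\):

```python
# Sketch: one-sample Monte Carlo estimate of the ELBO,
#   L(theta, phi, x) ~= log p_theta(x, z) - log q_phi(z | x),  with z = mu + sigma * eps.
# `decoder_log_prob` is a hypothetical stand-in for the decoder's log-likelihood.
import torch

def elbo_estimate(x, mu, log_var, decoder_log_prob):
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    z = mu + std * eps                                    # reparameterized sample
    q = torch.distributions.Normal(mu, std)
    prior = torch.distributions.Normal(torch.zeros_like(mu), torch.ones_like(std))
    log_p_xz = decoder_log_prob(x, z) + prior.log_prob(z).sum(-1)  # log p(x|z) + log p(z)
    log_q = q.log_prob(z).sum(-1)                         # log q(z|x)
    return (log_p_xz - log_q).mean()                      # averaged over the batch
```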
Further transforming the evidence lower bound gives another view of this loss function:
\[\begin{align*} \mathcal{L}(\theta, \phi, x^{(i)}) &= \mathbb{E}_{q_{\phi}(z|x^{(i)})} \left[\log p_{\theta}(x^{(i)}, z) - \log q_{\phi}(z | x^{(i)}) \right]\\ &= \mathbb{E}_{q_{\phi}(z|x^{(i)})} \left[\log p_{\theta}(x^{(i)} | z) + \log p_{\theta}(z) - \log q_{\phi}(z | x^{(i)}) \right]\\ &= \mathbb{E}_{q_{\phi}(z|x^{(i)})} \log p_{\theta}(x^{(i)}|z) - \mathbb{E}_{q_{\phi}(z|x^{(i)})} \log \frac{q_{\phi}(z | x^{(i)})}{p_{\theta}(z)} \\ &= \mathbb{E}_{q_{\phi}(z|x^{(i)})} \log p_{\theta}(x^{(i)}|z) - \mathrm{D}_{KL}(q_{\phi}(z|x^{(i)}) || p_{\theta}(z)) \end{align*}\]The first term is the expected reconstruction log-likelihood (its negative is the usual reconstruction loss), and the second term, the KL divergence between the approximate posterior and the prior, acts as a regularizer. Because \(q_{\phi}(z|x)\) plays the role of an encoder and \(p_{\theta}(x|z)\) the role of a decoder, the model can be viewed as a variational auto-encoder.
Optimizing the loss function under this view proceeds in the same way as the method described above; for a Gaussian \(q_{\phi}(z|x)\) and a standard normal prior, the KL term even has a closed-form expression.
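As a concrete sketch of this reconstruction-plus-KL form (the tiny single-layer encoder/decoder, the Bernoulli likelihood, and the tensor sizes are illustrative assumptions, not the original paper's architecture), using the closed-form KL for a diagonal Gaussian posterior against a standard normal prior:

```python
# Sketch of the VAE loss as reconstruction term + KL regularizer, assuming a
# Bernoulli decoder and a diagonal Gaussian q_phi(z|x) with a N(0, I) prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # outputs mu and log_var
        self.dec = nn.Linear(z_dim, x_dim)       # outputs Bernoulli logits

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterize
        return self.dec(z), mu, log_var

def vae_loss(logits, x, mu, log_var):
    # Reconstruction term: -E_q[log p_theta(x|z)] for a Bernoulli decoder
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL(q_phi(z|x) || N(0, I)), closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl                            # negative ELBO, to be minimized

x = torch.rand(8, 784)                           # toy batch with values in [0, 1]
logits, mu, log_var = TinyVAE()(x)
loss = vae_loss(logits, x, mu, log_var)
```

Minimizing this loss is the same as maximizing the ELBO estimate from the previous section; only the KL term has been replaced by its analytic expression.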
Auto-Encoding Variational Bayes (AEVB) introduced the recognition model \(q_{\phi}(z|x)\) to approximate the true posterior \(p_{\theta}(z|x)\), which is intractable in the general case. In addition, to efficiently estimate the loss, AEVB proposed the reparameterization trick, which allows sampling from the recognition model \(q_{\phi}(z|x)\) with low-variance gradient estimates.