An explanation of the Variational Auto-Encoder (VAE).
Generative models are one of the two main model families in deep learning (the other being discriminative models): they can generate data by sampling from an approximation of the data distribution. This requires the model to capture, or at least be able to simulate, the distribution of the given data. One straightforward approach is to model the density of the data with a neural network \(p_{\theta}(x)\); this class of models is called likelihood-based generative models. The objective is to maximize the likelihood function indexed by the set of parameters \(\theta\):
\[\begin{align*} \max_{\theta} \sum_{i} \log p_{\theta}(x^{(i)}) \end{align*}\]The problem, from here, is to choose an architecture that can not only compute the likelihood \(p_{\theta}(x)\) efficiently for training but is also easy to sample from. There are multiple ways to achieve this, such as autoregressive models.
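As a minimal sketch of what likelihood-based training looks like in practice (the toy diagonal-Gaussian model and the synthetic data below are illustrative assumptions, not part of the original text), one can maximize the log-likelihood directly with gradient descent:

```python
# Minimal sketch: maximum-likelihood training of a toy density model p_theta(x).
# The diagonal-Gaussian model and the synthetic 2-D data are illustrative placeholders.
import torch

x = torch.randn(1024, 2) * 2.0 + 1.0             # toy dataset
mu = torch.zeros(2, requires_grad=True)          # learnable mean
log_sigma = torch.zeros(2, requires_grad=True)   # learnable log-std (keeps sigma > 0)

opt = torch.optim.Adam([mu, log_sigma], lr=1e-2)
for step in range(2000):
    p_theta = torch.distributions.Normal(mu, log_sigma.exp())
    nll = -p_theta.log_prob(x).sum(dim=1).mean() # average negative log-likelihood
    opt.zero_grad()
    nll.backward()
    opt.step()
```

Minimizing the negative log-likelihood here is exactly the maximization objective above, just restricted to a density that is trivially tractable.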
VAE
Technically, in the case of a discrete latent variable \(z\), \(p_{\theta}(x)\) can be written as \(\sum_{z}p_{\theta}(z)p_{\theta}(x|z)\). Unfortunately, in many real-world problems \(z\) is continuous, and the challenge becomes computing \(\int_{z}p_{\theta}(z)p_{\theta}(x|z)dz\). In that case the likelihood \(p_{\theta}(x)\) cannot be computed exactly and must be approximated. In the next section, we explore one such technique, called variational inference.
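To make the discrete case above concrete, here is a small worked example (added for illustration, not part of the original derivation): if \(z\) takes one of \(K\) values and each conditional is Gaussian, the marginal likelihood is a finite mixture that can be evaluated exactly: \[\begin{align*} p_{\theta}(x) = \sum_{k=1}^{K} p_{\theta}(z=k)\, p_{\theta}(x|z=k) = \sum_{k=1}^{K} \pi_{k}\, \mathcal{N}(x; \mu_{k}, \Sigma_{k}) \end{align*}\]No such finite sum exists when \(z\) is continuous and \(p_{\theta}(x|z)\) is parameterized by a neural network, which is why an approximation is needed.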
Applying Jensen's inequality to the log-likelihood above, we obtain the evidence lower bound: \[\begin{align*} \log{p_{\theta}(x^{(i)})} &= \log \int_{z} p_{\theta}(x^{(i)}, z) dz \\ &= \log \int_{z} p_{\theta}(x^{(i)}, z) \frac{q_{\phi}(z|x^{(i)})}{q_{\phi}(z|x^{(i)})} dz \\ &= \log \mathbb{E}_{q_{\phi}(z|x^{(i)})} \frac{p_{\theta}(x^{(i)}, z)}{q_{\phi}(z|x^{(i)})} \\ &\geq \mathbb{E}_{q_{\phi}(z|x^{(i)})} \log \left[ \frac{p_{\theta}(x^{(i)}, z)}{q_{\phi}(z|x^{(i)})} \right] \textnormal{(Jensen's inequality)}\\ &= \mathbb{E}_{q_{\phi}(z|x^{(i)})} \left[\log p_{\theta}(x^{(i)}, z) - \log q_{\phi}(z | x^{(i)}) \right]\\ &= \mathcal{L}(\theta, \phi, x^{(i)}) \end{align*}\]
Here \(q_{\phi}(z|x)\) is called the recognition model and \(\mathcal{L}(\theta, \phi, x^{(i)})\) is called the evidence lower bound (ELBO) of the log-likelihood. To calculate the ELBO, we need to evaluate the expectation over the recognition distribution. The naive Monte Carlo estimator, \(\mathbb{E}_{q_{\phi}}[f(z)] \approx \frac{1}{K} \sum_{k=1}^{K} f(z^{(k)})\) with \(z^{(k)} \sim q_{\phi}(z|x)\), leads to gradient estimates with very high variance, which makes optimization impractical. In the next section, the reparameterization trick is introduced to solve this problem.
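To see what this estimator looks like in code, here is a rough sketch (the choices of \(f\) and \(q\) are arbitrary and only serve as an illustration): with a small number of samples \(K\), repeated runs give noticeably different values, and gradient estimators built on top of such samples inherit this noisiness.

```python
# Sketch of the naive Monte Carlo estimate E_q[f(z)] ~= (1/K) * sum_k f(z^(k)).
# f and q are toy choices used only to illustrate the estimator's variance.
import torch

def mc_estimate(f, q, K):
    z = q.sample((K,))        # z^(1), ..., z^(K) ~ q(z)
    return f(z).mean()        # (1/K) * sum_k f(z^(k))

q = torch.distributions.Normal(0.0, 1.0)
f = lambda z: z ** 2          # true expectation under q is 1.0

print([round(mc_estimate(f, q, K=10).item(), 3) for _ in range(5)])
# prints a different value each run, scattered around 1.0
```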
Given a random variable \(z \sim q_{\phi}(z|x)\), it is often possible to express \(z\) as a deterministic transformation \(z = g_{\phi}(\epsilon, x)\), where \(\epsilon\) is an auxiliary noise variable with an independent marginal distribution \(p(\epsilon)\) and \(g_{\phi}\) is a deterministic function parameterized by \(\phi\).
For example, assume that \(q_{\phi}(z|x) = \mathcal{N}(z; \mu, \sigma^{2})\). Then \(z\) can be expressed as \(z = \mu + \sigma \epsilon\) where \(\epsilon \sim \mathcal{N}(0, 1)\); the same construction applies coordinate-wise to a diagonal Gaussian with \(\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\).
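A minimal sketch of this trick in code, assuming a diagonal Gaussian \(q_{\phi}(z|x)\); the tensors `mu` and `log_var` stand in for the outputs of an encoder network (an assumption made for illustration):

```python
# Sketch of the reparameterization trick for a diagonal Gaussian q_phi(z|x).
# `mu` and `log_var` stand in for the outputs of an encoder network.
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    std = torch.exp(0.5 * log_var)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)      # eps ~ N(0, I), independent of phi
    return mu + std * eps            # z = mu + sigma * eps, differentiable in mu and sigma

mu = torch.zeros(4, 20, requires_grad=True)       # toy batch of 4, latent dimension 20
log_var = torch.zeros(4, 20, requires_grad=True)
z = reparameterize(mu, log_var)
z.sum().backward()                   # gradients flow back to mu and log_var
```

Because `eps` carries all the randomness, the sample `z` is a differentiable function of `mu` and `log_var`, which is what lets gradients of the ELBO reach \(\phi\).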
Using the above trick, the ELBO can be rewritten as:
\[\begin{align*} \mathcal{L}(\theta, \phi, x^{(i)}) &= \mathbb{E}_{q_{\phi}(z|x^{(i)})} \left[\log p_{\theta}(x^{(i)}, z) - \log q_{\phi}(z | x^{(i)}) \right]\\ &\approx \frac{1}{L} \sum_{l=1}^{L} \left[ \log p_{\theta}(x^{(i)}, g_{\phi}(\epsilon^{(l)}, x^{(i)})) - \log q_{\phi}(g_{\phi}(\epsilon^{(l)}, x^{(i)}) | x^{(i)})\right] \end{align*}\]where \(\epsilon^{(l)} \sim p(\epsilon)\). Since \(p(\epsilon)\) does not depend on \(\phi\), sampling is easy and the resulting gradient estimator has low variance. From here, optimizing the ELBO is as routine as optimizing any other loss function with gradient descent. In the next section, another perspective on this loss function is investigated, which illustrates the connection to auto-encoders.
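Putting the pieces together, here is a sketch of the single-sample (\(L=1\)) estimator above, assuming a diagonal Gaussian posterior, a standard normal prior \(p_{\theta}(z)\), and a hypothetical callable `decoder_log_prob(x, z)` that returns \(\log p_{\theta}(x|z)\):

```python
# Sketch: one-sample Monte Carlo estimate of the ELBO,
#   L(theta, phi, x) ~= log p_theta(x, z) - log q_phi(z | x),  with z = mu + sigma * eps.
# `decoder_log_prob` is a hypothetical stand-in for the decoder's log-likelihood.
import torch

def elbo_estimate(x, mu, log_var, decoder_log_prob):
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    z = mu + std * eps                                    # reparameterized sample
    q = torch.distributions.Normal(mu, std)
    prior = torch.distributions.Normal(torch.zeros_like(mu), torch.ones_like(std))
    log_p_xz = decoder_log_prob(x, z) + prior.log_prob(z).sum(-1)  # log p(x|z) + log p(z)
    log_q = q.log_prob(z).sum(-1)                         # log q(z|x)
    return (log_p_xz - log_q).mean()                      # averaged over the batch
```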
Further transforming the evidence lower bound gives another view of this loss function:
\[\begin{align*} \mathcal{L}(\theta, \phi, x^{(i)}) &= \mathbb{E}_{q_{\phi}(z|x^{(i)})} \left[\log p_{\theta}(x^{(i)}, z) - \log q_{\phi}(z | x^{(i)}) \right]\\ &= \mathbb{E}_{q_{\phi}(z|x^{(i)})} \left[\log p_{\theta}(x^{(i)} | z) + \log p_{\theta}(z) - \log q_{\phi}(z | x^{(i)}) \right]\\ &= \mathbb{E}_{q_{\phi}(z|x^{(i)})} \log p_{\theta}(x^{(i)}|z) - \mathbb{E}_{q_{\phi}(z|x^{(i)})} \log \frac{q_{\phi}(z | x^{(i)})}{p_{\theta}(z)} \\ &= \mathbb{E}_{q_{\phi}(z|x^{(i)})} \log p_{\theta}(x^{(i)}|z) - \mathrm{D}_{KL}(q_{\phi}(z|x^{(i)}) || p_{\theta}(z)) \end{align*}\]The first term is the expected reconstruction log-likelihood (its negative is the usual reconstruction loss), and the second term, the KL divergence between the approximate posterior and the prior, acts as a regularizer. Because \(q_{\phi}(z|x)\) plays the role of an encoder and \(p_{\theta}(x|z)\) the role of a decoder, the model can be viewed as a variational auto-encoder.
Optimizing the loss function under this view proceeds in the same way as the method described above; for a Gaussian \(q_{\phi}(z|x)\) and a standard normal prior, the KL term even has a closed-form expression.
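As a concrete sketch of this reconstruction-plus-KL form (the tiny single-layer encoder/decoder, the Bernoulli likelihood, and the tensor sizes are illustrative assumptions, not the original paper's architecture), using the closed-form KL for a diagonal Gaussian posterior against a standard normal prior:

```python
# Sketch of the VAE loss as reconstruction term + KL regularizer, assuming a
# Bernoulli decoder and a diagonal Gaussian q_phi(z|x) with a N(0, I) prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # outputs mu and log_var
        self.dec = nn.Linear(z_dim, x_dim)       # outputs Bernoulli logits

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterize
        return self.dec(z), mu, log_var

def vae_loss(logits, x, mu, log_var):
    # Reconstruction term: -E_q[log p_theta(x|z)] for a Bernoulli decoder
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL(q_phi(z|x) || N(0, I)), closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl                            # negative ELBO, to be minimized

x = torch.rand(8, 784)                           # toy batch with values in [0, 1]
logits, mu, log_var = TinyVAE()(x)
loss = vae_loss(logits, x, mu, log_var)
```

Minimizing this loss is the same as maximizing the ELBO estimate from the previous section; only the KL term has been replaced by its analytic expression.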
Auto-Encoding Variational Bayes (AEVB) introduced the recognition model \(q_{\phi}(z|x)\) to approximate the true posterior \(p_{\theta}(z|x)\), which is intractable in the general case. In addition, to efficiently estimate the loss, AEVB proposed the reparameterization trick, which allows sampling from the recognition model \(q_{\phi}(z|x)\) with low-variance gradient estimates.