Consider the following case: Let \(\textbf{X} = \{\textbf{x}^{(i)}\}_{i=1}^N\) denote a dataset consisting of \(N\) i.i.d. samples, where each observed datapoint \(\textbf{x}^{(i)}\) is generated by a two-step process: first, a latent (hidden) variable \(\textbf{z}^{(i)}\) is sampled from a prior distribution \(p_{\boldsymbol{\theta}} (\textbf{z})\); then, \(\textbf{x}^{(i)}\) is sampled from a conditional distribution \(p_{\boldsymbol{\theta}} \left(\textbf{x} | \textbf{z}^{(i)}\right)\).
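To make the setup concrete, here is a minimal numerical sketch of this two-step sampling process, assuming (purely for illustration) a standard normal prior, a fixed linear map `decoder_mean` standing in for whatever model defines \(p_{\boldsymbol{\theta}}(\textbf{x}|\textbf{z})\), and Gaussian observation noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_mean(z):
    # Hypothetical decoder: maps a latent z to the mean of p_theta(x | z).
    # A fixed affine map stands in for whatever parametric model is used.
    W = np.array([[1.0, -0.5],
                  [0.3,  2.0],
                  [0.7,  0.1]])   # 3 observed dims, 2 latent dims
    return W @ z

N, d_z, sigma_x = 5, 2, 0.1
X = []
for _ in range(N):
    z_i = rng.standard_normal(d_z)                               # z^(i) ~ p_theta(z) = N(0, I)
    x_i = decoder_mean(z_i) + sigma_x * rng.standard_normal(3)   # x^(i) ~ p_theta(x | z^(i))
    X.append(x_i)
X = np.stack(X)   # dataset X of N i.i.d. samples x^(1), ..., x^(N)
```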
The evidence lower bound (ELBO) \(\mathcal{L} \left( \boldsymbol{\theta}, \boldsymbol{\phi}; \textbf{x}^{(i)}\right)\) (also called the variational lower bound), given by
\[ \mathcal{L} \left( \boldsymbol{\theta}, \boldsymbol{\phi}; \textbf{x}^{(i)} \right) = - D_{KL} \left( q_{\boldsymbol{\phi}} \left( \textbf{z} | \textbf{x}^{(i)} \right) || p_{\boldsymbol{\theta}} (\textbf{z})\right) + \mathbb{E}_{q_{\boldsymbol{\phi}} \left(\textbf{z}|\textbf{x}^{(i)}\right)} \left[ \log p_{\boldsymbol{\theta}} \left( \textbf{x}^{(i)} | \textbf{z}\right) \right], \]
defines a lower bound on the log-evidence \(\log p_{\boldsymbol{\theta}}(\textbf{x}^{(i)})\) given a variational approximation \(q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x}^{(i)} \right)\) of the true posterior \(p_{\boldsymbol{\theta}} \left( \textbf{z} | \textbf{x}^{(i)}\right)\), i.e.,
\[ \begin{align} \log p_{\boldsymbol{\theta}} (\textbf{x}^{(i)}) &= \underbrace{D_{KL} \left( q_{\boldsymbol{\phi}} \left( \textbf{z} | \textbf{x}^{(i)} \right) || p_{\boldsymbol{\theta}} \left( \textbf{z} | \textbf{x}^{(i)}\right)\right)}_{\ge 0} + \mathcal{L} \left( \boldsymbol{\theta}, \boldsymbol{\phi}; \textbf{x}^{(i)}\right)\\ &\Rightarrow \mathcal{L} \left( \boldsymbol{\theta}, \boldsymbol{\phi}; \textbf{x}^{(i)} \right) \le \log p_{\boldsymbol{\theta}} (\textbf{x}^{(i)}), \end{align} \] where \(p_{\boldsymbol{\theta}} (\textbf{z})\) is a prior over the latent variables. Note that the true posterior \(p_{\boldsymbol{\theta}} \left( \textbf{z} | \textbf{x}^{(i)}\right)\) (which is often unknown) does not appear in the ELBO!
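As a rough sketch of how the ELBO is evaluated in practice, the snippet below assumes a diagonal Gaussian \(q_{\boldsymbol{\phi}}(\textbf{z}|\textbf{x}^{(i)})\) and a standard normal prior \(p_{\boldsymbol{\theta}}(\textbf{z})\), so the KL term has a closed form, and estimates the expected reconstruction term by Monte Carlo; `log_lik_fn` is a hypothetical stand-in for \(\log p_{\boldsymbol{\theta}}(\textbf{x}^{(i)}|\textbf{z})\):

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo(mu_q, logvar_q, log_lik_fn, n_samples=32):
    """Monte Carlo estimate of the ELBO for one datapoint, assuming
    q_phi(z|x) = N(mu_q, diag(exp(logvar_q))) and p_theta(z) = N(0, I).
    `log_lik_fn(z)` is a hypothetical callable returning log p_theta(x^(i) | z)."""
    # KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form, summed over latent dims.
    kl = 0.5 * np.sum(np.exp(logvar_q) + mu_q**2 - 1.0 - logvar_q)
    # Reconstruction term E_q[log p_theta(x^(i)|z)], estimated with samples z ~ q_phi(z|x^(i)).
    std = np.exp(0.5 * logvar_q)
    zs = mu_q + std * rng.standard_normal((n_samples, mu_q.size))
    rec = np.mean([log_lik_fn(z) for z in zs])
    return -kl + rec
```

Maximizing this quantity with respect to \(\boldsymbol{\phi}\) and \(\boldsymbol{\theta}\) is exactly the optimization view discussed next.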
Variational Bayesian methods are a very popular framework in machine learning, since they allow us to cast statistical inference problems into optimization problems. For example, the inference problem of determining the true posterior distribution \(p_{\boldsymbol{\theta}} \left( \textbf{z} | \textbf{x}^{(i)}\right)\) can be cast into an optimization problem by maximizing the ELBO, after introducing a variational approximation \(q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right)\) and a latent prior \(p_{\boldsymbol{\theta}} (\textbf{z})\). This is the main idea of variational auto-encoders (VAEs) by Kingma and Welling (2013).
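To connect this to code, here is a deliberately minimal VAE sketch (assuming PyTorch; the layer sizes, the single hidden layer, and the squared-error reconstruction term, which corresponds to a Gaussian likelihood up to constants, are arbitrary choices for illustration, not a faithful reproduction of the original implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Sketch of a VAE with a diagonal Gaussian encoder q_phi(z|x) and a
    decoder for p_theta(x|z); sizes and architecture are placeholders."""
    def __init__(self, d_x=784, d_z=20, d_h=400):
        super().__init__()
        self.enc = nn.Linear(d_x, d_h)
        self.enc_mu = nn.Linear(d_h, d_z)
        self.enc_logvar = nn.Linear(d_h, d_z)
        self.dec = nn.Sequential(nn.Linear(d_z, d_h), nn.ReLU(), nn.Linear(d_h, d_x))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized z ~ q_phi(z|x)
        return self.dec(z), mu, logvar

def negative_elbo(x_rec, x, mu, logvar):
    # Reconstruction term (squared error, i.e. a Gaussian log-likelihood up to
    # constants) plus the closed-form KL( q_phi(z|x) || N(0, I) ).
    rec = F.mse_loss(x_rec, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

Training then amounts to minimizing `negative_elbo` by stochastic gradient descent, jointly over the encoder parameters \(\boldsymbol{\phi}\) and the decoder parameters \(\boldsymbol{\theta}\).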
Let’s start with the optimization problem:
\[ \min_{\boldsymbol{\phi}} D_{KL} \left[ q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right) || p_{\boldsymbol{\theta}} \left( \textbf{z} | \textbf{x}\right) \right], \]
i.e., we are aiming to find the parameters \(\boldsymbol{\phi}\) such that the variational approximation \(q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right)\) is as similar as possible to the true posterior, in the sense of having minimal KL divergence (for now, just ignore the fact that the KL divergence is not symmetric). Unfortunately, we cannot compute this quantity directly, since we do not have access to the true posterior (if we did, we would not need to introduce a variational approximation in the first place).
However, we can rewrite the KL divergence as follows
\[ \begin{align} D_{KL} \left[ q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right) || p_{\boldsymbol{\theta}} \left( \textbf{z} | \textbf{x}\right) \right] &= \int_\textbf{z} q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right) \log \frac {q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right)} {p_{\boldsymbol{\theta}} \left( \textbf{z} | \textbf{x}\right)} d\textbf{z}\\ &= \mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right)} \left[\log q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right) - \log p_{\boldsymbol{\theta}} \left( \textbf{z} | \textbf{x}\right) \right] \end{align} \]
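Written as an expectation over \(q_{\boldsymbol{\phi}}\), the KL divergence can be estimated by simple Monte Carlo. A quick 1-D sanity check with two made-up Gaussians (chosen so that the closed-form KL is available for comparison):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy 1-D example: q = N(0.5, 0.8^2), p = N(0, 1). Both densities are known
# here, so the Monte Carlo estimate can be compared to the closed form.
mu_q, s_q, mu_p, s_p = 0.5, 0.8, 0.0, 1.0
z = rng.normal(mu_q, s_q, size=100_000)                       # z ~ q
kl_mc = np.mean(norm.logpdf(z, mu_q, s_q) - norm.logpdf(z, mu_p, s_p))
kl_exact = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5
print(kl_mc, kl_exact)   # the two values agree up to Monte Carlo noise
```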
Remember Bayes’ rule:
\[ p_{\boldsymbol{\theta}} \left( \textbf{z} | \textbf{x}\right) = \frac {p_{\boldsymbol{\theta}} \left( \textbf{x} | \textbf{z}\right) p_{\boldsymbol{\theta}} \left( \textbf{z} \right) } {p_{\boldsymbol{\theta}} \left(\textbf{x}\right)} \]
Let’s substitute this into the equation above (and use the logarithm rules):
\[ \begin{align} D_{KL} \left[ q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right) || p_{\boldsymbol{\theta}} \left( \textbf{z} | \textbf{x}\right) \right] & = \mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right)} \left[ \log q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right) - \left(\log p_{\boldsymbol{\theta}} \left( \textbf{x} | \textbf{z} \right) + \log p_{\boldsymbol{\theta}} \left( \textbf{z} \right) - \log p_{\boldsymbol{\theta}} \left(\textbf{x}\right) \right)\right] \\ &=\mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right)} \left[ \log q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right) - \log p_{\boldsymbol{\theta}} \left( \textbf{x} | \textbf{z} \right) - \log p_{\boldsymbol{\theta}} \left( \textbf{z} \right) \right] + \log p_{\boldsymbol{\theta}} \left(\textbf{x}\right) \end{align} \]
That already looks suspiciously close to what we actually want to show. Let’s move the log-evidence term to one side and everything else to the other to see more clearly what we have:
\[ \begin{align} \log p_{\boldsymbol{\theta}} \left(\textbf{x}\right) &= D_{KL} \left[ q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right) || p_{\boldsymbol{\theta}} \left( \textbf{z} | \textbf{x}\right) \right] \\ &\quad - \underbrace{\mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right)} \left[ \log q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right) - \log p_{\boldsymbol{\theta}} \left( \textbf{x} | \textbf{z} \right) - \log p_{\boldsymbol{\theta}} \left( \textbf{z} \right) \right]}_{- \mathcal{L}} \end{align} \]
Some final rewriting gives us the ELBO in the form stated at the beginning:
\[ \begin{align} \mathcal{L} &= - \mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right)} \left[ \log q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right) - \log p_{\boldsymbol{\theta}} \left( \textbf{x} | \textbf{z} \right) - \log p_{\boldsymbol{\theta}} \left( \textbf{z} \right) \right] \\ &= -\mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right)} \left[ \log q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right) - \log p_{\boldsymbol{\theta}} \left( \textbf{z} \right) \right] + \mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left(\textbf{z} | \textbf{x} \right)} \left[ \log p_{\boldsymbol{\theta}} \left( \textbf{x} | \textbf{z} \right)\right]\\ &= - D_{KL} \left( q_{\boldsymbol{\phi}} \left( \textbf{z} | \textbf{x} \right) || p_{\boldsymbol{\theta}} (\textbf{z})\right) + \mathbb{E}_{q_{\boldsymbol{\phi}} \left(\textbf{z}|\textbf{x}\right)} \left[ \log p_{\boldsymbol{\theta}} \left( \textbf{x} | \textbf{z}\right) \right] \end{align} \]
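We can check the resulting decomposition \(\log p_{\boldsymbol{\theta}}(\textbf{x}) = D_{KL}\left[ q_{\boldsymbol{\phi}}(\textbf{z}|\textbf{x}) \,||\, p_{\boldsymbol{\theta}}(\textbf{z}|\textbf{x})\right] + \mathcal{L}\) numerically in a toy conjugate Gaussian model, chosen only because prior, likelihood, posterior, and evidence are all tractable there:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy conjugate model (an assumption for illustration): z ~ N(0, 1), x|z ~ N(z, sigma^2).
# Then p(z|x) = N(x/(1+sigma^2), sigma^2/(1+sigma^2)) and p(x) = N(0, 1+sigma^2).
sigma, x = 0.5, 1.3
m_post, s_post = x / (1 + sigma**2), np.sqrt(sigma**2 / (1 + sigma**2))
log_evidence = norm.logpdf(x, 0.0, np.sqrt(1 + sigma**2))

# Arbitrary (deliberately suboptimal) variational approximation q(z|x) = N(m_q, s_q^2).
m_q, s_q = 0.6, 0.4
z = rng.normal(m_q, s_q, size=200_000)                        # z ~ q
elbo = np.mean(norm.logpdf(x, z, sigma) + norm.logpdf(z, 0, 1) - norm.logpdf(z, m_q, s_q))
kl_to_posterior = np.mean(norm.logpdf(z, m_q, s_q) - norm.logpdf(z, m_post, s_post))

print(log_evidence, kl_to_posterior + elbo)   # equal up to Monte Carlo noise
```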
Applying Jensen’s inequality is a different path to arrive at the variational lower bound on the log-likelihood.
Jensen’s Inequality: Let \(X\) be a random variable and \(\varphi\) a concave function; then
\[ \varphi \Big(\mathbb{E} \left[ X \right] \Big) \ge \mathbb{E} \Big[ \varphi \left( X \right) \Big] \]
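A quick numerical illustration with the logarithm as the concave function and an arbitrary positive (log-normal) random variable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Jensen's inequality for the concave function log:
# log(E[X]) >= E[log X] for a positive random variable X (here log-normal).
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
print(np.log(np.mean(x)), np.mean(np.log(x)))   # ~0.5 (= sigma^2/2) vs ~0.0
```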
Let’s simply state the marginal likelihood of \(\textbf{x}\) and include our variational approximation of the true posterior:
\[ p_{\boldsymbol{\theta}}(\textbf{x}) = \int p_{\boldsymbol{\theta}} (\textbf{x}, \textbf{z}) d\textbf{z} = \int q_{\boldsymbol{\phi}} \left( \textbf{z}| \textbf{x} \right) \frac {p_{\boldsymbol{\theta}} (\textbf{x}, \textbf{z})} {q_{\boldsymbol{\phi}} \left( \textbf{z} | \textbf{x} \right)} d\textbf{z} = \mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left( \textbf{z}| \textbf{x} \right)} \left[ \frac {p_{\boldsymbol{\theta}} (\textbf{x}, \textbf{z})} {q_{\boldsymbol{\phi}} \left( \textbf{z} | \textbf{x} \right)} \right] \]
Applying the logarithm (a concave function) and Jensen’s inequality, we arrive at
\[ \log p_{\boldsymbol{\theta}}(\textbf{x}) = \log \mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left( \textbf{z}| \textbf{x} \right)} \left[ \frac {p_{\boldsymbol{\theta}} (\textbf{x}, \textbf{z})} {q_{\boldsymbol{\phi}} \left( \textbf{z} | \textbf{x} \right)} \right] \ge \mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left( \textbf{z}| \textbf{x} \right)} \left[ \log \frac {p_{\boldsymbol{\theta}} (\textbf{x}, \textbf{z})} {q_{\boldsymbol{\phi}} \left( \textbf{z} | \textbf{x} \right)} \right] = \mathcal{L} \]
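In the same toy conjugate Gaussian model as above, we can compare a Monte Carlo estimate of \(\mathcal{L}\) (the mean of the log-ratios) with the importance-sampling estimate of \(\log p_{\boldsymbol{\theta}}(\textbf{x})\) (the log of the mean of the ratios) and with the exact log-evidence:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Same toy conjugate model as above: z ~ N(0,1), x|z ~ N(z, sigma^2), q(z|x) = N(m_q, s_q^2).
sigma, x, m_q, s_q = 0.5, 1.3, 0.6, 0.4
z = rng.normal(m_q, s_q, size=200_000)                        # z ~ q
log_w = norm.logpdf(x, z, sigma) + norm.logpdf(z, 0, 1) - norm.logpdf(z, m_q, s_q)

elbo = np.mean(log_w)                                 # E_q[ log( p(x,z)/q(z|x) ) ]
log_px_is = np.log(np.mean(np.exp(log_w)))            # log E_q[ p(x,z)/q(z|x) ], importance sampling
log_px = norm.logpdf(x, 0.0, np.sqrt(1 + sigma**2))   # exact log-evidence

print(elbo, log_px_is, log_px)   # elbo <= log_px; the IS estimate is close to log_px
```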
Some final rewriting yields the same expression for the ELBO as before:
\[ \begin{align*} \mathcal{L} &= \mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left( \textbf{z}| \textbf{x} \right)} \left[ \log p_{\boldsymbol{\theta}} (\textbf{z} ) + \log p_{\boldsymbol{\theta}} (\textbf{x} | \textbf{z} ) - \log q_{\boldsymbol{\phi}} \left( \textbf{z} | \textbf{x} \right) \right] \\ &= -\mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left( \textbf{z}| \textbf{x} \right)} \left[ \log q_{\boldsymbol{\phi}} \left( \textbf{z} | \textbf{x} \right) - \log p_{\boldsymbol{\theta}} (\textbf{z} )\right] + \mathbb{E}_{\textbf{z} \sim q_{\boldsymbol{\phi}} \left( \textbf{z}| \textbf{x} \right)} \left[ \log p_{\boldsymbol{\theta}} (\textbf{x} | \textbf{z} ) \right]\\ &= - D_{KL} \left( q_{\boldsymbol{\phi}} \left( \textbf{z} | \textbf{x} \right) || p_{\boldsymbol{\theta}} (\textbf{z})\right) + \mathbb{E}_{q_{\boldsymbol{\phi}} \left(\textbf{z}|\textbf{x}\right)} \left[ \log p_{\boldsymbol{\theta}} \left( \textbf{x} | \textbf{z}\right) \right] \end{align*} \]