How can the variance of the score function estimator be decreased?

Remind that the score function estimator moves the gradient of an expectation inside the expecation to allow for Monte Carlo integration, i.e.,

\[ \begin{align} \nabla_{\boldsymbol{\theta}} \mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \big[ f (\textbf{z})\big] &= \mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \left[ f(\textbf{z}) \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}} (\textbf{z}) \right] \\ &\approx \frac {1}{N} \sum_{i=1}^N f\left(\textbf{z}^{(i)}\right) \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}} \left( \textbf{z}^{(i)}\right) \quad \text{with} \quad \textbf{z}^{(i)} \sim p_{\boldsymbol{\theta}}(\textbf{z}), \end{align} \]

where the term \(\nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}} (\textbf{z})\) is called the score function. A nice property of the score function is that its expectated value is zero, i.e.,

\[ \mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \left[ \nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}} (\textbf{z}) \right] = \textbf{0} \]

Using this property and the linearity of the expectation, we can add an arbitrary term with zero expectation:

\[ \begin{align} \nabla_{\boldsymbol{\theta}} \mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \big[ f (\textbf{z})\big] &= \mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \left[ \Big(f(\textbf{z}) - \lambda \Big)\nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}} (\textbf{z}) \right] \\ &\approx \frac {1}{N} \sum_{i=1}^N \Big(f\left(\textbf{z}^{(i)}\right) - \lambda \Big)\nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}} \left( \textbf{z}^{(i)}\right) \quad \text{with} \quad \textbf{z}^{(i)} \sim p_{\boldsymbol{\theta}}(\textbf{z}), \end{align} \]

where \(\lambda\) is called control variate or baseline which allows us to decrease the variance. This can be done via

minimzing the variance of the above term w.r.t. \(\lambda\), or
using a data-dependent baseline or neural baseline in which we train the control variate such that

\[ \lambda_{\boldsymbol{\phi}} (\textbf{z}) \approx f(\textbf{z}) \]

Expectation of Score Function

The expectation of the score function is zero:

\[ \begin{align} \mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \left[ \nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}} (\textbf{z}) \right] &= \int \nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}} (\textbf{z}) p_{\boldsymbol{\theta}} (\textbf{z}) d\textbf{z} \\ &=\int \frac {\nabla_{\boldsymbol{\theta}} p_{\boldsymbol{\theta}} (\textbf{z})} {p_{\boldsymbol{\theta}} (\textbf{z})} p_{\boldsymbol{\theta}} (\textbf{z}) d\textbf{z} &&\text{(log derivative trick)}\\ &= \nabla_{\boldsymbol{\theta}} \int p_{\boldsymbol{\theta}} (\textbf{z}) d\textbf{z} &&\text{(Leibniz integral rule)}\\ &= \nabla_{\boldsymbol{\theta}} \textbf{1} = \textbf{0} \end{align} \]

The log derivative trick comes from the application of the chain rule, see this post.

Variance of Score Function

The variance of the score function estimator without baseline \(\lambda\) \[ \begin{align} &\text{Var} \Big[ \mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \left[ f(\textbf{z})\nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}} (\textbf{z}) \right] \Big] \\ &= \mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \Big[ \Big( f(\textbf{z}) \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}} (\textbf{z}) \Big)^2\Big] - \Big(\mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \Big[ f(\textbf{z}) \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}} (\textbf{z}) \Big] \Big)^2\\ &= \text{Var}_{\text{score}} \end{align} \]

The variance of the score function estimator including baseline \(\lambda\)

\[ \begin{align} &\text{Var} \Big[ \mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \left[ \Big(f(\textbf{z}) - \lambda \Big)\nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}} (\textbf{z}) \right] \Big] \\ &= \mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \Big[ \big( \Big(f(\textbf{z}) - \lambda \Big)\nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}} (\textbf{z}) \big)^2\Big] - \Big(\mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \Big[ \Big(f(\textbf{z}) - \lambda \Big)\nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}} (\textbf{z}) \Big] \Big)^2\\ &=\text{Var}_{\text{score}} - \mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \Big[ \big( 2 f(\textbf{z}) \lambda - \lambda^2 \big) \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}} (\textbf{z}) \Big] + \Big( \mathbb{E}_{\textbf{z} \sim p_{\boldsymbol{\theta}}\left(\textbf{z} \right)} \big[ \lambda \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}} (\textbf{z}) \big] \Big)^2 \end{align} \]