$$ \newcommand{\dint}{\mathrm{d}} \newcommand{\vphi}{\boldsymbol{\phi}} \newcommand{\vpi}{\boldsymbol{\pi}} \newcommand{\vpsi}{\boldsymbol{\psi}} \newcommand{\vomg}{\boldsymbol{\omega}} \newcommand{\vsigma}{\boldsymbol{\sigma}} \newcommand{\vzeta}{\boldsymbol{\zeta}} \renewcommand{\vx}{\mathbf{x}} \renewcommand{\vy}{\mathbf{y}} \renewcommand{\vz}{\mathbf{z}} \renewcommand{\vh}{\mathbf{h}} \renewcommand{\b}{\mathbf} \renewcommand{\vec}{\mathrm{vec}} \newcommand{\vecemph}{\mathrm{vec}} \newcommand{\mvn}{\mathcal{MN}} \newcommand{\G}{\mathcal{G}} \newcommand{\M}{\mathcal{M}} \newcommand{\N}{\mathcal{N}} \newcommand{\S}{\mathcal{S}} \newcommand{\I}{\mathcal{I}} \newcommand{\diag}[1]{\mathrm{diag}(#1)} \newcommand{\diagemph}[1]{\mathrm{diag}(#1)} \newcommand{\tr}[1]{\text{tr}(#1)} \renewcommand{\C}{\mathbb{C}} \renewcommand{\R}{\mathbb{R}} \renewcommand{\E}{\mathbb{E}} \newcommand{\D}{\mathcal{D}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerbig}[1]{\left \langle #1 \right \rangle} \newcommand{\abs}[1]{\lvert #1 \rvert} \newcommand{\norm}[1]{\lVert #1 \rVert} \newcommand{\two}{\mathrm{II}} \newcommand{\GL}{\mathrm{GL}} \newcommand{\Id}{\mathrm{Id}} \newcommand{\grad}[1]{\mathrm{grad} \, #1} \newcommand{\gradat}[2]{\mathrm{grad} \, #1 \, \vert_{#2}} \newcommand{\Hess}[1]{\mathrm{Hess} \, #1} \newcommand{\T}{\text{T}} \newcommand{\dim}[1]{\mathrm{dim} \, #1} \newcommand{\partder}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\rank}[1]{\mathrm{rank} \, #1} \newcommand{\inv}1 \newcommand{\map}{\text{MAP}} \newcommand{\L}{\mathcal{L}} \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} $$

Convolution of Gaussians and the Probit Integral

Gaussian distributions are very useful in Bayesian inference due to their (many!) convenient properties. In this post we take a look at two of them: the convolution of two Gaussian pdfs and the integral of the probit function w.r.t. a Gaussian measure.

Convolution and the Predictive Distribution of Gaussian Regression

Let’s start with the convolution $\N(z_1 \mid \mu_1, \sigma^2_1) * \N(z_2 \mid \mu_2, \sigma^2_2)$ of two Gaussians $\N(z_1 \mid \mu_1, \sigma^2_1)$ and $\N(z_2 \mid \mu_2, \sigma^2_2)$ on $\R$:

\[\N(z_1 \mid \mu_1, \sigma^2_1) * \N(z_2 \mid \mu_2, \sigma^2_2) := \int_{\R} \N(z_1 - z_2 \mid \mu_1, \sigma^2_1) \, \N(z_2 \mid \mu_2, \sigma^2_2) \,dz_2 .\]

Proposition 1 (Convolution of Gaussians) Let $\N(z_1 \mid \mu_1, \sigma^2_1)$ and $\N(z_2 \mid \mu_2, \sigma^2_2)$ be two Gaussians on $\R$.

\[\N(z_1 \mid \mu_1, \sigma^2_1) * \N(z_2 \mid \mu_2, \sigma^2_2) = \N(z_1 \mid \mu_1+\mu_2, \sigma^2_1+\sigma^2_2) .\]

Proof. By the convolution theorem, the convolution of two functions is equivalent to the product of the functions’ Fourier transforms. The Fourier transform of a density function is given by its characteristic function. For a Gaussian $f(x) := \N(x, \mu, \sigma^2)$, it is $\varphi(u) := \exp\left(-iu\mu - \frac{1}{2}u^2\sigma^2\right)$. Therefore, if $\varphi_1$ and $\varphi_2$ are the characteristic functions of $\N(z_1 \mid \mu_1, \sigma^2_1)$ and $\N(z_2 \mid \mu_2, \sigma^2_2)$, respectively, then

\[\begin{align} (\varphi_1 \varphi_2)(u) &= \exp\left(-iu\mu_1 - \frac{1}{2}u^2\sigma_1^2\right) \exp\left(-iu\mu_2 - \frac{1}{2}u^2\sigma_2^2\right) \\[5pt] % &= \exp\left(-iu(\mu_1+\mu_2) - \frac{1}{2}u^2(\sigma_1^2 + \sigma_2^2)\right) , \end{align}\]

which we can immediately identify as the characteristic function of a Gaussian with mean $\mu_1 + \mu_2$ and variance $\sigma_1^2 + \sigma_2^2$.

$ \square $

This result is very useful in Bayesian machine learning, especially to obtain the predictive distribution of a Bayesian regression model. For instance, when one knows that the distribution over the regressor’s output is a Gaussian $\N(f \mid \mu, \sigma^2)$ and we assume that the output is noisy $\N(y \mid f, s^2)$.

Corollary 2 (Gaussian Regression). Let $p(y \mid f) = \N(y \mid f, s^2)$ and $p(f) = \N(f \mid \mu, \sigma^2)$ are Gaussians on $\R$. Then,

\[p(y) = \int_\R p(y \mid f) \, p(f) \,df = \N(y \mid f, \sigma^2 + s^2) .\]

Proof. First, notice that Gaussian is symmetric:

\[\begin{align} \N(x - a \mid \mu, \sigma^2) &= \frac{1}{Z} \exp\left(-\frac{1}{2\sigma^2} ((x-a)-\mu)^2\right) \\[5pt] % &= \frac{1}{Z} \exp\left(-\frac{1}{2\sigma^2} (x-(\mu+a))^2\right) \\[5pt] % &= \N(x \mid \mu + a, \sigma^2) , \end{align}\]

for $x, a \in \R$, where $Z$ is the normalizing constant. Using this, we can write the integral above as a convolution:

\[\begin{align} \int_\R \N(y \mid f, s^2) \, \N(f \mid \mu, \sigma^2) \,df &= \int_\R \N(y \mid 0+f, s^2) \, \N(f \mid \mu, \sigma^2) \,df \\[5pt] % &= \N(y \mid 0, s^2) * \N(f \mid \mu, \sigma^2) . \end{align}\]

Thus, by Proposition 1, we have $p(y) = \N(y \mid f, s^2 + \sigma^2)$.

$ \square $

The Probit Integral and the Probit Approximation

The probit function $\Phi$ is the cumulative distribution function of the standard Normal distribution $\N(x \mid 0, 1)$ on $\R$, i.e., $\Phi(z) := \int_{-\infty}^z \N(x \mid 0, 1) \,dx$. It can conveniently be written in terms of the error function

\[\mathrm{erf}(z) := \frac{2}{\sqrt{\pi}} \int_0^z \exp(-x^2) \,dx\]

\[\Phi(z) = \frac{1}{2} \left( 1 + \mathrm{erf}\left(\frac{z}{\sqrt{2}}\right) \right) .\]

Proposition 3 (The Probit Integral). If $\N(x \mid \mu, \sigma^2)$ be a Gaussian on $\R$ and $a, b \in \R$ then

\[\int_{\R} \Phi(ax + b) \, \N(x \mid \mu, \sigma^2) \,dx = \Phi\left(\frac{a\mu + b}{\sqrt{1 + a^2 \sigma^2}}\right).\]

Proof. The standard property of the error function [2] says that

\[\int_{\R} \mathrm{erf}(ax + b) \, \N(x \mid \mu, \sigma^2) \, dx = \mathrm{erf}\left(\frac{a\mu+b}{\sqrt{1 + 2 a^2 \sigma^2}}\right) .\]

So,

\[\require{cancel} \begin{align} \int_{\R} &\left(\frac{1}{2} + \frac{1}{2} \mathrm{erf}\left(\frac{ax+b}{\sqrt{2}}\right)\right) \, \N(x \mid \mu, \sigma^2) \,dx \\[5pt] % &= \frac{1}{2} + \frac{1}{2} \int_{\R} \mathrm{erf}\left(\left(\frac{a}{\sqrt{2}}\right)x+\left(\frac{b}{\sqrt{2}}\right)\right) \, \N(x \mid \mu, \sigma^2) \,dx \\[5pt] % &= \frac{1}{2} + \frac{1}{2} \mathrm{erf}\left(\frac{(a\mu+b)/\sqrt{2}}{\sqrt{1 + \cancel{2} (a/\cancel{\sqrt{2}})^2 \sigma^2}}\right) \\[5pt] % &= \frac{1}{2} \left(1 + \mathrm{erf}\left(\frac{a\mu+b}{\sqrt{2} \sqrt{1 + a^2 \sigma^2}}\right) \right) \\[5pt] % &= \Phi\left(\frac{a\mu+b}{\sqrt{1 + a^2 \sigma^2}}\right) . \end{align}\]

$ \square $

This integral is very useful for Bayesian inference since it enables us to approximate the following integral that is ubiquitous in Bayesian binary classifications

\[\int_{\R} \sigma(z) \, \N(z \mid m, s^2) \,dx ,\]

where $\sigma(z) := 1/(1 + \exp(-z))$ is the logistic function.

The key idea is to notice that the probit and logistic function are both sigmoid functions. That is, their graphs have a similar “S-shape”. Moreover, their images are both $[0, 1]$. However, they are a bit different—the probit function is more “horizontally stretched” compared to the logistic function.

So, the strategy to approximate the integral above is as follows: (i) horizontally “contract” the probit function and then (ii) use Proposition 3 to get an analytic approximation to the integral.

For the first step, this can be done by a simple change of coordinate: stretch the domain of the probit function with a constant $\lambda$, i.e., $z \mapsto \lambda z$. There are several “good” values for $\lambda$, but commonly it is chosen to be $\lambda = \sqrt{\pi/8}$, which makes the probit function have the same derivative as the logistic function at zero. That is, we have the approximation $\sigma(z) \approx \Phi(\lambda z) = \Phi(\sqrt{\pi/8} \, z)$.

Corollary 4. If $\N(z \mid m, s^2)$ is a Gaussian on $\R$, then

\[\int_{\R} \Phi(\lambda z) \, \N(z \mid m, s^2) \, dz = \Phi\left( \frac{m}{\sqrt{\lambda^{-2} + s^2}} \right) .\]

Proof. By Proposition 3, we have

\[\begin{align} \int_{\R} \Phi(\lambda \, z) \, \N(z \mid m, s^2) \, dz &= \Phi\left( \frac{\lambda \mu}{\sqrt{1 + \lambda^2 s^2}} \right) \\[5pt] % &= \Phi\left( \frac{\cancel{\lambda} \mu}{\cancel{\lambda} \sqrt{\lambda^{-2} + s^2}} \right) . \end{align}\]

$ \square $

Now we are ready to obtain the final approximation, often called the probit approximation.

Proposition 5 (Probit Approximation) If $\N(z \mid m, s^2)$ is a Gaussian on $\R$ and $\sigma(z) \approx \Phi\left(\sqrt{\pi/8} \, z\right)$, then

\[\int_{\R} \sigma(z) \, \N(z \mid m, s^2) \, dz \approx \sigma\left( \frac{m}{\sqrt{1 + \pi/8 \, s^2}} \right) .\]

Proof. Let $\lambda = \sqrt{\pi/8}$. Using Corollary 4 and substituting $\Phi(z) \approx \sigma\left(\lambda^{-1} \, z\right)$:

\[\begin{align} \int_{\R} \sigma(z) \, \N(z \mid m, s^2) \,dz &\approx \Phi\left( \frac{m}{\sqrt{\lambda^{-2} + s^2}} \right) \\[5pt] % &= \sigma\left( \frac{\lambda^{-1} \, m}{\sqrt{\lambda^{-2} + s^2}} \right) \\[5pt] % &= \sigma\left( \frac{\cancel{\lambda^{-1}} \, m}{\cancel{\lambda^{-1}} \, \sqrt{1 + \lambda^2 \, s^2}} \right) . \end{align}\]

Substituting $\lambda^2 = \pi/8$ into the last equation yields the desired result.

$ \square $

The probit approximation can also be used to obtain an approximation to the following integral, ubiquitous in multi-class classifications:

\[\int_{\R^k} \mathrm{softmax}(z) \, \N(z \mid \mu, \varSigma) \, dz ,\]

where the Gaussian is defined on $\R^k$ and the softmax function is identified by its components $\exp(z_i)/\sum_{j=1}^k \exp(z_j)$ for $i = 1, \dots, k$.

Proposition 6 (Multiclass Probit Approximation; Gibbs, 1998). If $\N(z \mid \mu, \varSigma)$ is a Gaussian on $\R^k$ and $\sigma(z) \approx \Phi(\sqrt{\pi/8}\,z)$, then

\[\int_{\R^k} \mathrm{softmax}(z) \, \N(z \mid \mu, \varSigma) \, dz \approx \mathrm{softmax}\left( \frac{\mu}{\sqrt{1 + \pi/8 \, \diag \varSigma}} \right) ,\]

where the division in the r.h.s. is component-wise.

Proof. The proof is based on [3]. Notice that we can write the $i$-th component of $\mathrm{softmax}(z)$ as $1/(1 + \sum_{j \neq i} \exp(-(z_i - z_j)))$. So, for each $i = 1, \dots, k$, using $z_{ij} := z_i - z_j$, we can write

\[\begin{align} \frac{1}{1 + \sum_{j \neq i} \exp(-z_{ij})} &= \frac{1}{1 - (K-1) + \sum_{j \neq i} \frac{1}{\frac{1}{1 + \exp(-z_{ij})}}} \\[5pt] % &= \frac{1}{2-K+\sum_{j \neq i} \frac{1}{\sigma(z_{ij})}} . \end{align}\]

Then, we use the following approximations (which admittedly might be quite loose):

$\E(f(x)) \approx f(\E(x))$,
the mean-field approximation $\N(z \mid \mu, \varSigma) \approx \N(z \mid \mu, \diag{\varSigma})$, and thus we have $z_i - z_j \sim \N(z_{ij} \mid \mu_i - \mu_j, \varSigma_{ii} + \varSigma_{jj})$, and
using the probit approximation (Proposition 5), with a further approximation

\[\begin{align} \int_{\R} \sigma(z_{ij}) \, \N(z_{ij} \mid \mu_i - \mu_j, \varSigma_{ii} + \varSigma_{jj}) \, dz_{ij} &\approx \sigma \left( \frac{\mu_i - \mu_j}{\sqrt{1 + \pi/8 \, \varSigma_{ii} + \varSigma_{jj}}} \right) \\[5pt] % &\approx \sigma \left( \frac{\mu_i}{\sqrt{1 + \pi/8 \, \varSigma_{ii}}} - \frac{\mu_j}{\sqrt{1 + \pi/8 \, \varSigma_{jj}}} \right) , \end{align}\]

we obtain

\[\begin{align} \int_{\R^k} \mathrm{softmax}_i(z) \, \N(z \mid \mu, \varSigma) &\approx \frac{1}{2-K+\sum_{j \neq i} \frac{1}{\E \sigma(z_{ij})}} \\[5pt] % &\approx \frac{1}{2-K+\sum_{j \neq i} \frac{1}{\sigma \left( \frac{\mu_i}{\sqrt{1 + \pi/8 \, \varSigma_{ii}}} - \frac{\mu_j}{\sqrt{1 + \pi/8 \, \varSigma_{jj}}} \right)}} \\[5pt] % &= \frac{1}{1 + \sum_{j \neq i} \exp\left( -\left(\frac{\mu_i}{\sqrt{1 + \pi/8 \, \varSigma_{ii}}} - \frac{\mu_j}{\sqrt{1 + \pi/8 \, \varSigma_{jj}}} \right)\right)} \\[5pt] % &= \frac{\exp\left(\mu_i/\sqrt{1 + \pi/8 \, \varSigma_{ii}}\right)}{\sum_{j=1}^k \exp\left(\mu_j/\sqrt{1 + \pi/8 \, \varSigma_{jj}}\right)} \end{align}\]

We identify the last equation above as the $i$-th component of $\mathrm{softmax}\left( \frac{\mu}{\sqrt{1 + \pi/8 \, \diag \varSigma}} \right)$.

$ \square $

References

Ng, Edward W., and Murray Geller. “A table of integrals of the error functions.” Journal of Research of the National Bureau of Standards B 73, no. 1 (1969): 1-20.
Gibbs, Mark N. Bayesian Gaussian processes for regression and classification. Dissertation, University of Cambridge, 1998.
Lu, Zhiyun, Eugene Ie, and Fei Sha. “Mean-Field Approximation to Gaussian-Softmax Integral with Application to Uncertainty Estimation.” arXiv preprint arXiv:2006.07584 (2020).