RELIABILITY AND ROBUSTNESS


For a fixed model m, the analytically intractable Bayesian posterior distribution
of the parameters θ is given by Bayes’ rule:
\[
P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta \mid m)}{P(D \mid m)} \tag{2.18}
\]

where P(D | θ) is the likelihood of θ (in model m), P(θ) is the prior
probability of θ, and P(θ | D) is the posterior of θ given data D.

The denominator P (D|m) is intractable, since it requires integrating over all
possible parameter values weighted by their probabilities. This is known as
the inference problem, which is the main challenge in BDL, as the posterior
distribution is required to compute the predictive distribution for any new input
(Equation (3.1) further explains this).
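
To make the role of the denominator concrete, the following is a minimal sketch
on a toy problem that is not taken from the text: a single coin-bias parameter θ
with a flat prior, where P(D | m) can still be computed by numerical integration
on a grid. The function names, the coin-flip model, and the grid size are
illustrative assumptions.

```python
import numpy as np


def trapezoid(y, x):
    """Trapezoidal rule for samples y taken at grid points x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))


def grid_posterior(heads, tails, num_points=2001):
    """Posterior over a coin bias theta on a grid (toy model, flat prior)."""
    theta = np.linspace(0.0, 1.0, num_points)
    prior = np.ones_like(theta)                         # P(theta | m), uniform on [0, 1]
    likelihood = theta**heads * (1.0 - theta)**tails    # P(D | theta)
    unnormalized = likelihood * prior
    evidence = trapezoid(unnormalized, theta)           # P(D | m), the denominator
    posterior = unnormalized / evidence                 # P(theta | D), Bayes' rule
    return theta, posterior, evidence


theta, posterior, evidence = grid_posterior(heads=7, tails=3)
posterior_mean = trapezoid(theta * posterior, theta)
print(f"evidence = {evidence:.6e}, posterior mean = {posterior_mean:.4f}")
```

With one parameter the grid has 2001 points. A grid of the same resolution over
d parameters would need 2001^d points, which is why this direct approach does
not carry over to the high-dimensional parameter spaces of neural networks.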
In practice, BNNs are often trained with Variational Inference (VI), which
approximates the high-dimensional posterior distribution with a tractable
distribution family, such as a Gaussian distribution [46]. Let p(θ | D)
be the intractable posterior distribution of parameters θ given observed data D,
which will be approximated with a simpler, tractable distribution q(θ | D; φ),
parameterized by φ (e.g., mean and variance).
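
As an illustration of such a tractable family, the sketch below implements a
fully factorized Gaussian over a small parameter vector, with φ = (mu, rho).
Mapping rho to a standard deviation through a softplus is a common choice but an
assumption here, as are all function and variable names; the text only states
that φ may hold a mean and a variance.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x)); keeps the standard deviation positive."""
    return np.logaddexp(0.0, x)


def sample_q(mu, rho, rng, num_samples=1):
    """Draw theta ~ q(theta; phi) as mu + sigma * eps with eps ~ N(0, I)."""
    sigma = softplus(rho)
    eps = rng.standard_normal((num_samples,) + mu.shape)
    return mu + sigma * eps


def log_q(theta, mu, rho):
    """Log-density of the factorized Gaussian, summed over parameter dimensions."""
    sigma = softplus(rho)
    z = (theta - mu) / sigma
    return np.sum(-0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi), axis=-1)


rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])    # variational means (hypothetical values)
rho = np.array([-1.0, 0.0, 1.0])   # unconstrained scale parameters (hypothetical values)
samples = sample_q(mu, rho, rng, num_samples=50_000)
print(samples.mean(axis=0), samples.std(axis=0), softplus(rho))
```

Both sampling and density evaluation cost time linear in the number of
parameters, which is what makes this family practical as an approximation.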
The key idea is to find the optimal variational parameters φ∗ that
minimize the Kullback–Leibler (KL) divergence between the approximating
distribution q(θ | D; φ) and the true posterior p(θ | D). This is achieved
by maximizing the evidence lower bound (ELBO), given by:

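\begin{align*}
\mathrm{ELBO}(\phi)
  &= \mathbb{E}_{q(\theta \mid D; \phi)}\left[\log p(D \mid \theta)\right]
     - \mathrm{KL}\left[q(\theta \mid D; \phi) \,\|\, p(\theta)\right] \tag{2.19} \\
  &= \int q(\theta \mid D; \phi)
     \log \frac{p(D \mid \theta)\, p(\theta)}{q(\theta \mid D; \phi)} \, d\theta \tag{2.20} \\
  &= \int q(\theta \mid D; \phi) \log p(D \mid \theta) \, d\theta
     - \int q(\theta \mid D; \phi)
       \log \frac{q(\theta \mid D; \phi)}{p(\theta)} \, d\theta, \tag{2.21}
\end{align*}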

where the first term of Equation (2.21) represents the expected log-likelihood
of the data given the parameters, and the second term quantifies the
dissimilarity between the variational distribution and the prior distribution
over the parameters. Since log p(D) = ELBO(φ) + KL[q(θ | D; φ) || p(θ | D)] and
log p(D) does not depend on φ, maximizing the ELBO with respect to φ is
equivalent to minimizing the KL divergence between q(θ | D; φ) and p(θ | D).
Because the KL divergence is non-negative, the ELBO is also a lower bound on the
log marginal likelihood, log p(D) ≥ ELBO(φ), in which the parameters θ have been
integrated out. By optimizing the variational parameters φ, we simultaneously