Bayesian Neural Networks
and
Variational Inference

Data analysis and machine learning

\[ \newcommand\vectheta{\boldsymbol{\theta}} \newcommand\vecdata{\mathbf{y}} \newcommand\model{\boldsymbol{\mathcal{M}}} \newcommand\vecx{\boldsymbol{x}} \newcommand{\transpose}{^{\scriptscriptstyle \top}} \renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\cov}{\mathbf{C}} \]

In previous lectures we described how neural networks can be used to make predictions and classifications. In the classification setting, under the hood it is actually regression being performed, with a softmax taken over the outputs in order to return a normalised pseudo-probability for each potential category. That pseudo-probability is very much pseudo-, as the value is only relative to the other output categories, and is conditioned on the network, its hyperparameters, et cetera. What if we gave the network something totally different from the training set? We will still get a prediction of which object it is (e.g., which digit between 0 and 9), and we may even get an output that says one class is far more probable than the others. Indeed, if you read the MNIST chicken blog post you will see that a model was 99.9% confident that a picture of a chicken was actually the digit 5. For contrast, for a handwritten digit that is "obviously" a 5, the model is 76.4% confident. What we are not getting, however, is representative uncertainty on the predictions from the network! You may think this is because estimating uncertainty is too expensive, and you're mostly right.

Bayesian Neural Networks will change that. And as you will see, Variational Inference (which can be applied to Bayesian Neural Networks) lets us approximate and propagate uncertainty in complex models where it would otherwise be too expensive to do so.

Bayesian Neural Networks

A Bayesian Neural Network can have any architecture that you would otherwise have for a "standard" neural network. The principal difference introduced by the prefix Bayesian is that we will:

  1. Introduce a prior \(p(w)\) on all the weights \(w\) in the network.
  2. Propagate our uncertainty in the weights to the output predictions such that we have posteriors of belief for a given class (or posterior predictions).

The claim is that Bayesian Neural Networks are able to capture epistemic uncertainty: the uncertainty in having the right model. (I doubt it.) Let's go through the math. Recall that to marginalise over any parameter, you will need a prior. Let us set a prior over the weights as \[ p(\vec{w}) = \frac{\exp{\left(-\alpha{}E_\vec{w}\right)}}{Z_\vec{w}\left(\alpha\right)} \] where \(\alpha\) is some hyperparameter and \[ Z_\vec{w}\left(\alpha\right) = \int\exp\left(-\alpha{}E_\vec{w}\right) d\vec{w} \quad . \] In theory we can set any prior we want, but in practice I think you will find that your prior on the weights of a neural network matters a lot! If we were introducing a simple neural network (like the one we built in numpy) then you could use something like Stan or PyMC and just build your neural network with the weights as model parameters, then use Hamiltonian Monte Carlo to sample those parameters. In this kind of situation we would expect the priors on the weights to act as a regulariser, and (maybe) save us from needing to perform dropout or cross-validation.
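To make the prior concrete, take one common choice for \(E_\vec{w}\) (an illustrative assumption; nothing above fixes it): the quadratic penalty \(E_\vec{w} = \frac{1}{2}\vec{w}\transpose\vec{w}\). The prior then factorises into an independent Gaussian on each of the \(W\) weights, \[ p(\vec{w}) = \frac{\exp\left(-\frac{\alpha}{2}\vec{w}\transpose\vec{w}\right)}{Z_\vec{w}\left(\alpha\right)} , \qquad Z_\vec{w}\left(\alpha\right) = \left(\frac{2\pi}{\alpha}\right)^{W/2} , \] that is, \(w_i \sim \mathcal{N}(0, 1/\alpha)\). With \(\alpha = 1\) this is exactly the \(\mathcal{N}(0, 1)\) prior placed on the weights in the code below, and it is why such priors act like an L2 (weight decay) regulariser.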

Here's what that might look like in Python, where we can feed this model directly to PyMC3 and use a Hamiltonian Monte Carlo sampler to sample posteriors on the weights in the network:

```python
# Based on https://docs.pymc.io/notebooks/bayesian_neural_network_advi.html
import numpy as np
import pymc3 as pm
import theano

floatX = theano.config.floatX


def construct_nn(ann_input, ann_output):
    n_hidden = 5

    # Initialize random weights between each layer
    init_1 = np.random.randn(X_train.shape[1], n_hidden).astype(floatX)
    init_2 = np.random.randn(n_hidden, n_hidden).astype(floatX)
    init_out = np.random.randn(n_hidden).astype(floatX)

    with pm.Model() as neural_network:
        ann_input = pm.Data('ann_input', X_train)
        ann_output = pm.Data('ann_output', Y_train)

        # Weights from input to hidden layer
        weights_in_1 = pm.Normal('w_in_1', 0, sigma=1,
                                 shape=(X_train.shape[1], n_hidden),
                                 testval=init_1)

        # Weights from 1st to 2nd layer
        weights_1_2 = pm.Normal('w_1_2', 0, sigma=1,
                                shape=(n_hidden, n_hidden),
                                testval=init_2)

        # Weights from hidden layer to output
        weights_2_out = pm.Normal('w_2_out', 0, sigma=1,
                                  shape=(n_hidden,),
                                  testval=init_out)

        # Build neural network using tanh activation function
        act_1 = pm.math.tanh(pm.math.dot(ann_input, weights_in_1))
        act_2 = pm.math.tanh(pm.math.dot(act_1, weights_1_2))
        act_out = pm.math.sigmoid(pm.math.dot(act_2, weights_2_out))

        # Binary classification -> Bernoulli likelihood
        out = pm.Bernoulli('out', act_out,
                           observed=ann_output,
                           total_size=Y_train.shape[0])  # IMPORTANT for minibatches

    return neural_network


neural_network = construct_nn(X_train, Y_train)
```
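Once the model is constructed, sampling the weight posteriors takes only a few more lines. Here is a minimal sketch, assuming the X_train and Y_train arrays from above and a recent PyMC3 release (where posterior predictive sampling is exposed as pm.sample_posterior_predictive):

```python
# Minimal sketch: sample the weight posteriors with NUTS, a Hamiltonian
# Monte Carlo variant. Assumes X_train and Y_train as defined above.
with neural_network:
    trace = pm.sample(draws=1000, tune=1000)

    # Posterior predictive draws of the Bernoulli output; averaging over
    # draws gives a class probability (with uncertainty) per data point.
    ppc = pm.sample_posterior_predictive(trace)

print(ppc['out'].mean(axis=0))
```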

But what if our neural network were very deep, with many tens or hundreds of thousands of weights? Hamiltonian Monte Carlo is efficient, but can we afford to sample so many model parameters? And should we explicitly be calculating partial derivatives between all model parameters when many of those parameters are uncorrelated? The answer to all of these questions is no. Thankfully, there are methods of approximating densities — including posterior densities — that we can use. These methods are described within the field of variational inference and are sufficiently powerful that they can make an intractable problem tractable. In fact, variational inference is already baked into some of the tools you have been using (like Stan).

Variational Inference

Imagine you have some complex density that you would like to describe. That density might be a posterior density, and by describe I mean we would like to draw samples from it so we can report quantiles, et cetera. Sampling that density can be extremely expensive, or even intractable.

In variational inference we choose a distribution family that could describe the target density (our posterior) that we do care about, and recast the sampling problem as an optimisation problem. Even though optimisation is Hard™, it is much easier and much more efficient than sampling. So any time you can cast an inference problem as an optimisation problem, you are replacing a very hard problem with a (relatively) easy one! There are a few steps to understand in how variational inference is implemented in practice in order to make it generalisable. I recommend this lecture by David Blei (Columbia) if you are interested.

Suppose our intractable probability distribution is \(p\). Variational inference will try to solve an optimisation problem over a class of tractable distributions \(\mathcal{Q}\) in order to find a \(q \in \mathcal{Q}\) that is most similar to \(p\), where similarity is measured by a divergence (for example, the Kullback-Leibler divergence) between the two distributions. We will then query \(q\) (instead of \(p\)) in order to get an approximate solution in finite time. In other words, we are trying to optimise some parameters of the distribution family \(\mathcal{Q}\) such that the distribution \(q\) closely approximates the distribution \(p\), and the hope is that the chosen distribution family \(\mathcal{Q}\) will have some nice properties, like a closed-form integral, being infinitely differentiable, et cetera.

We need some measure of how close \(q\) and \(p\) are, and of what exactly close means here. In most variational inference problems we use the Kullback-Leibler (K-L) divergence, because it makes the optimisation problem tractable. (There isn't a good theoretical reason, to my knowledge, why we should prefer the K-L divergence over any other divergence, and there are problems with the K-L divergence! We use it because, empirically, it works well enough.) We have introduced the K-L divergence a few times through this course, but each time we have just said it is a measure of (improper) distance from one distribution to another. Formally, the K-L divergence from distribution \(q\) to distribution \(p\) is \[ \mathcal{KL}\left(q\,||\,p\right) = \sum_{x} q(x)\log{\left(\frac{q(x)}{p(x)}\right)} \geq 0 \quad \textrm{for all } q, p \quad . \] The divergence is an improper distance: the divergence from \(q\) to \(p\) is not necessarily the same as the divergence from \(p\) to \(q\). It measures how much information it would take to translate one distribution into the other.
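As a quick numerical illustration of that asymmetry (a toy example of my own, not from the notes), here is the K-L divergence computed in both directions for two discrete distributions:

```python
import numpy as np

def kl_divergence(q, p):
    """K-L divergence KL(q || p) = sum_x q(x) log(q(x) / p(x)), in nats."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return np.sum(q * np.log(q / p))

# Two distributions over three outcomes.
q = np.array([0.80, 0.15, 0.05])
p = np.array([1/3, 1/3, 1/3])

print(kl_divergence(q, p))  # ~0.486 nats
print(kl_divergence(p, q))  # ~0.607 nats: different, so not symmetric
```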

Consider then that we have some intractable posterior distribution \(p\) and we are going to approximate it with some distribution \(q\). All we would need to do is minimise the K-L divergence \(\mathcal{KL}(q\,||\,p)\) from \(q\) to \(p\). It turns out that optimising \(\mathcal{KL}(q\,||\,p)\) directly is not possible because of the intractable normalisation constant (the fully marginalised likelihood) in our posterior \(p\): we can't even properly evaluate \(p\) because we don't know that constant. Instead, we optimise something that involves only the unnormalised probability, known as the variational lower bound (also known as ELBO: the evidence lower bound). If we take \(\tilde{p}\) to be the unnormalised posterior probability \[ p(x; \theta) = \frac{1}{Z(\theta)}\prod_{n} \tilde{p}(x_n; \theta) \] and use only \(\tilde{p}\) in our K-L divergence calculation to define an objective function \(J(q)\), \[ \begin{eqnarray} J(q) &=& \sum_{x}q(x)\log\frac{q(x)}{\tilde{p}(x)} \\ &=& \sum_{x}q(x)\log\frac{q(x)}{p(x)} - \log{Z(\theta)} \\ &=& \mathcal{KL}\left(q\,||\,p\right) - \log{Z(\theta)} \quad , \end{eqnarray} \] then since \(\mathcal{KL}\left(q\,||\,p\right) \geq 0\) we can see that \[ \log{Z(\theta)} = \mathcal{KL}\left(q\,||\,p\right) - J(q) \geq -J(q) \quad . \] This implies that minimising \(J(q)\) — the K-L divergence from the approximate distribution \(q\) to the unnormalised posterior \(\tilde{p}\) — is equivalent to maximising a lower bound on the log-likelihood of the data.
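To see the bound in action, here is a small sketch (a toy example of my own) with an unnormalised target \(\tilde{p}\) over four discrete states, where \(\log Z\) can be computed exactly and compared against \(-J(q)\) for two choices of \(q\):

```python
import numpy as np

# Unnormalised target, and the log normalisation constant that we
# normally cannot compute for a real posterior.
p_tilde = np.array([4.0, 2.0, 1.0, 0.5])
log_Z = np.log(p_tilde.sum())

def J(q):
    """J(q) = sum_x q(x) log(q(x) / p_tilde(x)) = KL(q || p) - log Z."""
    return np.sum(q * np.log(q / p_tilde))

q_uniform = np.full(4, 0.25)        # a poor approximation
q_exact = p_tilde / p_tilde.sum()   # the exact (normalised) target

print(log_Z)          # ~2.015
print(-J(q_uniform))  # ~1.733 <= log Z: a loose lower bound
print(-J(q_exact))    # ~2.015 == log Z: the bound is tight when q = p
```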

Illustration showing that the K-L divergence is not symmetric. Here we are fitting an approximate distribution \(q\) (red) to a target distribution \(p\) (blue). Using \(\mathcal{KL}(p||q)\) leads to a \(q\) that tries to cover both modes (a). However, using \(\mathcal{KL}(q||p)\) forces \(q\) to choose one of the two modes of \(p\) (b, c). [source]

Now that we have our objective function, what distribution family is suitable to approximate a posterior density? (Or any density, for that matter?) Generally we want our distribution family to have nice properties, in that we want it to have closed-form (analytic) marginalisations, be smooth and continuous everywhere, infinitely differentiable, and computationally cheap to evaluate. For now let us assume that we use something like a normal distribution for our variational family. In principle now we have our objective function and our variational family, and there are many advantages (and some disadvantages) to using variational inference. Let's just summarise them so that you get a better picture of where we are going with this.

Advantages of variational inference

  1. Replace an expensive sampling operation with a cheap optimisation operation.
  2. Can approximate a posterior distribution very cheaply.
  3. Can interrogate (criticise) the model and update it with fast turnaround.
  4. Scales well to large data sets (that would otherwise be intractable).
  5. Enables things like Bayesian Neural Networks — with many layers — which give our neural network predictions with uncertainties, and uncertainty over our network architecture.
  6. Generalisable to many, many problems (any non-conjugate model).

Disadvantages of variational inference

  1. It's an approximate method: tends to under-estimate posterior variance and can introduce an optimisation bias.
  2. Non-convex optimisation procedure.
  3. Cannot capture correlations between model parameters.
  4. Posteriors are restricted in shape to that of the variational family.

For you, the biggest disadvantage of variational inference might be that you can't capture correlations between parameters in your posterior. Depending on your model parameterisation, that could be a big problem. There are many problems where we don't care about that though, which is the basis of the entire field of mean field variational inference. Models like latent Dirichlet allocation (topic modelling) do not care about correlations between model parameters. We don't have time to talk about mean field variational inference here, so we are going to skip to doing automatic differentiation variational inference.
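For reference, the mean field assumption is simply that the approximating distribution factorises over the latent parameters, \[ q(\vec{w}) = \prod_{j} q_j(w_j) \quad , \] which is exactly why parameter correlations are discarded: each factor \(q_j\) is optimised on its own.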

In automatic differentiation variational inference we take the variational family over the latent variables, write down the variational lower bound for the model, and then use automatic differentiation during optimisation. If we can write down the variational lower bound then we can optimise the model and obtain an approximate posterior for \(p\) in a fraction of the time it would take to sample. We still need to optimise this problem, and we can do so using stochastic optimisation (in TensorFlow, or any other automatic differentiation library), which allows us to scale our variational inference problem up to very large data sets.
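In PyMC3 this is essentially a one-line change from the Hamiltonian Monte Carlo example earlier: swap pm.sample for pm.fit with the ADVI method. A minimal sketch, assuming the neural_network model constructed above:

```python
# Minimal sketch of ADVI on the earlier PyMC3 model: optimise the
# variational lower bound instead of sampling, then draw from the
# fitted approximation as if it were a trace.
with neural_network:
    approx = pm.fit(n=30000, method=pm.ADVI())

trace = approx.sample(draws=500)  # draws from the approximate posterior
print(trace['w_in_1'].shape)      # roughly (500, n_inputs, n_hidden)
```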

The best variational inference package on the market (that I know of) is Edward, which was built on top of TensorFlow. In Edward you only need to specify your model, and there are a number of examples already available. Disclaimer: Edward is to be integrated into TensorFlow, and the current Edward version is only compatible with an older version of TensorFlow (1.2), but you can see how this code ought to work. If you want to run your own example of variational inference, check out this notebook for PyMC3.

Let's start with a very simple example.

```python
import edward as ed
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from edward.models import Normal

np.random.seed(0)

# Noisy samples of a cosine, as a toy regression data set.
x_train = np.linspace(-3, 3, num=50)
y_train = np.cos(x_train) + np.random.normal(0, 0.1, size=50)
x_train = x_train.astype(np.float32).reshape((50, 1))
y_train = y_train.astype(np.float32).reshape((50, 1))

# Priors on the weights and biases of a one-hidden-layer network.
W_0 = Normal(loc=tf.zeros([1, 2]), scale=tf.ones([1, 2]))
W_1 = Normal(loc=tf.zeros([2, 1]), scale=tf.ones([2, 1]))
b_0 = Normal(loc=tf.zeros(2), scale=tf.ones(2))
b_1 = Normal(loc=tf.zeros(1), scale=tf.ones(1))

x = x_train
y = Normal(loc=tf.matmul(tf.tanh(tf.matmul(x, W_0) + b_0), W_1) + b_1,
           scale=0.1)

# The variational family: an independent normal for every weight and bias.
qW_0 = Normal(loc=tf.get_variable("qW_0/loc", [1, 2]),
              scale=tf.nn.softplus(tf.get_variable("qW_0/scale", [1, 2])))
qW_1 = Normal(loc=tf.get_variable("qW_1/loc", [2, 1]),
              scale=tf.nn.softplus(tf.get_variable("qW_1/scale", [2, 1])))
qb_0 = Normal(loc=tf.get_variable("qb_0/loc", [2]),
              scale=tf.nn.softplus(tf.get_variable("qb_0/scale", [2])))
qb_1 = Normal(loc=tf.get_variable("qb_1/loc", [1]),
              scale=tf.nn.softplus(tf.get_variable("qb_1/scale", [1])))

# Now run inference.
inference = ed.KLqp({W_0: qW_0, b_0: qb_0,
                     W_1: qW_1, b_1: qb_1},
                    data={y: y_train})
inference.run(n_iter=1000)
```

Here you can see that we have specified priors on all of the network weights, with the parameters of the approximating distributions stored as TensorFlow variables. Then, instead of running any sampling, we use the ed.KLqp inference module, which uses the K-L divergence as the measure of how close the approximate distribution is to the posterior distribution. That's it!
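Once inference has run, we can draw weight samples from the fitted variational distributions and push inputs through the network to get a spread of predictions. A rough sketch of my own, assuming the variables defined above:

```python
# Rough sketch: propagate a grid of test inputs through the network using
# several draws from the approximate weight posteriors, so the spread of
# curves reflects our uncertainty in the weights.
x_test = np.linspace(-5, 5, num=200, dtype=np.float32).reshape((-1, 1))
y_draws = tf.stack([
    tf.matmul(tf.tanh(tf.matmul(x_test, qW_0.sample()) + qb_0.sample()),
              qW_1.sample()) + qb_1.sample()
    for _ in range(50)])

sess = ed.get_session()
y_draws = sess.run(y_draws)  # shape (50, 200, 1): 50 plausible curves
```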

Here is an example of how we might construct a variational auto-encoder (using MNIST data) with Edward, allowing us to get an approximate posterior on the latent variables of the model! This example is derived from the Edward examples.

```python
import os

import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Bernoulli, Normal
from edward.util import Progbar
from observations import mnist
from scipy.misc import imsave

tf.flags.DEFINE_string("data_dir", default="/tmp/data", help="")
tf.flags.DEFINE_string("out_dir", default="/tmp/out", help="")
tf.flags.DEFINE_integer("M", default=100, help="Batch size during training.")
tf.flags.DEFINE_integer("d", default=2, help="Latent dimension.")
tf.flags.DEFINE_integer("n_epoch", default=100, help="")

FLAGS = tf.flags.FLAGS
if not os.path.exists(FLAGS.out_dir):
    os.makedirs(FLAGS.out_dir)


def generator(array, batch_size):
    """Generate batch with respect to array's first axis."""
    start = 0  # pointer to where we are in iteration
    while True:
        stop = start + batch_size
        diff = stop - array.shape[0]
        if diff <= 0:
            batch = array[start:stop]
            start += batch_size
        else:
            batch = np.concatenate((array[start:], array[:diff]))
            start = diff
        batch = batch.astype(np.float32) / 255.0  # normalize pixel intensities
        batch = np.random.binomial(1, batch)  # binarize images
        yield batch


def main(_):
    ed.set_seed(42)

    # DATA. MNIST batches are fed at training time.
    (x_train, _), (x_test, _) = mnist(FLAGS.data_dir)
    x_train_generator = generator(x_train, FLAGS.M)

    # MODEL
    # Define a subgraph of the full model,
    # corresponding to a minibatch of size M.
    z = Normal(loc=tf.zeros([FLAGS.M, FLAGS.d]),
               scale=tf.ones([FLAGS.M, FLAGS.d]))
    hidden = tf.layers.dense(z, 256, activation=tf.nn.relu)
    x = Bernoulli(logits=tf.layers.dense(hidden, 28 * 28))

    # INFERENCE
    # Define a subgraph of the variational model,
    # corresponding to a minibatch of size M.
    x_ph = tf.placeholder(tf.int32, [FLAGS.M, 28 * 28])
    hidden = tf.layers.dense(tf.cast(x_ph, tf.float32), 256,
                             activation=tf.nn.relu)
    qz = Normal(loc=tf.layers.dense(hidden, FLAGS.d),
                scale=tf.layers.dense(hidden, FLAGS.d,
                                      activation=tf.nn.softplus))

    # Bind p(x, z) and q(z | x) to the same TensorFlow placeholder for x.
    inference = ed.KLqp({z: qz}, data={x: x_ph})
    optimizer = tf.train.RMSPropOptimizer(0.01, epsilon=1.0)
    inference.initialize(optimizer=optimizer)

    tf.global_variables_initializer().run()

    n_iter_per_epoch = x_train.shape[0] // FLAGS.M
    for epoch in range(1, FLAGS.n_epoch + 1):
        print("Epoch: {0}".format(epoch))
        avg_loss = 0.0

        pbar = Progbar(n_iter_per_epoch)
        for t in range(1, n_iter_per_epoch + 1):
            pbar.update(t)
            x_batch = next(x_train_generator)
            info_dict = inference.update(feed_dict={x_ph: x_batch})
            avg_loss += info_dict['loss']

        # Print a lower bound to the average marginal
        # likelihood for an image.
        avg_loss /= n_iter_per_epoch
        avg_loss /= FLAGS.M
        print("-log p(x) <= {:0.3f}".format(avg_loss))

        # Prior predictive check.
        images = x.eval()
        for m in range(FLAGS.M):
            imsave(os.path.join(FLAGS.out_dir, '%d.png') % m,
                   images[m].reshape(28, 28))


if __name__ == "__main__":
    tf.app.run()
```

Summary

Machine learning — specifically neural networks — can be very useful for making predictions by learning non-linear mappings from data; that is why they are called universal interpolators. But frequently we also care about things that were not in the training set, and what predictions might arise for those kinds of objects. A standard neural network will always give us a prediction, even if the object was from outside the training set. And it can give us that prediction with a lot of confidence! Obviously this just tells us that the notion of confidence in the output of a neural network is not akin to what we mean when we think of probability. Instead it is a measure of relative belief that is conditioned upon the network, the training set, and a thousand decisions that you have made.

For all of these reasons we want our neural networks (in science) to report some degree of belief in the predictions they make, and to recognise when inputs are unusual compared to the training set. Bayesian Neural Networks are one way of doing this. Another, more flexible and powerful tool is variational inference. Variational inference lets us turn a sampling problem (expensive) into an optimisation problem (cheap) by choosing a family of distributions that can approximate the posterior. Variational inference can be used in lots of different fields of machine learning (see the Edward code example gallery) to make methods more efficient, or even tractable!

In our next (last) class, we will talk briefly about Recurrent Neural Networks and their applications within physics, before expanding on causality (or interpretability) in machine learning.

 
