Model comparison and decision making

Data analysis and machine learning

All models are wrong. Some are useful.

Statistics will never tell you whether the model you have is the true model or not. As a Bayesian all you can do is compare models.

Comparing maximum likelihoods

Consider the straight line model that we adopted last week. In the third class we found a model that seemed satisfactory. Most of the data were consistent with being generated from a straight line model, and a few of the data points were consistent with being outliers: drawn from some other distribution. How do we know that model was any good? We could use a model that is just as "simple" in that it has the same (or fewer!) parameters. Suppose that we could then calculate the log likelihood for our new model, and the log likelihood for our simpler model is higher than the maximum likelihood we ever found for our straight line model. Is the new model preferred? What is that model?

Just because a model has a higher log likelihood doesn't mean that model is necessarily better. A more complex model has more degrees of freedom, and so it may always fit the data better. And a simpler model isn't necessarily better either because it may make very poor predictions elsewhere: it is over-fitting the data.

$ \newcommand\vectheta{\boldsymbol{\theta}} \newcommand\vecdata{\mathbf{y}} \newcommand\vecx{\boldsymbol{x}} \newcommand\model{\boldsymbol{\mathcal{M}}} $

(Re-)Introducing the "evidence"

What we really want is to do is to compare the posteriors of two models. That is we want to compare the model posteriors $p(\model_1|\vecdata)$ and $p(\model_2|\vecdata)$. From Bayes' theorem we can describe the model posterior $p(\model|\vecdata)$ as \[ p(\model|\vecdata) = p(\vecdata|\model)\frac{p(\model)}{p(\vecdata)} \quad . \] Now recall from the definition of conditional probability \[ p(\phi|\psi) = \int p(\phi|\theta,\psi)p(\theta|\psi) d\theta \] we can re-write $p(\vecdata|\model)$ as \[ p(\vecdata|\model) = \int_\vectheta p(\vecdata|\vectheta,\model)p(\vectheta|\model) d\vectheta \] which you will recall is exactly the (unnormalised) integral we have been trying to approximate using MCMC. The term $p(\model)$ is our prior belief for the model $\model$, which may be uninformative. The remaining term $p(\vecdata)$ is the prior probability of observing the data without reference to any model. That is frightening to think about let alone calculate it, so we will avoid it by computing the odds ratio between two alternative models and force $p(\vecdata)$ to cancel out \[ O_{21} \equiv \frac{p(\model_2|\vecdata)}{p(\model_1|\vecdata)} = \frac{p(\vecdata|\model_2)}{p(\vecdata|\model_1)}\frac{p(\model_2)}{p(\model_1)} \] \[ O_{21} \equiv \frac{p(\model_2)}{p(\model_1)}\frac{\int_{\vectheta_2}p(\vecdata|\vectheta_2,\model_2)p(\vectheta_2|\model_2)d\vectheta_2}{\int_{\vectheta_1}p(\vecdata|\vectheta_1,\model_1)p(\vectheta_1|\model_1)d\vectheta_1} \] Now this is tractable.In some cases. We just need to integrate over the posterior for each model,Sounds so easy, right? and come up with a prior odds for each model. If you have no strong prior belief for one model over another the odds ratio is often left as unity, reducing the odds ratio to \[ O_{21} = \frac{p(\vecdata|\model_2)}{p(\vecdata|\model_1)} = \frac{\int_{\vectheta_2}p(\vecdata|\vectheta_2,\model_2)p(\vectheta_2|\model_2)d\vectheta_2}{\int_{\vectheta_1}p(\vecdata|\vectheta_1,\model_1)p(\vectheta_1|\model_1)d\vectheta_1} \] which people often refer to as the Bayes factor.

If you are still feeling optimistic about Bayesian model selection, that feeling is about to disappear. The integral we need to compute is one over the entire parameter space of each model. That can be extremely computationally intensive, even if the model only has a handful of parameters.

Let's code up an example. First we will generate some data from a polynomial and add Gaussian errors to the observations \[ y \sim \mathcal{N}\left(\theta_0 + \theta_{1}x + \theta_{2}x^2, \sigma_y\right) \] In code, this looks something like: import numpy as np import matplotlib.pyplot as plt from matplotlib.ticker import MaxNLocator np.random.seed(1) # Some MAGIC numbers N = 20 x_scale = 100 y_err_intrinsic = 1.75 theta_scales = np.logspace(-3, 1, 3) theta = theta_scales * np.random.uniform(size=3) x = x_scale * np.random.uniform(size=N) x = np.sort(x) y_true = np.polyval(theta, x) # Add noise in the y-direction. y_err = y_err_intrinsic * np.random.randn(N) y = y_true + np.random.randn(N) * y_err y_err = np.abs(y_err) # We will make a bunch of scatter plots here, # so let's not duplicate code. def scatterplot(x, y, y_err): fig, ax = plt.subplots(figsize=(4, 4)) ax.scatter(x, y, c="k", s=10) if y_err is not None: ax.errorbar(x, y, yerr=y_err, fmt="o", lw=1, c="k") ax.set_xlabel(r"$x$") ax.set_ylabel(r"$y$") ax.set_xlim(0, 100) ax.set_ylim(-5, 20) ax.xaxis.set_major_locator(MaxNLocator(6)) ax.yaxis.set_major_locator(MaxNLocator(6)) fig.tight_layout() return (fig, ax) # Draw the data without noise. fig, ax = scatterplot(x, y_true, y_err=None) Here's what the data look like without any noise, where you can see the curvature of the polynomial:

Now let's plot the data with errors added in the $y$-direction: # Draw the data with noise. fig, ax = scatterplot(x, y, y_err)

Now imagine if someone gave you this data — without knowing how they were generated — and you were asked:

Are these data better described by a straight line or a polynomial?

Let's calculate the Bayes factor to find out. First we will need to build two models (a straight line model, and a polynomial model), sample the posteriors in each model, and then integrate those posteriors. You could use the Hamiltonian Monte Carlo sampler that you wrote as part of your assignment to sample both models, but normally it is a good idea to build your own sampler to understand how it works (including tuning, diagnostics, etc), and then use a well-engineered implementation for Science™. Here we will use pyMC, a package for doing probablistic programming-esque in Python.I use -esque because Python is not a "probabilistic programming" language. Instead pyMC uses the Theano deep learning package to analytically compute gradients, allowing you to use gradient-based MCMC algorithms (and other things) and letting you focus on the model building. Other probabilistic programming languages exist (e.g., Stan and PyStan) and are probably better for specific problems.

Let's build the straight line model first: import pymc3 as pm # The parallel sampling code in PyMC can # cause issues for some installations. cores = 1 samples = 1000 with pm.Model() as line_model: theta_0 = pm.Normal('theta0', mu=0, sd=100) theta_1 = pm.Normal('theta1', mu=0, sd=100) mu = theta_0 + theta_1 * x po = pm.Normal( 'y', mu=mu, sd=y_err, observed=y ) start = pm.find_MAP() line_trace = pm.sample( samples, start=start, cores=cores )

Ignore the priors we have put on $\theta_0$ and $\theta_1$: they are intentionally set to be extremely broad and to have no impact on our inferences. Let's assume that the chains have convergedAt our peril! Only because this is an instructive exercise!and visualise the posteriors by making a corner plot: # Make a corner plot. from corner import corner line_chains = np.vstack([line_trace.get_values(pn) \ for pn in line_trace.varnames]).T fig = corner(line_chains)

And now let's draw samples from the posterior and make posterior predictions in the data space: fig, ax = scatterplot(x, y, y_err) xi = np.linspace(x.min(), x.max(), 100) for idx in np.random.choice(samples, 100): p = [line_trace["theta1"][idx], line_trace["theta0"][idx]] ax.plot(xi, np.polyval(p, xi), c="#2ca02c", alpha=0.1)

Looks reasonable to me. What do you think? If you didn't know that the data were truely generated by polynomial model, you might conclude from this figure that the data were drawn from a straight line.

Let's build the polynomial model in pyMC: with pm.Model() as poly_model: theta_0 = pm.Normal('theta0', mu=0, sd=100) theta_1 = pm.Normal('theta1', mu=0, sd=100) theta_2 = pm.Normal('theta2', mu=0, sd=100) mu = theta_0 + theta_1 * x + theta_2 * x**2 po = pm.Normal( 'y', mu=mu, sd=y_err, observed=y ) start = pm.find_MAP() poly_trace = pm.sample( samples, start=start, cores=cores ) And make a corner plot of the posterior samples. Here we will mark the true values in blue, even though in the Real World™ we won't ever know the true values! poly_chains = np.vstack([poly_trace.get_values(pn) \ for pn in poly_trace.varnames]).T fig = corner( poly_chains, truths=theta[::-1], labels=[r"$\theta_0$", r"$\theta_1$", r"$\theta_2$"] )

Make a posterior predictive plot, and compare it to that of the straight line model: fig, ax = scatterplot(x, y, y_err) xi = np.linspace(x.min(), x.max(), 100) for idx in np.random.choice(samples, 100): p = [poly_trace[f"theta{i}"][idx] for i in (2, 1, 0)] ax.plot(xi, np.polyval(p, xi), c="#ff7f0e", alpha=0.1)

Shrug. They both kind of look good to me. What do you think?

Let's call the straight line model $\model_1$ and the polynomial model $\model_2$, and calculate the Bayes factor $\textrm{BF}_{21}$ \[ \textrm{BF}_{21} = \frac{p(\vecdata|\model_2)}{p(\vecdata|\model_1)} = \frac{\int_{\vectheta_2}p(\vecdata|\vectheta_2,\model_2)p(\vectheta_2|\model_2)d\vectheta_2}{\int_{\vectheta_1}p(\vecdata|\vectheta_1,\model_1)p(\vectheta_1|\model_1)d\vectheta_1} \] which means we must integrate over the posteriors in both model. We will do this numerically. Do the straight line model first. Note that even if we do this numerically, we still have to decide on the limits of the integration. Thankfully we can use the limits of the MCMC draws as the limits of the integration, since we are assuming that the MCMC chains have converged and that the samples drawn are representative of the posterior. # Integrate over the posterior samples. import scipy.stats from scipy import integrate # For speed purposes we will write a faster probability function # (Note: why is this wrong!?) def p_1(theta0, theta1): return np.exp(np.sum(scipy.stats.norm.logpdf( theta0 + theta1 * x, loc=y, scale=y_err ))) limits = list(np.array([ np.min(line_chains, axis=0), np.max(line_chains, axis=0) ]).T) Z_line, Z_line_err = integrate.nquad(p_1, limits) print(f"Z_line: {Z_line:.1e} (+/- {Z_line_err:.1e})")

And now do the polynomial model: # Why is this wrong!? def p_2(theta0, theta1, theta2): return np.exp(np.sum(scipy.stats.norm.logpdf( np.polyval([theta0, theta1, theta2][::-1], x), loc=y, scale=y_err ))) limits = list(np.array([ np.min(poly_chains, axis=0), np.max(poly_chains, axis=0) ]).T) # The next line could be expensive! Z_poly, Z_poly_err = integrate.nquad(p_2, limits) print(f"Z_poly: {Z_poly:.1e} (+/- {Z_poly_err:.1e})")

From sampling both models and integrating over the posterior samples, you should see output something like this: logp = -40.469, ||grad|| = 1.1614e-11: 100%|█████████████████████████████████████████| 12/12 [00:00<00:00, 1488.97it/s] Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Sequential sampling (2 chains in 1 job) NUTS: [theta1, theta0] Sampling chain 0, 0 divergences: 100%|████████████████████████████████████████████| 1500/1500 [00:05<00:00, 293.62it/s] Sampling chain 1, 0 divergences: 100%|████████████████████████████████████████████| 1500/1500 [00:02<00:00, 525.78it/s] logp = -100.59, ||grad|| = 1,825.1: 100%|████████████████████████████████████████████| 20/20 [00:00<00:00, 1878.33it/s] Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Sequential sampling (2 chains in 1 job) NUTS: [theta2, theta1, theta0] Sampling chain 0, 0 divergences: 100%|█████████████████████████████████████████████| 1500/1500 [00:15<00:00, 98.92it/s] Sampling chain 1, 0 divergences: 100%|████████████████████████████████████████████| 1500/1500 [00:09<00:00, 158.04it/s] Z_line: 1.9e-16 (+/- 2.2e-15) Z_poly: 3.1e-17 (+/- 9.3e-12) Bayes factor is: 0.2

Whoa wait a minute! We know that the data were drawn from the polynomial model, and our posterior samples of the polynomial model are pretty much centered on the true values. So clearly our model fitting is working correctly. But the Bayes factor for the true polynomial model is $\approx\frac{1}{5}$, which implies that the straight line model is favoured by about fives times more than the true polynomial model.

Why is that? Did we do something wrong? When do you think the true polynomial model would be favoured over the straight line model?

<discuss>

Change the MAGIC numbers (as discussed) and re-run the code. How do your conclusions change? How much cheaper or more expensive do your integrals become?

Bayesian Information Criterion

Since the integrals above can be computationally intractable for realistic problems, many have sought to find alternatives. In practice these are all essentially approximations of the evidence for a model, in that the evidence (or normalisation constant) is what you need to compare models, but it's intractable.

The most commonly employed heuristic is that of the Bayesian Information Criterion, \[ \textrm{BIC} = D\log{N} - 2\log\hat{\mathcal{\mathcal{L}}}(\vecdata|\hat{\vectheta}) \] where $D$ is the number of model parameters, $N$ is the number of data points, and $\hat{\mathcal{L}}$ is the maximum log likelihood found for that model. For example, if you had two models $\model_1$ and $\model_2$, with $D_1$ and $D_2$ model parameters, where each are fit to the same data $\vecdata$ (which is of size $N$), then you will calculate two BIC values \[ \textrm{BIC}_1 = D_1\log{N} - 2\log\hat{\mathcal{\mathcal{L}}_1}(\vecdata|\hat{\vectheta}_1) \] \[ \textrm{BIC}_2 = D_2\log{N} - 2\log\hat{\mathcal{\mathcal{L}}_2}(\vecdata|\hat{\vectheta}_2) \] and you may conclude to prefer the model with the lowest BIC.

Don't be fooled by the name: there is very little Bayes' in this criterion. Under certain conditions it will reduce to something that well-approximates the evidence, but those conditions are rarely met.

The logic of the BIC is that you want a model that fits the data well, with as few parameters as possible. Naively that makes sense: we should restrict the number of parameters because that increases the flexibility and the degrees of freedom. But in practice adding a single parameter tells us very little about how the curvature of the posterior changes. The new parameter could simply be added to the existing model, in which case it makes no substantive change to the log posterior surface. Alternatively it may be introduced as an exoponent on some other model parameter, in which case it will absolutely change the curvature of the posterior. The BIC does not account for those two differences: it only counts numbers of model parameters.

Other information-theoretic approaches do exist, including those that account for such curvature (through the expected Fisher information matrix), but they are outside the scope of this course.

Bayesian decision analysis

Often we are faced with the situation where we don't necessarily need to compare models (over their parameter space), but we need to make decisions. Each decision has some associated utility (e.g., what do we want to achieve) and some associated risk, or risk probabilities for different events. And often the utility of one individual will differ from that of the population, even if their ultimate goals are the same. Let's explore some examples.

There may not be time for this topic. If so it will be done ad-hoc in class with notes (potentially) added later.

← Previous class
Markov chain Monte Carlo sampling

Next class →
Gaussian process models