Exploring Model Expansion Interactively

The following notes attempt to explain the main ideas of my paper "Identifiability and Falsifiability: Two Challenges for Bayesian Model Expansion" using interactive plots and an emphasis on intuition. When I began this project, I wanted to understand the methodological consequences of model expansion.

model expansion

The process of adding components to a statistical model, usually increasing its flexibility to describe or generate different kinds of data. E.g. adding an interaction term to a regression to capture variation in the response variable by subgroup.

Model expansion is often motivated by observing some feature of our observed data (e.g. as quantified by a test statistic) which cannot be reproduced in simulated data generated from our current model. This sort of model expansion is thus a natural component, explicitly or implicitly, of many statistical modeling workflows (see Gelman et al. 2020; Box's loop in Blei 2014; and Betancourt 2020). In practice, expansion is usually achieved by adding new parameters to the base model. Some examples are given in the following table.

data behavior | new parameters
varying effect by group | interaction coefficients
individual-level effects | random intercepts & their population scale
temporally correlated error | autocorrelation coefficient(s)
variance > mean (positive data) | overdispersion parameter
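As a concrete instance of the last row, a Poisson base model can be expanded to a negative binomial model by adding an overdispersion parameter. This is only an illustrative sketch (not an example from the paper), with $\phi$ denoting the new parameter in the common "NB2" parameterization:

$$
y_i \sim \mathrm{Poisson}(\lambda_i)
\quad\longrightarrow\quad
y_i \sim \mathrm{NegBinomial}(\lambda_i, \phi),
\qquad
\mathrm{Var}(y_i) = \lambda_i + \frac{\lambda_i^2}{\phi},
$$

so the expanded model can reproduce variance exceeding the mean, and the base model is recovered in the limit $\phi \to \infty$.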

In particular, expanding models in this way necessitates increasing the dimension of the parameter space. We should then ask what consequences this increasing dimension might have for our statistical methods. From a computational vantage point, it is well-understood how high dimensionality can hurt us (through curses of dimensionality) and how we can sometimes regain control in these settings (through high dimensional regularities like concentration of measure). As a result, computational methods (e.g. the Hamiltonian Monte Carlo sampler in Stan) have been carefully developed to handle these situations.

But what about our (seemingly) simpler tools for Bayesian inference and model checking? Take for example the default summary output from Stan, consisting of marginal posterior means and standard deviations. The practical usefulness of this output depends on the standard deviations being not too large, or our inferences may be too uncertain to support substantively interesting conclusions. Similarly, the usefulness of the posterior predictive $p$-value for model checking depends on its power under nearby alternatives to our current model.

The primary goal of this project is to understand how such properties (marginal uncertainty, test power) scale under the process of model expansion. It is almost certainly impossible to give a fully general answer, as the relevance of any theoretical analysis will depend on how faithfully the assumptions of that analysis map onto whatever applied modeling scenario we find ourselves in. With these necessary limitations in mind, my main argument is that model expansion tends to increase our uncertainty about parameters and decrease our power to detect misfit.

A Simple Regression

The main ideas of this argument can be presented in an extremely simple regression example. Imagine that we begin with a simple linear regression of a single predictor:

$$
y_i = \beta_1 x_{1i} + \epsilon_i,
$$

where $\epsilon_i \sim \mathrm{N}(0, \sigma^2)$, $\beta_1 \sim \mathrm{N}(0, \tau^2)$, and $\tau$ is taken large enough that our prior is weakly informative for the coefficient. Then, imagine that we expand this regression by adding a second predictor $x_2$ with coefficient $\beta_2$, independent of $\beta_1$ and with an identical prior:

$$
y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i.
$$
The following interactive plots examine the behavior of the expanded model under different assumptions about the correlation between $x_1$ and $x_2$. All distributions are recentered at 0 to make for easier comparisons of their scales. The left panel plots the marginal posterior against the marginal prior for $\beta_1$. The right panel plots the sampling distribution for an independent replication/resample $\tilde{y}$ evaluated at the posterior mean of the parameters, i.e. $p(\tilde{y} \mid \hat{\beta})$, against the posterior predictive distribution of $\tilde{y}$, i.e. $p(\tilde{y} \mid y)$.

[Interactive figure: marginal prior vs. posterior for $\beta_1$ (left) and sampling distribution at the posterior mean vs. posterior predictive distribution of $\tilde{y}$ (right). Slider: predictor correlation, default 0 (same as base model).]

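For readers who want to experiment outside the interactive plot, the following sketch reproduces the left-panel comparison under assumed settings: a conjugate normal regression with known noise scale and independent $\mathrm{N}(0, \tau^2)$ priors on the coefficients. The sample size, scales, and grid of correlations are illustrative choices, not the values used above.

```python
# A minimal sketch (assumed settings): compare the marginal posterior sd of
# beta_1 in the base (one-predictor) and expanded (two-predictor) conjugate
# normal regressions as the predictor correlation varies. The marginal prior
# sd of beta_1 is simply tau in both models.
import numpy as np

rng = np.random.default_rng(0)
n, sigma, tau = 100, 1.0, 10.0   # sample size, noise scale, prior scale (illustrative)

def posterior_sd_beta1(X):
    """Marginal posterior sd of the first coefficient under the conjugate model."""
    precision = X.T @ X / sigma**2 + np.eye(X.shape[1]) / tau**2
    return np.sqrt(np.linalg.inv(precision)[0, 0])

z1, z2 = rng.standard_normal(n), rng.standard_normal(n)
print(f"prior sd of beta_1: {tau}")
print("rho   base      expanded")
for rho in [0.0, 0.5, 0.9, 0.99]:
    x1 = z1
    x2 = rho * z1 + np.sqrt(1 - rho**2) * z2   # predictor correlated ~rho with x1
    base = posterior_sd_beta1(x1[:, None])
    expanded = posterior_sd_beta1(np.column_stack([x1, x2]))
    print(f"{rho:4.2f}  {base:.4f}    {expanded:.4f}")
```

With (near-)orthogonal predictors the expanded model's marginal posterior for $\beta_1$ essentially matches the base model's, and it widens as the predictors become more correlated.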
The comparison in the left panel is clearly related to our ability to make useful inferences about $\beta_1$, since a posterior that is much narrower than the prior represents a greater information gain from observing the data. It is likely less clear how the comparison in the right panel relates to the power of model diagnostics. Such a connection can be argued formally, but here we will focus on intuition. Note that if the sampling distribution had no dependence on the parameters, then the two distributions in the right panel would be equal and we would have no uncertainty about the true sampling distribution (assuming model correctness). Likewise, dissimilarity between the distributions in the right panel indicates greater posterior uncertainty about the true sampling distribution for $\tilde{y}$. When this uncertainty is large, the model contains many different plausible descriptions of the data generating process. Thus, in any check of the model, it gets that many different attempts to explain the observed data, which decreases the power of the check by making it easier for the model to pass.

Thus, for purposes of parameter identification, we want the two distributions in the left panel to be as dissimilar as possible; and for purposes of model checking, we want the two distributions in the right panel to be as similar as possible. Varying the predictor correlation from 0 to 1, you can see that the best case for both of these goals recovers the relationships between the plotted distributions that exist in the base model. At all other correlations, the relationships between the plotted distributions are worse than in the base model. Furthermore, the best cases for each panel occur at opposite extremes: zero correlation is best for parameter identification and total correlation is best for model checking. Thus, in this simple case, model expansion degrades the performance of our basic inference and evaluation tools, and expansions which limit the damage to inference tend to more severely damage evaluation, and vice versa.

Parameter Uncertainty and Model Expansion

Expansion will not inevitably degrade inference and evaluation as it does in this example, but the general tendency remains. To see this generalization, we must first give a more formal definition of model expansion.

model expansion (formal)

Let $p(y, \theta)$ be a joint model of data $y$ and parameters $\theta$, and let $q(y, \theta, \gamma)$ be a model defined with additional parameters $\gamma$. Then we say that $q$ is an expansion of $p$ if there is some value $\gamma_0$ such that

$$
q(y, \theta \mid \gamma = \gamma_0) = p(y, \theta).
$$

If $\gamma_{0j} = \pm\infty$ for some component $j$, then the above equality must be understood in terms of the appropriate limits.

Intuitively, this definition just requires that an expanded model embed the base model as a special case. This gives us a well-defined way to talk about a base and expanded model as sharing some parameters (i.e. $\theta$ in the above definition), allowing us to draw direct comparisons with respect to these parameters. Furthermore, all of the examples of model expansion that were given in the table above can be seen to fit this formal definition.
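For instance, in the notation of the definition, the regression example above is an expansion with $\gamma = \beta_2$ and $\gamma_0 = 0$:

$$
q(y, \beta_1 \mid \beta_2 = 0) = p(y, \beta_1),
$$

while the random-intercepts example recovers its base model only in the limit as the population scale tends to 0, which is where the limiting case of the definition is needed.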

It follows immediately from the definition that the posterior distribution of an expanded model also includes the posterior of the base model as a special case, again just by conditioning on $\gamma = \gamma_0$. With this in mind, we can see how expanded models tend to exhibit increased uncertainty for the parameters they share with the base model. The following interactive figure demonstrates this idea. The top panel displays a joint posterior distribution between the new parameters $\gamma$ (y axis) and the shared parameters $\theta$ (x axis). The controls allow adjusting both the dependence of the conditional means $\mathrm{E}[\theta \mid \gamma, y]$ on $\gamma$ and the marginal posterior $p(\gamma \mid y)$. The base model's posterior distribution is obtained by conditioning on $\gamma = \gamma_0$. This distribution is held constant across the various attainable joint posterior distributions, so that all such joint distributions represent expansions of the same base model. Finally, the bottom panel displays the marginal posterior distributions over $\theta$ for the base and expanded models, and the heatmap next to the controls represents the difference in uncertainty between these two distributions. Blue values represent cases where the expanded model has greater uncertainty, and red values represent cases where the base model has greater uncertainty (with uncertainty measured by the information entropy of the marginal distributions).

[Interactive figure: joint posterior of $(\theta, \gamma)$ (top) and marginal posteriors of $\theta$ under the base and expanded models (bottom). Sliders: mean dependence and joint spread; the heatmap shows the resulting uncertainty difference.]

As is apparent from examining the changes in the marginal posterior of $\theta$ for different configurations of the joint posterior (or, equivalently, from the heatmap), most of the possible expanded models exhibit greater uncertainty about $\theta$ than the base model. Furthermore, this visualization shows us why we should expect this to be the case in general. Suppose we shift the marginal posterior of $\gamma$ towards values for which the conditional posteriors have greater confidence (i.e. we move up on the heatmap). Even for such marginal posteriors, we can still easily end up with greater marginal uncertainty about $\theta$, since the aforementioned conditional posteriors need not place the majority of their probability over the same intervals. Visually, this is seen by then increasing the dependence of the conditional expectations on $\gamma$ (i.e. by moving to the right on the heatmap).
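The same mechanism can be seen numerically through the law of total variance. Here is a minimal sketch under assumed settings: a joint Gaussian posterior with constant conditional spread $s$ and conditional mean $\mathrm{E}[\theta \mid \gamma] = b\gamma$; the values of $s$, the spread of $\gamma$, and the grid of $b$ values are all illustrative.

```python
# A minimal sketch (assumed setup, not the paper's): joint Gaussian posterior
# over (theta, gamma) with conditional mean E[theta | gamma] = b * gamma and
# constant conditional sd s. The base posterior is the conditional at
# gamma = gamma0 = 0; the expanded posterior is the marginal over theta.
import numpy as np

def entropy_of_normal(sd):
    """Differential entropy of a normal distribution with the given sd."""
    return 0.5 * np.log(2 * np.pi * np.e * sd**2)

s = 1.0            # conditional sd of theta given gamma (shared with the base model)
gamma_sd = 1.0     # marginal posterior sd of gamma
for b in [0.0, 0.5, 1.0, 2.0]:   # "mean dependence" of E[theta | gamma] on gamma
    base_sd = s                                       # base posterior: conditional at gamma0
    expanded_sd = np.sqrt(s**2 + (b * gamma_sd)**2)   # law of total variance
    print(f"b={b:3.1f}  base entropy={entropy_of_normal(base_sd):+.3f}  "
          f"expanded entropy={entropy_of_normal(expanded_sd):+.3f}")
```

When the conditional spread is held fixed, the expanded (marginal) entropy can never fall below the base (conditional) entropy, and it grows with the mean dependence $b$; reductions are only possible when the conditional spread itself varies with $\gamma$.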

Sampling Distribution Uncertainty and Model Expansion

Another simple example illustrates why model uncertainty about the (true) sampling distribution tends to increase through model expansion. For our base model, we assume the components of $y$ are sampled i.i.d. from a normal population with known scale:

$$
y_i \mid \mu \sim \mathrm{N}(\mu, \sigma_0^2), \qquad \sigma_0 \text{ known}.
$$

We then expand this model by allowing the population scale to be unknown as well, arriving at the specification:

$$
y_i \mid \mu, \sigma \sim \mathrm{N}(\mu, \sigma^2), \qquad (\mu, 1/\sigma^2) \sim \text{Normal-Gamma}.
$$

The prior is the Normal-Gamma prior, which is conjugate for this problem. The marginal prior on $\mu$ matches the prior on $\mu$ in the base model, and the prior also fixes the prior mean of the population variance $\sigma^2$. The (average) accuracy with which we can estimate the unknown mean $\mu$ is controlled by the average "noise level", determined by this prior mean of $\sigma^2$ and the sample size $n$ (i.e. the dimension of $y$). With this setup, we can compare the posterior uncertainty about the true sampling distribution between the base and expanded models. The following figure plots the change in this uncertainty from base to expanded model against the sample size. The average noise level can be adjusted by moving the slider.
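For concreteness, one standard parameterization of the Normal-Gamma prior mentioned above (not necessarily the one used to generate the figure) is

$$
\lambda = 1/\sigma^2 \sim \mathrm{Gamma}(a_0, b_0),
\qquad
\mu \mid \lambda \sim \mathrm{N}\!\left(m_0, \tfrac{1}{\kappa_0 \lambda}\right),
$$

under which the prior mean of the population variance is $\mathrm{E}[\sigma^2] = b_0/(a_0 - 1)$ (for $a_0 > 1$) and the marginal prior on $\mu$ is a Student-$t$ centered at $m_0$.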

[Interactive figure: change in sampling-distribution uncertainty from base to expanded model, plotted against sample size. Slider: average noise level, default 0.1.]

Two features are immediately apparent from the plot. First, the difference tends to increase with the sample size, such that it is very difficult to achieve a negative difference for larger sample sizes. Second, to achieve negative differences for even moderate sample sizes, the average noise level must be substantially larger than 1. We can make sense of this by dividing the change in sampling distribution uncertainty into two parts. On one hand, for any fixed distribution over the population mean $\mu$, increasing the typical size of $\sigma$ will "blur" the plausible sampling distributions together (as the differences in their means will be swamped by their larger variance). This makes the plausible sampling distributions more similar and puts downward pressure on the plotted uncertainty difference. Symmetrically, this effect would put upward pressure on the uncertainty difference if the typical size of $\sigma$ were to decrease.

On the other hand, including $\sigma$ as a parameter creates an additional degree of freedom in terms of which the sampling distributions may vary. In other words, for any value of $\mu$, there was only one possible sampling distribution in the base model, but there is a potentially wide variety of possibilities in the expanded model, depending on the value of $\sigma$. This additional degree of freedom places upward pressure on the uncertainty difference.
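Both effects can be made concrete with the closed-form KL divergence between normal sampling distributions, used here as a simple stand-in for dissimilarity between plausible sampling distributions (the uncertainty measure behind the plot above may differ). The numbers below are purely illustrative.

```python
# A minimal sketch of the two effects described above, using the closed-form
# KL divergence between normal distributions as a measure of how easily two
# plausible sampling distributions can be told apart.
import numpy as np

def kl_normal(mu1, sd1, mu2, sd2):
    """KL( N(mu1, sd1^2) || N(mu2, sd2^2) )."""
    return np.log(sd2 / sd1) + (sd1**2 + (mu1 - mu2)**2) / (2 * sd2**2) - 0.5

# "Blur" effect: for a fixed difference in plausible means, a larger scale
# makes the corresponding sampling distributions harder to tell apart.
for sd in [0.5, 1.0, 2.0]:
    print(f"scale {sd}: KL between means 0 and 1 = {kl_normal(0, sd, 1, sd):.3f}")

# Extra degree of freedom: once the scale is unknown, sampling distributions
# can differ even when their means agree exactly.
print(f"same mean, scales 1 vs 2: KL = {kl_normal(0, 1, 0, 2):.3f}")
```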

It is this effect of the additional degree of freedom that accounts for the fact that the uncertainty difference is almost always positive in this plot. Adjusting the slider, it is easy to see that the noise level must be much higher than 1 (the noise level of the base model) in order to offset this additional degree of freedom and push the uncertainty difference into negative territory. Furthermore, it is this addition of degrees of freedom which creates the tendency for sampling distribution uncertainty to increase in general.

An Uncertainty Tradeoff in General

In the last two examples, I have attempted to convey intuition for why uncertainty about both the parameters and the sampling distribution tends to increase through model expansion. However, in both cases, the increase in uncertainty was not inevitable, and we might hope that if we are sufficiently alert or clever, we can avoid the worst of these consequences by choosing the more favorable expansions (assuming these are compatible with our understanding of the world). This is where the third feature of our first example comes into play, namely the apparent tradeoff between parameter and sampling-distribution uncertainty. At this point, you may rightly be wondering whether this feature of the regression example also generalizes reliably to other cases. A simple heuristic that suggests this pattern should hold with some generality is given by the observed Fisher information with respect to $\theta$. In particular, suppose we have a scalar $\theta$ and consider how this quantity varies with $\gamma$ in the expanded model:

$$
-\frac{\partial^2}{\partial \theta^2} \log q(y \mid \theta, \gamma).
$$
If, for typical $\gamma$, this quantity is larger than the observed Fisher information in the base model, then we expect relatively lower marginal uncertainty about $\theta$ in the expanded model. But large values of this second derivative also indicate that the sampling distribution changes rapidly with $\theta$, increasing the "strength" of the additional degree of freedom and putting upward pressure on the uncertainty about the sampling distribution. And clearly the reverse tradeoff exists if this second derivative is smaller than the base model's observed information. Furthermore, if $\theta$ is a vector, then we can make similar arguments involving the spectra of the corresponding information matrices. It is of course interesting to ask when these ideas can be made rigorous, and the main argument of my paper presents a first attempt at doing so.
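The normal example from the previous section gives a quick worked instance of this heuristic (using the notation introduced there):

$$
-\frac{\partial^2}{\partial \mu^2} \log q(y \mid \mu, \sigma) = \frac{n}{\sigma^2},
\qquad
\mathrm{KL}\big(\mathrm{N}(\mu_1, \sigma^2) \,\|\, \mathrm{N}(\mu_2, \sigma^2)\big) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2},
$$

so values of $\sigma$ larger than the base model's known scale simultaneously reduce the information about $\mu$ (more parameter uncertainty) and make the sampling distribution less sensitive to $\mu$ (less sampling-distribution uncertainty), and vice versa for smaller $\sigma$.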

Sampling Distribution Uncertainty and Model Checking

I will close this post with a few words on why we should care about our uncertainty about the true sampling distribution. In fact, I think there are a few distinct reasons, but for now I will focus on how large levels of uncertainty can produce undesirable behavior in the posterior predictive $p$-value. Again my focus will be on the intuition that we can extract from a simple model. Consider modeling two scalar data points as an i.i.d. sample from a Student-$t$ distribution with fixed degrees of freedom and unknown location $\mu$.

We will assume our data points are of the form $y = (a, -a)$ for various levels of the magnitude $a$. As $a$ increases, the model becomes increasingly misspecified, since the sampling distribution has a fixed variance determined by the degrees of freedom. This is evident from the top-right panel of the following figure, which plots the observed data points against the posterior predictive distribution of the model. As we increase the magnitude $a$ with the slider, the observed data becomes increasingly atypical under the posterior predictive distribution.

[Interactive figure: conditional $p$-values against $\mu$ (top left), observed data vs. posterior predictive distribution (top right), and posterior predictive $p$-values against the data magnitude (bottom). Slider: data magnitude, default 0.5.]

We can then compute posterior predictive $p$-values for two test statistics, one based on each of the two observations. By symmetry, these $p$-values are equal, and they are indicated by the dashed line in the top-left panel. They are also plotted against the magnitude $a$ in the bottom panel. We observe two unpleasant features of these $p$-values. First, they never drop below 0.05, a common nominal threshold for rejection. Furthermore, they actually begin increasing again once the magnitude $a$ is large enough, even though the overall model fit becomes worse.

A clearer picture of the model fitness is painted by the conditional $p$-values, defined as the tail probabilities with respect to the individual sampling distributions $p(\tilde{y} \mid \mu)$ rather than the posterior predictive distribution. As the notation suggests, these depend on $\mu$, and the top-left panel plots them against $\mu$ for each of our two test statistics. From this we can see that the usual $p$-value remains above our rejection threshold because each statistic becomes highly typical near one or the other posterior mode as $a$ becomes large. But both statistics are vanishingly unlikely over most of the posterior mass, and once $a$ is large there are no values of $\mu$ for which the statistics are simultaneously typical.
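The posterior predictive $p$-value here is just the posterior average of the conditional tail probabilities, which the following sketch computes directly. The details are assumptions for illustration (a Student-$t$ likelihood with $\nu = 2$ degrees of freedom, unit scale, and a flat prior for the location over a wide grid); the settings behind the figure above may differ.

```python
# A minimal sketch: posterior predictive p-value for data y = (a, -a) under a
# Student-t location model, computed by averaging the conditional tail
# probabilities over a grid approximation to the posterior for mu.
import numpy as np
from scipy import stats

nu = 2.0                                # assumed degrees of freedom
mu_grid = np.linspace(-30, 30, 4001)    # grid approximation to the posterior over mu

def ppp_value(a):
    """Posterior predictive tail probability P(y_rep >= a) for data y = (a, -a)."""
    log_post = stats.t.logpdf(a, nu, loc=mu_grid) + stats.t.logpdf(-a, nu, loc=mu_grid)
    weights = np.exp(log_post - log_post.max())
    weights /= weights.sum()
    # Conditional p-values stats.t.sf(a, nu, loc=mu), averaged over the posterior.
    return np.sum(weights * stats.t.sf(a, nu, loc=mu_grid))

for a in [0.5, 1.0, 2.0, 4.0, 8.0]:
    print(f"a = {a:4.1f}  posterior predictive p-value = {ppp_value(a):.3f}")
```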

While all of the above observations were already apparent from the top-right panel, examining these conditional $p$-values reveals that the poor performance of the posterior predictive $p$-value can be explained by the variation in model fitness over $\mu$. Such extreme variation in fitness over the parameters is only possible when uncertainty about the true sampling distribution is sufficiently large. Said another way, if the sampling distributions that were plausible under the posterior were all very similar, then the conditional $p$-values computed with respect to them could not vary so severely over the support of the posterior. Only in cases like this, where sampling-distribution uncertainty is large, can severe misfit hide in local regions of parameter space.