Prior Validation Via Prior Predictive Checks
Let’s explore how prior predictive checks can help you understand whether your priors are reasonable before you see any data. This is an important step in the Bayesian workflow that is often overlooked, but it can save you from fitting models that encode assumptions you never intended.
Prior predictive checks: seeing what your priors actually say
After writing down a Bayesian model, the next step is to specify prior distributions for its parameters. There are many ways to do this, and unless you have a large amount of data, the prior you choose will leave an imprint on your posterior inferences. Selecting a prior is a difficult problem with no single right answer. Perhaps you followed a textbook recommendation, or you picked something vague because you did not want to be too informative. Either way, you now have a number sitting in your model definition, and you are about to condition on data. Before you do, there is a question worth asking: what does your model believe before it has seen anything?
A prior distribution over parameters implies a distribution over observable data. Placing a prior on a parameter is therefore also a statement about the data you expect to see.
For a simple linear model with an intercept, a slope, and a noise term, the mapping from priors to predictions is transparent enough that you may be able to judge the priors by inspection. But as models grow in complexity, this quickly becomes infeasible: parameters interact with each other and with the likelihood in ways that are difficult to understand without simulation. A prior predictive check is a way to simulate from your model before it sees data, giving you a glimpse of what your model actually believes.
The procedure to conduct a prior predictive check is straightforward: sample parameter values from your priors, propagate those samples through the likelihood, and collect the resulting simulated observations. Each draw gives you a fake dataset, one that your model considers plausible before any real data enters the workflow. If those fake datasets look nothing like anything you could plausibly encounter in practice, your priors encode assumptions you did not intend.
The exercise reframes how we think about prior selection. Instead of asking “what do I believe a reasonable value for this parameter could be?” you ask “what data would this model generate, and do I believe it?” In all but the simplest models, the second question is far easier to answer because it is grounded in the observable world, where you have genuine domain knowledge. Suppose you are modelling human weights. You know that weights are strictly positive and do not reach into the tens of thousands of kilograms. Likewise, in an economic model, you know that a stable indicator does not double overnight. These are constraints on data, not on parameters, and the prior predictive distribution is the bridge between the two.
The rest of this post works through two examples. The first is a linear regression, chosen because the model is familiar and the pathologies are easy to see. The second is an AR(1) time series model, where the mapping from parameters to observations is less obvious and the need for simulation is more acute. My hope is that, through these examples, you will see how useful a prior predictive check can be, and you can begin incorporating it into your own Bayesian workflow.
Linear regression
Suppose you want to predict a person’s weight in kilograms from their height in centimetres. You have a rough sense of the data: heights cluster between 140 and 200 centimetres, and weights fall somewhere between 40 and 150 kilograms. This is a simple linear regression with an intercept, a slope, and a noise term.
Below is a first attempt at the model in NumPyro. The priors are not unusual, and there is nothing fundamentally “wrong” with specifying an uninformative model such as the one below.
When considering each prior independently, everything looks reasonable. The intercept prior is centred at zero with a standard deviation of 100. The slope prior is the same. The noise prior is a half-normal with scale 50. None of these numbers are obviously wrong if you think about them one parameter at a time.
Now run the prior predictive check. Generate a grid of plausible heights and draw 500 synthetic datasets from the prior.
The object prior_samples is a dictionary. The key "weight" holds a two-dimensional array with 500 rows and 50 columns, giving one prior predictive draw per row at each of the 50 height values. Let’s plot these draws against height and look at what the model considers plausible.

The result is a mess. The prior predictive draws fan out across thousands of kilograms in both directions. Some lines predict that a 170 cm person weighs negative 5,000 kilograms. Others predict 10,000 kilograms. The model, before seeing data, considers it just as likely that taller people weigh less as that they weigh more. None of this resembles anything you would expect from a dataset of human weights.
The problem is not any single prior. It is how they combine. The intercept prior allows values far from any reasonable baseline weight. The slope prior allows enormous positive and negative slopes. The noise prior adds further spread on top of that. Together, they produce predictions that are physically meaningless.
Now revise the priors. The intercept represents the predicted weight at zero height, which is not a meaningful quantity for this problem. A better approach is to think about the predicted weight at the mean height in the dataset, roughly 170 cm. Centre the model there. The slope should encode the belief that each additional centimetre of height adds or subtracts a modest amount of weight, probably less than 2 kilograms. The noise should reflect the range of individual variation you would expect around the trend.
Several things changed. The intercept prior is now centred at 60 kilograms with a standard deviation of 10. This reflects a belief that the average weight at 170 cm is somewhere between roughly 40 and 80 kilograms, which covers a wide range of populations. The slope prior is centred at 0.5 kg/cm with a standard deviation of 0.5, encoding a mild positive relationship while still allowing for zero or negative slopes. The noise prior is tightened to a half-normal with scale 10, reflecting expected person-to-person variation. The height has been mean-centred so that the intercept has a direct interpretation.
Run the prior predictive check again.

The revised prior predictive draws tell a different story. The predicted weights now cluster between roughly 20 and 120 kilograms across the height range. The relationship is mostly positive, with some draws showing a flat or slightly negative slope. The spread around each line is modest. Nothing here violates what you know about human weight before looking at any particular dataset. That is exactly what you want.
The goal is not to make the priors correct. You do not know the true values of the parameters, and you are not trying to guess them. The goal is to make the priors consistent with what you already know. If your model, before seeing data, generates predictions that include negative body weights or weights in the tens of thousands, it is encoding something you did not intend.
AR(1) time series
The second example uses a first-order autoregressive model. Before writing any code, it is worth understanding what this model does and what the parameters control.
An AR(1) process generates a sequence of observations where each value depends on the previous one. The model has three parameters: the autoregressive coefficient, which governs how strongly each observation depends on its predecessor; the innovation variance, which controls the size of the random shocks at each time step; and a mean level around which the process fluctuates.
The autoregressive coefficient carries a specific structural requirement. If it lies between -1 and 1, the process is stationary, meaning it has a stable long-run distribution and will fluctuate around the mean without drifting off or exploding. If the coefficient equals or exceeds 1 in absolute value, the process is non-stationary: it can wander arbitrarily far from the mean or oscillate with growing amplitude. Whether you want stationarity depends on the problem, but in many applied settings you expect stationarity, and your priors should reflect that.
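Before turning to NumPyro, a quick plain-NumPy simulation makes the contrast concrete (the parameter values and helper name here are illustrative):

```python
import numpy as np

def simulate_ar1(rho, sigma=1.0, mu=0.0, T=100, seed=0):
    # Iterate y[t] = mu + rho * (y[t-1] - mu) + shock
    rng = np.random.default_rng(seed)
    y = np.empty(T)
    y[0] = mu
    for t in range(1, T):
        y[t] = mu + rho * (y[t - 1] - mu) + rng.normal(0.0, sigma)
    return y

stable = simulate_ar1(rho=0.9)      # |rho| < 1: fluctuates around mu
explosive = simulate_ar1(rho=1.05)  # |rho| > 1: deviations compound
print(np.abs(stable).max(), np.abs(explosive).max())
```

With a coefficient only slightly above 1, the deviations multiply at every step and the trajectory leaves any plausible range within a hundred steps.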
Here is a NumPyro model with naive priors.
The prior on rho is Normal(0, 1). This is a standard normal, which seems harmless. But roughly 32 per cent of draws from a standard normal fall outside the interval from -1 to 1. That means about a third of the prior predictive trajectories will be non-stationary. Some will explode. Some will oscillate with growing amplitude.
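The 32 per cent figure is easy to verify from the standard normal CDF:

```python
import math

# P(|Z| > 1) for Z ~ Normal(0, 1), via the error function
p_outside = 1 - math.erf(1 / math.sqrt(2))
print(round(p_outside, 3))  # 0.317
```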
Draw prior predictive samples and plot the trajectories.

A substantial fraction of the trajectories shoot off to extreme values. Some reach thousands or tens of thousands within a hundred time steps. Others look plausible, fluctuating around a central level with moderate variation. The plot is dominated by the explosive trajectories, which is itself informative: your model, before seeing data, places real probability mass on processes that grow without bound.
The difficulty here is that you cannot see this problem by looking at the prior on rho alone. A Normal(0, 1) distribution is unremarkable. The pathology only becomes visible when you simulate forward through time, because the non-stationarity compounds at each step. This is precisely the kind of situation where prior predictive checks earn their keep.
Revise the priors. If you expect a stationary process with moderate persistence, you want the autoregressive coefficient to lie within the stationary region and to concentrate around values that produce the kind of temporal dependence you consider plausible. A Uniform(-1, 1) prior restricts rho to the stationary region. If you have stronger beliefs, say that the process is positively autocorrelated with moderate persistence, a Beta distribution scaled to the interval from -1 to 1 can encode this more precisely. For this example, use a Uniform prior to keep things simple, and tighten the innovation variance.

The revised trajectories are stationary by construction, since every draw of rho lies between -1 and 1. The trajectories fluctuate around their mean levels with moderate variation. Some are strongly autocorrelated, producing smooth, slowly varying paths. Others are weakly autocorrelated, producing noisier sequences. All of them remain bounded. This is what a reasonable set of prior beliefs looks like for many applied time series problems.
Closing reflection
Both examples follow the same pattern. Write a model. Simulate from the priors. Look at the synthetic data. Ask whether it looks like something you could plausibly observe. If it does not, revise the priors and simulate again.
Prior predictive checks are most informative when the mapping from parameters to observations is non-obvious. In a linear regression with one covariate, the relationship between priors and predictions is simple enough that an experienced modeller might catch the problem by inspection. But models grow. Parameters interact. Hierarchical structure introduces partial pooling effects that are difficult to reason about in your head. Likelihoods transform parameters through link functions, convolutions, or recursive dynamics. In these settings, simulation is not a convenience but a necessity.
A common misconception is that prior predictive checks are about getting the priors right in some objective sense. They are not. There is no single correct prior for a given problem, and reasonable analysts will make different choices. The purpose is narrower than that: to check whether your stated priors contradict what you already know before seeing data. A model that predicts negative body weights or explosive time series has priors that are inconsistent with basic domain knowledge, regardless of what a textbook says about uninformative defaults.
Making prior predictive checks a routine part of your workflow requires almost no additional effort. The code is a few lines: construct a Predictive object, draw samples, plot. The computational cost is negligible compared to inference. And the feedback you get, seeing what your model actually believes before it touches data, is something you cannot get any other way.
The same simulation-based thinking extends to later stages of the Bayesian workflow. After fitting the model, you can draw from the posterior predictive distribution and ask whether the fitted model generates data that resembles what you observed. You can perturb the priors and check whether your conclusions are sensitive to specific choices. These are all variations on the same idea: use simulation to understand what your model says, because models are too complex to understand by reading their definitions alone.