Your first Stan model: linear regression

R
Bayesian inference
Statistics
brms
Stan
cmdstanr
linear regression

In the previous post we peeked under the hood of brms and saw Stan staring back at us. Now it’s time to write a Stan model ourselves — from scratch, block by block. We’ll use a simple question about reading habits to walk through every part of a Stan program: data, parameters, and model. By the end, you’ll have run your first Stan model and compared its output to brms side by side.

Author

Sven De Maeyer

Published

March 24, 2026

📖 From brms to Stan — Post 2 of 8

This post is part of a series in which I work my way from brms to writing Stan models directly. Each post builds on the previous one, starting from a familiar brms model and gradually revealing the Stan code underneath.

1 Why Stan? The engine behind brms
2 Your first Stan model: linear regression ← you are here
3 Priors in Stan
4 Non-normal outcomes: logistic regression
5 Count data: Poisson and Negative Binomial
6 Multilevel models: random intercepts
7 Multilevel models: random slopes
8 Beyond brms: writing your own Stan model

In the previous post I made the case that you’ve been running Stan all along — brms just kept it politely hidden from you. We glanced at what make_stancode() produces and got a sense of what those mysterious blocks are for. But there’s a real difference between recognizing Stan code and writing it. Reading a recipe is not the same as making dinner.

Today we make dinner. The model we’ll build is about as simple as Bayesian models get: a linear regression with a single continuous predictor. Nothing fancy. But that simplicity is the point — I want every line of Stan code to feel obvious by the time we’re done, not intimidating. Once the skeleton is clear, everything else in this series is just adding more bones.

The dataset: reading at home, reading at school

The example I have in mind involves something educational researchers care a lot about: reading. Specifically, the question is whether children who read more at home also score higher on a standardized reading comprehension test at school.

Imagine a study with 120 fourth-grade pupils. For each child we recorded two things: the average number of minutes they read at home per day (reported by parents), and their score on a standardized reading comprehension test (0–100). The research question is simple:

Does daily reading time at home predict reading comprehension scores in fourth grade?

Let me simulate the data and take a look at it.

library(tidyverse)  # tibble(), mutate(), and the pipe

set.seed(42)

n <- 120

reading_data <- tibble(
  pupil_id     = 1:n,
  reading_time = runif(n, min = 0, max = 60),
  reading_score = 40 + 0.4 * reading_time + rnorm(n, 0, 12),
  gender       = sample(c("girl", "boy", "non-binary"), n,
                        replace = TRUE, prob = c(0.47, 0.47, 0.06))
) %>%
  mutate(reading_score = pmin(pmax(reading_score, 0), 100))
ggplot(reading_data, aes(x = reading_time, y = reading_score)) +
  geom_point(color = "darkgrey", alpha = 0.7, size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "#D95F02", linewidth = 1) +
  labs(
    x = "Daily reading time at home (minutes)",
    y = "Reading comprehension score (0–100)"
  )

Reading time at home versus reading comprehension score for 120 fourth-grade pupils. Each dot is one child. The relationship looks encouraging — more reading at home, better scores at school.

There’s a positive trend, some noise, and no obvious disasters (no scores below 0, no structural weirdness). A clean setup for a first Stan model.
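
Before going Bayesian, a quick frequentist sanity check can't hurt: an ordinary lm() fit should recover roughly the intercept and slope we baked into the simulation. The snippet below re-simulates the data in base R (same seed, same order of random draws) so it runs on its own:

```r
# Self-contained re-simulation of the data (same seed, same draw order as above)
set.seed(42)
n <- 120
reading_time  <- runif(n, min = 0, max = 60)
reading_score <- 40 + 0.4 * reading_time + rnorm(n, 0, 12)
reading_score <- pmin(pmax(reading_score, 0), 100)

# Ordinary least squares as a quick reference point
fit_ols <- lm(reading_score ~ reading_time)
coef(fit_ols)  # intercept and slope in the neighbourhood of 40 and 0.4
```

With only 120 pupils the estimates won't hit the simulated values exactly, but they should land in the right neighbourhood, which is all we need before moving on.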

The brms starting point

Before we write a single line of Stan, let’s fit this model in brms and use it as our reference point. The model is:

\[ \text{reading\_score}_i \sim \mathcal{N}(\mu_i, \sigma) \]

\[ \mu_i = \alpha + \beta \cdot \text{reading\_time}_i \]

In brms:

fit_brms <- brm(
  reading_score ~ reading_time,
  data    = reading_data,
  prior   = c(
    prior(normal(50, 20), class = Intercept),
    prior(normal(0, 1),   class = b),
    prior(exponential(1), class = sigma)
  ),
  warmup = 1000,
  iter   = 3000,
  chains = 4,
  cores  = 4,
  seed   = 42
)

saveRDS(fit_brms, "fitted_models/fit_brms.RDS")
as_draws_df(fit_brms) |>
  select(b_Intercept, 
         b_reading_time, 
         sigma) |>
  posterior::summarise_draws()
# A tibble: 3 × 10
  variable       mean median     sd    mad     q5    q95  rhat ess_bulk ess_tail
  <chr>         <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 b_Intercept  37.3   37.3   1.94   1.92   34.1   40.5    1.00    8249.    6352.
2 b_reading_t…  0.476  0.476 0.0530 0.0535  0.390  0.561  1.00    7952.    5904.
3 sigma        10.4   10.4   0.645  0.644   9.41  11.5    1.00    8518.    5779.

So brms gives us an intercept around 37, a slope around 0.48, and a residual standard deviation around 10 — reasonably close to the values I used to simulate the data (40, 0.4, and 12), with the gaps down to sampling noise in a dataset of 120 pupils. Now let’s see if we can get there ourselves.

Before we dive into our own Stan code, it’s always instructive to peek at what brms generates. This is the single most useful bridge between the two worlds:

make_stancode(
  reading_score ~ reading_time,
  data    = reading_data,
  prior   = c(
    prior(normal(50, 20), class = Intercept),
    prior(normal(0, 1),   class = b),
    prior(exponential(1), class = sigma)
  )
)

The output is more elaborate than what we’ll write — brms adds extra infrastructure for things like prior predictive checks and log-likelihood computations. Our hand-written version will be leaner. That’s a feature, not a bug: we’re here to understand, not to match brms line for line.

The data block: what Stan needs to know upfront

A Stan program is divided into named blocks. The first one we’ll write is the data block. Its job is simple: it declares everything that comes in from R. Nothing is computed here — this is just a declaration of what variables exist and what types they are.

For our linear regression, Stan needs three things: N (the number of observations), reading_time (the predictor vector), and reading_score (the outcome vector).

Here’s what that looks like:

data {
  int<lower=0> N;               // number of observations
  vector[N] reading_time;       // predictor: daily reading time (minutes)
  vector[N] reading_score;      // outcome: comprehension score (0-100)
}

Let me break this down piece by piece.

int<lower=0> N declares an integer variable called N. The <lower=0> part is a constraint: Stan will throw an error if you accidentally pass a negative number of observations. This kind of defensive declaration is one of the things I like about Stan — it forces you to be explicit about what makes sense.

vector[N] reading_time declares a real-valued vector of length N. Notice that N is used in the vector declaration — Stan resolves this correctly because data is processed in declaration order. If we tried to use N before declaring it, Stan would complain. The ordering matters.

vector[N] reading_score is the same idea for the outcome. Both vectors contain real numbers (no integer constraint needed here), which is what we’d expect for continuous measurements.

Stan also has an integer array type (written array[N] int in current Stan syntax), which we’ll use in later posts when we move to binary and count outcomes. For now, vector is exactly right.

The parameters block: what Stan needs to estimate

The parameters block is where we declare the unknowns — the quantities that Stan will explore through its sampling algorithm. Every parameter that appears in our model needs to be declared here.

For a simple linear regression we have three parameters:

parameters {
  real alpha;              // intercept
  real beta;               // slope for reading_time
  real<lower=0> sigma;     // residual standard deviation
}

real alpha and real beta are unconstrained real-valued scalars. They can be positive or negative, which is appropriate for an intercept (scores can in principle be anywhere) and a slope (the relationship could go either way, in principle).

real<lower=0> sigma gets a lower bound of 0. Standard deviations are always positive — if sigma were allowed to go negative, the normal distribution would be undefined. Stan won’t let that happen. Internally, it applies a log transformation so that sampling happens on the unconstrained real line and then maps back. You don’t have to worry about this machinery, but it’s reassuring to know it’s there.
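
To see what that internal transformation amounts to, here is a minimal sketch in R (my own illustration of the idea, not Stan's actual implementation): the sampler moves freely on the whole real line, and exp() maps every proposal to a valid positive value, with log() going the other way.

```r
# Sketch of how a <lower=0> constraint works: the sampler explores the
# unconstrained real line, and exp() maps each value to a positive sigma.
sigma_unconstrained <- c(-2, 0, 1.5)   # values the sampler might visit
sigma <- exp(sigma_unconstrained)
sigma                                   # all strictly positive

# log() is the inverse map back to the unconstrained scale
log(sigma)                              # recovers -2, 0, 1.5
```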

The model block: priors and likelihood

This is where the actual statistical model lives. We specify the prior distributions for each parameter, and the likelihood — how the data relate to the parameters.

model {
  // priors
  alpha ~ normal(50, 20);
  beta  ~ normal(0, 1);
  sigma ~ exponential(1);

  // likelihood (vectorized)
  reading_score ~ normal(alpha + beta * reading_time, sigma);
}

The ~ operator is Stan’s way of saying “is distributed as”. On the left is a parameter (or data), on the right is a probability distribution. Stan uses this to compute the log posterior probability, which is what the sampler needs.

The prior alpha ~ normal(50, 20) encodes the belief that the intercept — the expected score when reading_time = 0 — is somewhere around 50, give or take 20 points. That feels reasonable for a 0–100 scale.

The prior beta ~ normal(0, 1) says we expect the slope to be somewhere around 0 (we’re not sure of the direction or magnitude). A standard deviation of 1 allows for slopes up to about ±2 with reasonable probability — not wildly restrictive, but also not a flat prior that lets Stan explore implausible values like a slope of 50.

The prior sigma ~ exponential(1) puts most prior mass on small residual standard deviations while ruling out negative values. It’s a common weakly informative choice for scale parameters.
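
A quick way to see what these three priors assume is to compute their quantiles in R; this is plain distribution arithmetic, nothing Stan-specific:

```r
# Central 90% interval (or 95th percentile) implied by each prior
qnorm(c(0.05, 0.95), mean = 50, sd = 20)  # intercept: roughly 17 to 83
qnorm(c(0.05, 0.95), mean = 0,  sd = 1)   # slope: roughly -1.64 to 1.64
qexp(0.95, rate = 1)                       # sigma: 95% of prior mass below ~3
```

Note that the exponential(1) prior concentrates sigma below about 3, while the posterior ends up near 10. That is a good reminder that with 120 observations the likelihood easily overrules a weakly informative scale prior.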

The likelihood line is where the magic happens. reading_score ~ normal(alpha + beta * reading_time, sigma) is vectorized: Stan applies this statement to all N observations at once. No loop needed. Under the hood, Stan computes the log of the normal density for each observation and adds them up. This is one of Stan’s elegances — it reads almost like mathematical notation.
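
That "compute the log density for each observation and add them up" step is easy to mimic in R. Here is a conceptual sketch with made-up data and parameter values (mine, for illustration only):

```r
# Toy data and candidate parameter values
y     <- c(52, 61, 47)
x     <- c(10, 30, 20)
alpha <- 40; beta <- 0.4; sigma <- 12

# What the vectorized sampling statement contributes to the log posterior:
# the sum of per-observation normal log densities
ll_vectorized <- sum(dnorm(y, mean = alpha + beta * x, sd = sigma, log = TRUE))

# The same quantity with an explicit loop over observations
ll_loop <- 0
for (i in seq_along(y)) {
  ll_loop <- ll_loop + dnorm(y[i], alpha + beta * x[i], sigma, log = TRUE)
}

all.equal(ll_vectorized, ll_loop)  # the two agree
```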

Putting it all together

Here’s the complete Stan program:

// reading_model.stan
// Simple linear regression: reading comprehension ~ reading time

data {
  int<lower=0> N;               // number of observations
  vector[N] reading_time;       // predictor: daily reading time (minutes)
  vector[N] reading_score;      // outcome: comprehension score (0-100)
}

parameters {
  real alpha;              // intercept
  real beta;               // slope for reading_time
  real<lower=0> sigma;     // residual standard deviation
}

model {
  // priors
  alpha ~ normal(50, 20);
  beta  ~ normal(0, 1);
  sigma ~ exponential(1);

  // likelihood (vectorized)
  reading_score ~ normal(alpha + beta * reading_time, sigma);
}

Now let’s compile and run it using cmdstanr. This happens in a few steps.

First, we create an object called stan_code that contains our model definition as a long string in R.

This model is then written to a separate .stan file on disk with the writeLines() function.

In the next step we explicitly compile the code stored in the .stan file. Once that is done, we define our data as a named list and can start the sampling.

# STEP 1: define the model

stan_code <- "
// reading_model.stan
// Simple linear regression: reading comprehension ~ reading time

data {
  int<lower=0> N;               // number of observations
  vector[N] reading_time;       // predictor: daily reading time (minutes)
  vector[N] reading_score;      // outcome: comprehension score (0-100)
}

parameters {
  real alpha;              // intercept
  real beta;               // slope for reading_time
  real<lower=0> sigma;     // residual standard deviation
}

model {
  // priors
  alpha ~ normal(50, 20);
  beta  ~ normal(0, 1);
  sigma ~ exponential(1);

  // likelihood (vectorized)
  reading_score ~ normal(alpha + beta * reading_time, sigma);
}"

# STEP 2: store the model on the computer as a .stan file

writeLines(stan_code, "reading_model.stan")

# STEP 3: compile the model that is stored in the .stan file

reading_model <- cmdstan_model("reading_model.stan")

# STEP 4: prepare the data list

stan_data <- list(
  N             = nrow(reading_data),
  reading_time  = reading_data$reading_time,
  reading_score = reading_data$reading_score
)

# STEP 5: do the actual sampling

fit_stan <- reading_model$sample(
  data            = stan_data,
  seed            = 42,
  chains          = 4,
  parallel_chains = 4,
  iter_warmup     = 1000,
  iter_sampling   = 2000,
  refresh         = 0
)

fit_stan$save_object(file = "fitted_models/fit_stan.RDS")
fit_stan$summary(
    variables = c("alpha", "beta", "sigma")
)
# A tibble: 3 × 10
  variable   mean median     sd    mad     q5    q95  rhat ess_bulk ess_tail
  <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 alpha    37.4   37.3   1.94   1.93   34.3   40.6    1.00    3027.    3203.
2 beta      0.474  0.476 0.0536 0.0527  0.385  0.560  1.00    2986.    3297.
3 sigma    10.4   10.4   0.656  0.644   9.39  11.6    1.00    3874.    3618.

Take a look at the Rhat column — values close to 1.00 tell us the four chains converged to the same posterior. The ess_bulk and ess_tail columns give effective sample sizes; values above 400 are generally reassuring. If you see warnings about divergences or low ESS, that’s Stan’s way of telling you something went wrong — which almost never happens with a clean linear regression like this one.
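
For intuition about where Rhat comes from: it compares between-chain and within-chain variance. Below is a simplified (non-split) version of the formula, applied to four well-mixed fake chains. The real diagnostic in the posterior package uses rank-normalized split chains, so treat this purely as a sketch:

```r
# Four fake "chains" of 1000 draws each, all sampling the same distribution
set.seed(1)
draws <- matrix(rnorm(4000), nrow = 1000, ncol = 4)

n <- nrow(draws)
W <- mean(apply(draws, 2, var))      # within-chain variance
B <- n * var(colMeans(draws))        # between-chain variance
var_hat <- (n - 1) / n * W + B / n   # pooled variance estimate
rhat <- sqrt(var_hat / W)
rhat  # close to 1 when the chains agree
```

If the chains were stuck in different regions, the between-chain term B would inflate var_hat and push Rhat well above 1, which is exactly the warning sign to look for in the summary table.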

brms vs. Stan: the comparison

Here’s the moment of truth. Let me pull the posterior summaries from both fits and put them side by side.

# brms posterior — drawn from stored draws, no backend needed
brms_summary <- as_draws_df(fit_brms) |>
  rename(alpha = b_Intercept, beta = b_reading_time) |>
  select(alpha, beta, sigma) |>
  posterior::summarise_draws() |>
  select(variable, mean, sd, q5, q95) |>
  mutate(source = "brms")

# Stan posterior summary
stan_summary <- fit_stan$summary(variables = c("alpha", "beta", "sigma")) |>
  select(variable, mean, sd, q5, q95) |>
  mutate(source = "Stan")

bind_rows(brms_summary, stan_summary) |>
  arrange(variable, source) |>
  mutate(across(where(is.numeric), \(x) round(x, 3))) |>
  knitr::kable(caption = "Posterior summaries: brms vs. Stan (hand-written model)")
Posterior summaries: brms vs. Stan (hand-written model)
variable     mean      sd       q5       q95     source
alpha       37.365   1.937   34.301   40.603     Stan
alpha       37.274   1.937   34.087   40.467     brms
beta         0.474   0.054    0.385    0.560     Stan
beta         0.476   0.053    0.390    0.561     brms
sigma       10.421   0.656    9.389   11.552     Stan
sigma       10.427   0.645    9.410   11.545     brms

They match. Not exactly — MCMC is stochastic, so there will always be tiny differences between runs — but the means, standard deviations, and credible intervals are essentially the same. This is reassuring: our hand-written Stan model is doing exactly what brms does under the hood, just without the scaffolding.

It’s also a useful sanity check you can always perform: if your Stan model gives you something wildly different from the equivalent brms model, something has gone wrong.

Let me also visualize the posteriors for the three parameters to get a sense of how concentrated or spread out they are:

library(ggdist)  # stat_halfeye() comes from the ggdist package

fit_stan$draws(
    variables = c("alpha", "beta", "sigma"), format = "df") |>
  pivot_longer(cols = c(alpha, beta, sigma),
               names_to = "parameter", values_to = "value") |>
  ggplot(aes(x = value)) +
  stat_halfeye(
      fill = "darkgrey", 
      color = "black", 
      .width = c(0.5, 0.95)
  ) +
  geom_vline(
    data = \(d) d |> 
      group_by(parameter) |> 
      summarise(mean_val = mean(value)
     ),
    aes(xintercept = mean_val),
    color = "#D95F02", linewidth = 1
  ) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = "Parameter value", y = NULL)

Posterior distributions for the three parameters of the linear regression model, as estimated by our hand-written Stan model. Orange lines mark the posterior mean.

The posterior for beta is comfortably above zero — daily reading time does seem to predict comprehension scores. The posterior for sigma tells us there’s quite a bit of residual variation: even knowing how much a child reads, their score is uncertain to the tune of about 10 points.

What we haven’t done yet

Look back at the Stan code we wrote: data, parameters, model. Three blocks out of six. We haven’t touched transformed data, transformed parameters, or — most importantly for the next post — generated quantities.

The generated quantities block is where Stan can produce derived quantities after sampling: posterior predictions, log-likelihood values, or — crucially — draws from prior distributions. In the next post we’ll use generated quantities to run prior predictive checks entirely from within Stan, which gives us fine-grained control over what our model assumes before it ever sees the data. That’s where things start to get really interesting.

Wrap-up

We wrote our first Stan program. It has three blocks, ten lines of actual code (excluding comments and braces), and it produces results indistinguishable from brms. The data block declares what comes in from R. The parameters block names what we want to estimate. The model block states our beliefs (priors) and how the data arose (likelihood). That’s the whole skeleton — everything else is elaboration.

I find it genuinely satisfying to run something this simple from scratch and watch it converge. If you’ve never done it before, I hope you feel that too. And if something doesn’t match up when you try it yourself — chains not converging, unexpected warnings, results that look off — I’d love to hear about it. Feel free to reach out. Have fun!

Acknowledgement

This post was developed with the help of Claude Sonnet 4.6 (Anthropic) during the ideation and drafting phase. All content has been checked, revised, and rewritten by me.