Bayesian multilevel models with brms: a demo with TIMSS2019 data

Bayesian inference
Multilevel modelling
brms
Stan
R
TIMSS
tidybayes
bayesplot
Bayesian statistics is on the rise. There is no doubt about that. But the learning curve is quite steep for many researchers. In this post, I give you that extra push as you climb the mountain on your route to learning Bayesian statistics. A short demo of how to use the superb brms package to estimate some standard multilevel models can hopefully be inspiring and reassuring. While you’re on the road you may also pick up some neat plotting skills!
Author

Sven De Maeyer

Published

February 22, 2021

A great number of disciplines are embracing Bayesian statistics as a tool for the applied researcher. Although some influential scholars like Richard McElreath make a plea for teaching and learning Bayesian statistics from an insider perspective rather than an outsider perspective, it is sometimes hard to imagine what Bayesian statistics is about without a link to analyses and models we are familiar with. This post aims to introduce Bayesian statistics to educational researchers by applying it to the domain of educational effectiveness studies. It is not my ambition here to give a thorough statistical and mathematical underpinning of Bayesian statistics. Instead, the approach is to introduce brms and the workflow of running Bayesian analyses while taking you along in the kind of inferences that can be drawn.

I will have succeeded in my aim if, after reading the post, you feel both inspired and confident enough to start your own Bayesian analyses with brms.

Why brms?

One of the most widely used programming languages for running Bayesian models is Stan (Stan Development Team 2020). Given that Stan is a full programming language, it has a steep learning curve, certainly for people who are less confident in learning new programming languages. To remove the need to learn the Stan language, some great packages have been developed that allow regular R-users to keep working in the language they are most familiar with: R code. One package stands out in its ‘simplicity’ and ‘versatility’: brms (Bürkner 2018). The following diagram (from Bürkner 2017) demonstrates the workflow of brms:

Show the code
knitr::include_graphics("Workflow_brms.jpeg")

Figure based on Bürkner 2017

The most convenient part of this workflow is that analysts familiar with the R formula notation used in lm() or lmer() can use that same notation! In brms this formula syntax is even extended to a broader range of models (e.g. finite mixture models, multivariate models) that we will not cover here. Time to demonstrate the workflow and show how ‘easy’ this is.
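To make that claim concrete, here is a minimal sketch (assuming the lme4 package, and using the Student_FL19 data we import below) of how little changes when moving from a frequentist lme4 fit to its Bayesian brms counterpart; only the fitting function differs:

library(lme4)
library(brms)

# Frequentist multilevel model with lme4:
fit_freq  <- lmer(math ~ 1 + Girl + (1 | IDSCHOOL), data = Student_FL19)

# Bayesian counterpart in brms: the formula is identical
fit_bayes <- brm(math ~ 1 + Girl + (1 | IDSCHOOL), data = Student_FL19)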

The TIMSS 2019 dataset

For this post we make use of a publicly available dataset: TIMSS 2019 (see https://timss2019.org/international-database/). More specifically, we will use the data for the Dutch-speaking part of Belgium (aka Flanders).

Before we dig into the analyses, let’s load the packages needed and set a default theme for the plots we will make.

Time to import the datasets. We use two datasets from the international release (note that bfl stands for the specific country code):

  • asgbflm7.sav : student achievement data and some background data;
  • acgbflm7.sav : school level data.

As these data are stored in SPSS format (.sav) we have to import them. One of the many dedicated packages for this job is haven (Wickham and Miller 2020), from which we can use the read_sav() function.

Show the code
library(haven)
Student_FL19 <- read_sav("asgbflm7.sav") %>%
  select(IDSCHOOL, IDSTUD, ITSEX, ASMMAT01) %>%
  mutate(math = ASMMAT01,
         Girl = if_else(ITSEX == 1, 1, 0)) %>%
  zap_labels() 

School_contextFL19 <- read_sav("acgbflm7.sav") %>%
  select(IDSCHOOL, ACDGSBC) %>%
  mutate(SES_SCHOOL = factor(ACDGSBC)) %>%
  zap_labels()

Once we have imported both datasets we make a selection of variables used in this post. From the student achievement dataset we only retain the variables that identify the students and the schools they belong to (IDSTUD and IDSCHOOL), the gender of the student (ITSEX) and the first plausible value for overall mathematics skill (ASMMAT01). Making use of the mutate() function we copy this last variable to math and we create a dummy variable named Girl that takes value 1 if the student is a girl and 0 if not. Both variables are created for convenience further on in the analyses.

From the school-level dataset we keep the IDSCHOOL variable as well as the variable ACDGSBC, which is a school-level indicator of School Composition by Socioeconomic Background (with category 1 indicating that the composition is ‘More Affluent’, category 2 indicating that the composition is ‘Neither more Affluent nor more Disadvantaged’ and category 3 indicating that the composition is ‘More Disadvantaged’). To make our coding somewhat easier further on, we rename this variable to SES_SCHOOL and transform it into a factor.

First exploration of the data

Although a good Bayesian doesn’t look at the data before building a model and formulating hypotheses, we can still take a quick peek at the data to see whether any irregularities are present. In the following graph we present (just to demonstrate the power of ggplot…) a pairs plot made with the GGally package (Schloerke et al. 2021). The data for the plot are aggregated to the school level to give us a first, very rough idea of what to expect when we model the impact of School Composition by Socioeconomic Background on mathematics scores.

Show the code
library(GGally)
Student_FL19 %>%
  left_join(School_contextFL19) %>%
  group_by(IDSCHOOL) %>%
  summarize(mean_math     = mean(ASMMAT01, na.rm = T),
            ACDGSBC       = first(ACDGSBC)) %>%
  mutate(SES_SCHOOL = as.factor(ACDGSBC)) %>%
  select(mean_math, 
         SES_SCHOOL) %>%
  filter(SES_SCHOOL != 9) %>%
  ggpairs(aes(color=SES_SCHOOL),upper = list(continuous = wrap("cor", size = 3))) +
  theme_linedraw(base_size = 7)

Pairs plot for TIMSS 2019 mathematics score and School Composition by Socioeconomic Background

Before we kick off our Bayesian model estimation, let’s create a dataset that merges both datasets and only keeps the students for whom we have complete data (making use of drop_na()), so that we can do a proper model comparison later on:

Show the code
Student_FL19 <- Student_FL19 %>%
  left_join(School_contextFL19) %>%
  drop_na()
Joining with `by = join_by(IDSCHOOL)`

We’re ready to model!

Time to model

In this post we will estimate 3 models:

  • a very naive model
  • a multilevel null model
  • a multilevel model with two predictors

The very naive model

Let’s start with a very naive model stating that we have no information on individual students nor on the school or class context to predict mathematics scores. Of course we can assume that a multitude of factors determines an individual student’s score (e.g. prior schooling, parental support, peer support, motivation, teacher’s practices, …), so we can see an individual student’s mathematics score as an average of all these factors. Therefore, we can invoke the central limit theorem and assume that the math score of an individual student i comes from a normal distribution (see Lambert 2018). One way to write this formally is as follows:

\[\begin{equation} \begin{aligned} \text{math}_{i} \sim & N(\mu,\sigma_{e_{i}})\\ \end{aligned} \end{equation}\]

with \(\text{math}_{i}\) indicating the score of an individual student \(i\), \(\mu\) indicating an overall average and \(\sigma_{e_{i}}\) indicating the standard deviation of the student-level residuals (the variation between students).

Essentially this boils down to a simple regression model with no explanatory variables and only an intercept. In brms we can estimate this basic model with the brm() function: math ~ 1 indicates that we only include an intercept (1 is the placeholder in R formula notation for an intercept) and family = "gaussian" indicates that we assume a normal distribution.

Show the code
Model_math_naive <- brm(
  math ~ 1 ,             # Typical R formula style
  data = Student_FL19,  
  family = "gaussian")

Now, before running this model it is always advisable to think about our priors. brms will give us some default priors that you can inspect making use of the get_prior() function.

Show the code
get_prior(
  math ~ 1 ,            
  data = Student_FL19
  )
                     prior     class coef group resp dpar nlpar lb ub  source
 student_t(3, 533.1, 69.7) Intercept                                  default
     student_t(3, 0, 69.7)     sigma                             0    default

For instance, for the Intercept a Student-t distribution with 3 degrees of freedom, a mean of 533.1 and a sigma of 69.7 is suggested by brms. Alternatively we can construct our own prior, based on what we already know about average mathematics scores for Flanders from previous TIMSS studies: in 2003 the average was 551, in 2011 the average was 549 and in 2015 the average was 546. We could use this prior knowledge to construct a prior: a normal distribution with a mean of 548.9 (roughly the mean of the 3 previous results for Flanders) and a sigma of 25 expressing our uncertainty about the average for 2019. To illustrate the difference between our custom prior and the default brms prior we can plot both probability density functions:

Show the code
library(metRology) # to use dt.scaled()

x <- seq(350,650,by=.1)
custom       <- dnorm(x,548.9,25)
brms_default <- dt.scaled(x,3,533.1,69.7)

tibble(x,custom,brms_default) %>%
  pivot_longer(c(custom,brms_default),names_to="Prior") %>%
  ggplot(aes(x = x, y = value, lty = Prior)) + 
  geom_line() + 
  scale_x_continuous("math",limits = c(350,650)) +
  scale_y_continuous("probability density")

Our custom prior and the brms default prior for the Intercept

Let’s use the custom prior for the Intercept and stick to the default prior for the sigma parameter in our model. In the following code we will set our prior for the parameter of class Intercept and store it in the object priorMath. Next we can call the brm() function with some additional arguments to define our prior (prior = priorMath), to make use of parallel computing power by using 4 cores for the estimation (cores = 4) and to make our analysis reproducible by setting a seed (seed = 2021).

Show the code
priorMath <- c(set_prior("normal(548.9,25)", class = "Intercept"))

Model_math_naive <- brm(
  math ~ 1 ,             
  data = Student_FL19,  
  family = "gaussian",  
  prior = priorMath, # use our custom prior
  cores = 4,         # make use of 4 cores
  seed = 2021        # to make it reproducible ;-)
)

Once our model has run we are ready to inspect the results! Let’s start with probably the most used function in R: summary().

Show the code
load("Model_math_naive.R")
summary(Model_math_naive)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: math ~ 1 
   Data: Student_FL19 (Number of observations: 4257) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   532.13      1.03   530.11   534.17 1.00     3601     2775

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    67.69      0.73    66.22    69.13 1.00     3328     2763

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

From the summary we can learn that our model estimation converged well: \(\widehat R\) < 1.01 and effective sample sizes greater than 100 times the number of chains (= 4) for each parameter (Vehtari et al. 2021).
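If you prefer to check these diagnostics programmatically rather than reading them off the summary, a minimal sketch using the posterior package (which brms relies on under the hood) could look like this:

library(posterior)

# Rank-normalised Rhat and bulk/tail effective sample sizes
# for all parameters in the fitted model:
summarise_draws(
  as_draws_df(Model_math_naive),
  "rhat", "ess_bulk", "ess_tail"
)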

Interpretation of the output can be done in a similar way as interpreting regular R output. For instance, we can see that 532.13 and 67.69 are the point estimates for the intercept and sigma respectively. Nevertheless, as true Bayesians we want to express our uncertainty rather than sticking to a single point estimate. The output already gives us some first information: 95% credible intervals. Concerning the intercept, we learn that the 95% most probable values for the true population average for mathematics in Flanders lie between 530.11 and 534.17.

Rather than sticking to the output generated with the summary() command, in a publication we would like to visualise our uncertainty. The package brms has a great command to extract our posterior samples: posterior_samples(). (In recent versions of brms this function is deprecated in favour of the as_draws() family, as the warning below reminds us, but it still works.) This function will form the basis for the different plots we will create. Let us first have a quick look at what we get out of this command, so the code used further on in this post will make more sense. We wrap the command in head() in order to print only the first 5 rows.

Show the code
head(
  posterior_samples(Model_math_naive), 5
  ) 
Warning: Method 'posterior_samples' is deprecated. Please see ?as_draws for
recommended alternatives.
  b_Intercept    sigma      lp__
1    529.9010 66.40590 -23991.92
2    531.7248 67.90770 -23988.07
3    532.4621 68.11656 -23988.16
4    531.7378 67.66156 -23988.02
5    530.9714 67.92879 -23988.63

We see that we get a column called b_Intercept storing possible values for the Intercept parameter in the model and a column called sigma storing possible values for the sigma parameter. In fact, posterior_samples() gives us 4000 intercept and 4000 sigma values:

Show the code
dim(posterior_samples(Model_math_naive))
[1] 4000    3
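For completeness: the recommended, non-deprecated way to get the same draws is as_draws_df(); a sketch of the same peek (note that as_draws_df() adds .chain, .iteration and .draw metadata columns):

head(
  as_draws_df(Model_math_naive), 5
)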

These samples are parameter values drawn from our posterior distribution and can be used to plot probability densities. The fact that we have 4000 samples is due to the default settings of brms: the estimation algorithm runs 4 chains of 2000 iterations each, of which the first 1000 iterations are warmup (burn-in) and are discarded.
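Spelled out, those defaults correspond to the following arguments of brm() (a sketch; above we simply relied on the defaults), and the arithmetic indeed gives 4000 post-warmup draws:

Model_math_naive <- brm(
  math ~ 1,
  data   = Student_FL19,
  family = "gaussian",
  prior  = priorMath,
  chains = 4,     # default: 4 chains
  iter   = 2000,  # default: 2000 iterations per chain
  warmup = 1000,  # default: first 1000 iterations are warmup
  cores  = 4,
  seed   = 2021
)

# post-warmup draws = chains * (iter - warmup)
4 * (2000 - 1000)  # 4000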

Now that we know this, we can use these samples as input to plot our uncertainty, starting with a first visualisation using geom_interval() from tidybayes. We plot a 50%, an 89% and a 95% credible interval.

Show the code
posterior_samples(Model_math_naive) %>%     # Get draws of the posterior
  select(b_Intercept) %>%                   # Select our Intercept estimate column
  median_qi( .width = c(.95, .89, .5)) %>%  # Choose the intervals we want to plot
  ggplot(aes(x = b_Intercept)) +            # Let's plot...
   geom_interval(aes(xmin = .lower,         #   using geom_interval
                     xmax = .upper)) +
   xlab("marginal posterior")     +         # Set our title for x-axis
   scale_y_continuous("b_Intercept", breaks = NULL) +
   scale_color_brewer()                     # Choose a color scale

CI’s for the Intercept parameter based on the naive model

We can also combine the information on both parameters in a single plot.

Show the code
posterior_samples(Model_math_naive) %>%        # Get draws
  select(b_Intercept,sigma) %>%                # Select both parameters
  pivot_longer(everything()) %>%               # Create a dataset of long format
  ggplot(aes(x = value, y = 0)) +              # Now we plot using ...
   stat_interval(.width = c(.50,.89,.95)) +    # stat_interval()
   scale_y_continuous(NULL, breaks = NULL) +
   xlab("marginal posterior") +
   facet_wrap(~name, scales = "free") +
   scale_color_brewer() 

CI’s for the Intercept and sigma parameter based on the naive model

To create these kinds of graphs it is always wise to start from examples like these and adapt them to your own results.

Multilevel null model

Our previous model was somewhat naive. We know that these students cannot be seen as independently sampled from a population of 10-year-olds. The sampling design in TIMSS is two-staged: first a sample of schools is drawn, and then a sample of students within each school. Students from the same school share a common context and therefore cannot be seen as independent observations. Time to take the school into account and introduce our first genuine multilevel model: a null model. This model only includes an intercept, a variance between schools and a variance between pupils within schools. A typical formal way to write down this model in multilevel publications is as follows:

\[\begin{equation} \text{math}_{ij} = \beta_0 + u_j + e_{ij}\\ \text{with} \\ u_j \sim N(0,\sigma_{u_{j}})\\ \text{and} \\ e_{ij} \sim N(0,\sigma_{e_{ij}}) \end{equation}\]

Note that we added the subscript j now, indicating that we model the mathematics score of pupil i in school j. And we have two residuals in the model: \(u_j\) referring to a school-specific residual and \(e_{ij}\) to the individual student’s deviation from the school average. For both residuals we assume that they are normally distributed with an expected value (or mean, if you wish) of 0. This same model can also be written in an alternative way, more in line with the general Bayesian way of writing models:

\[\begin{equation} \begin{aligned} & \text{math}_{ij} \sim N(\mu_{j},\sigma_{e_{ij}})\\ & \mu_{j} = \beta_0 + u_j \\ & u_j \sim N(0,\sigma_{u_{j}})\\ \end{aligned} \end{equation}\]

The code to analyze this model in brms, making use of the same priors as in our naive model, looks like this:

Show the code
Model_math_null <- brm(
  math ~ 1 + (1|IDSCHOOL), # typical lme4 style notation
  data = Student_FL19,
  family = "gaussian",
  prior = priorMath,
  cores = 4,
  seed = 2021
  )

Adding (1|IDSCHOOL) to the formula means that we allow the Intercept to vary from school to school.
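As an aside: the same lme4-style notation extends naturally to random slopes. A sketch (not estimated in this post) that would additionally let the effect of Girl vary from school to school looks like this:

Model_math_slopes <- brm(
  math ~ 1 + Girl + (1 + Girl | IDSCHOOL), # school-specific intercepts and Girl effects
  data = Student_FL19,
  family = "gaussian",
  cores = 4,
  seed = 2021
)

But back to our null model: the output is again easily accessed making use of the summary() command.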

Show the code
load(file = "Model_math_null.R")
summary(Model_math_null)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: math ~ 1 + (1 | IDSCHOOL) 
   Data: Student_FL19 (Number of observations: 4257) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~IDSCHOOL (Number of levels: 133) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)    27.80      2.13    23.86    32.21 1.00     1235     1905

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   532.22      2.54   527.33   537.32 1.00     1312     1852

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    62.47      0.70    61.12    63.82 1.00     7801     3026

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

From the output we learn that the uncertainty concerning the intercept (the population average mathematics score) has increased compared to our naive model: now the 95% most credible population average scores lie between 527 and 537. This loss in precision is exactly what we would expect from this type of model, given the extra variance we modelled: the variance between schools. For that parameter (actually a standard deviation) the model indicates a value somewhere between 24 and 32, with a point estimate of 28. In a frequentist analysis we would get only that point estimate, which would typically be used to calculate the intra-class correlation (ICC). Based on the point estimates of our model we get an ICC of 0.165 (\(\text{ICC}=\frac{27.80^2}{27.80^2 + 62.47^2}\)), indicating that 16.5% of the variance in mathematics scores can be attributed to differences between schools.
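As a quick check of that arithmetic in plain R, using the point estimates from the summary above:

sd_school <- 27.80  # sd(Intercept): between-school standard deviation
sd_pupil  <- 62.47  # sigma: pupil-level (within-school) standard deviation
sd_school^2 / (sd_school^2 + sd_pupil^2)  # ICC of about 0.165

The nice thing about a Bayesian approach is that we can use our posterior samples to express our uncertainty about this measure! Here we use our posterior_samples() function again, and in the mutate() step we calculate an ICC for each of the 4000 samples, resulting in 4000 ICC values (only 5 are shown here).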

Show the code
head(
  posterior_samples(Model_math_null) %>%
    mutate(ICC = (sd_IDSCHOOL__Intercept^2/(sd_IDSCHOOL__Intercept^2 + sigma^2))) %>%
    select(b_Intercept, sigma, sd_IDSCHOOL__Intercept, ICC), 
  5
  ) 
  b_Intercept    sigma sd_IDSCHOOL__Intercept       ICC
1    531.5805 62.78881               28.33029 0.1691461
2    529.5665 62.00816               26.59448 0.1553654
3    532.8093 62.23712               28.62321 0.1745861
4    533.1545 62.40529               28.78412 0.1754256
5    531.5764 62.63484               25.88342 0.1458612

An even nicer thing is that we can also plot our uncertainty about the ICC!

Show the code
posterior_samples(Model_math_null) %>%      # Get draws of the posterior
  mutate(ICC = (sd_IDSCHOOL__Intercept^2/(sd_IDSCHOOL__Intercept^2 + sigma^2))) %>%
  select(ICC) %>%                           # Select ICC
  median_qi( .width = c(.95, .89, .5)) %>%  # Choose the intervals we want to plot
  ggplot(aes(x = ICC)) +                    # Let's plot...
   geom_interval(aes(xmin = .lower,         #   using geom_interval
                     xmax = .upper)) +
   scale_x_continuous("marginal posterior", 
                      breaks = seq(.13,.21,by =.01)) + # define our x-scale
   scale_y_continuous("ICC", breaks = NULL) +
   scale_color_brewer()                     # Choose a color scale

CI’s for the ICC based on the null model

This plot shows that we are 89% confident that the ICC is somewhere between .135 and .205.

We can now bring it all together and create a plot incorporating the marginal posterior distributions of all relevant parameters from the model. The following code does this for us. Note that I used another way of visualising the uncertainty, making use of stat_dotsinterval(). Research has shown that this kind of visualisation leads to more correct interpretations than intervals or density curves (see https://mucollective.northwestern.edu/).

Show the code
names <- c("math average","sd pupillevel","sd schoollevel", "ICC")

posterior_samples(Model_math_null) %>%
  mutate(ICC = (sd_IDSCHOOL__Intercept^2/(sd_IDSCHOOL__Intercept^2 + sigma^2))) %>%
  select(b_Intercept,sigma,sd_IDSCHOOL__Intercept, ICC) %>%
  set_names(names) %>%
  pivot_longer(everything()) %>%
  mutate(name = factor(name, levels = c("math average",
                                        "sd schoollevel", 
                                        "sd pupillevel", 
                                        "ICC"))) %>%
  ggplot(aes(x = value, y = 0)) +
   stat_dotsinterval(.width = c(.50,.89,.95), 
                     normalize = "panels", 
                     quantiles = 100) +
   scale_y_continuous(NULL, breaks = NULL) +
   xlab("marginal posterior") +
   facet_wrap(~name, scales = "free")

Dot density plots for the marginal posteriors for the parameters from the null model

Another plot type you may encounter in multilevel publications is a ‘caterpillar’ plot. Such a plot shows the school-level deviations with, for each school, a credible interval around its deviation from the overall mean. We can make a similar plot easily, again making use of our posterior_samples() function. Before we do that, it is nice to know that we also have 4000 samples of parameter estimates for each \(u_j\) that can be used for plotting. A quick peek at these samples shows what I mean. After pulling the 4000 samples of parameter estimates out of the object containing our fitted model, we select all parameters related to IDSCHOOL making use of the convenient starts_with() function. Then, for the sake of demonstration, we only keep the deviation parameters for the first 3 schools in our dataset:

Show the code
head(
  posterior_samples(Model_math_null) %>%
    select(starts_with("r_IDSCHOOL")) %>%
    select(1:3), 5
)
  r_IDSCHOOL[5002,Intercept] r_IDSCHOOL[5003,Intercept]
1                   28.73855                   17.71205
2                   24.20709                   42.17878
3                   25.47716                   37.23581
4                   31.70367                   22.24863
5                   13.27713                   49.58657
  r_IDSCHOOL[5004,Intercept]
1                  -30.94638
2                  -60.69503
3                  -28.98984
4                  -49.08165
5                  -45.03160

So, what we get, for the schools with ID codes 5002, 5003 and 5004, is 4000 samples of their deviation from the overall mean. Of course, a plot makes a great summary of this. Let’s draw a caterpillar plot for a random selection of 20 schools in our sample.

Show the code
set.seed(2021)

posterior_samples(Model_math_null) %>% 
  select(starts_with("r_IDSCHOOL")) %>%
  select(sample(ncol(.), 20)) %>%         # randomly pick 20 of the 133 schools
  pivot_longer(starts_with("r_IDSCHOOL")) %>% 
  mutate(u_j = value) %>%                 # u_j: school-level deviation
  ggplot(aes(x = u_j, y = reorder(name, u_j))) +
    stat_pointinterval(point_interval = mean_qi, .width = .89, 
                       size = 1/6) +
    scale_y_discrete(expression(italic(j)), breaks = NULL) +
    labs(x = expression(italic(u)[italic(j)])) +
    coord_flip()

Caterpillar plot for a random selection of 20 schools with an 89% CI

Predictive model

Most applied researchers will want to introduce some predictor variables. To show how this is done in brms, let us introduce two variables that might be related to differences in mathematics scores: Girl and SES_SCHOOL.

Before we build our model it is wise to think about what we already know concerning the effects of both variables.

For the effect of Girl we can expect girls to score lower than boys given that for Flanders:

  • in TIMSS 2015 girls scored 6 points lower than boys and
  • in TIMSS 2011 girls scored 8 points lower than boys.

Based on that knowledge we can set a normal distribution with a mean of -7 and a standard deviation of 5 as a prior (see the figure below for a visual representation of our prior belief).

For the effect of SES_SCHOOL we have less information available to define our prior. But, based on our domain knowledge, we can at least expect that

  • schools with a “More Affluent” composition will score higher than
  • schools with a “Neither more Affluent nor more Disadvantaged” composition, which will in turn score higher than
  • schools with “More Disadvantaged” composition.

Therefore we will use priors that express this expectation but also express a larger amount of uncertainty than our uncertainty about the effect of Girl. We will use normal distributions again, this time with a mean of -10 and a standard deviation of 10 for the parameter that models the difference between the categories ‘More Affluent’ and ‘Neither more Affluent nor more Disadvantaged’ (this parameter will be called SES_SCHOOL2), and a mean of -20 and a standard deviation of 10 for the parameter that captures the difference between the categories ‘More Affluent’ and ‘More Disadvantaged’ (this parameter will be called SES_SCHOOL3).

The following plot gives a visualisation of our priors. Notice that the prior for the effect of Girl is more peaked and narrower than the two other priors, reflecting the greater certainty we have about this gender difference than about the effect of School Composition by Socioeconomic Background.

Show the code
x <- seq(-40,10,by=.1)
Girl         <- dnorm(x,-7,5)
SES_SCHOOL2  <- dnorm(x,-10,10)
SES_SCHOOL3  <- dnorm(x,-20,10)

tibble(x,Girl,SES_SCHOOL2,SES_SCHOOL3) %>%
  pivot_longer(c(Girl,SES_SCHOOL2,SES_SCHOOL3),names_to="Prior") %>%
  ggplot(aes(x = x, y = value, lty = Prior)) + 
  geom_line() + 
  scale_x_continuous("math",limits = c(-40,10)) +
  scale_y_continuous("probability density") 

Priors for our slope parameters

The brms code looks as follows:

Show the code
priorMath_predmodel <- 
  c(set_prior("normal(548.9,25)", class = "Intercept"),
    set_prior("normal(-7,5)", class = "b", coef = "Girl"),
    set_prior("normal(-10,10)", class = "b", coef = "SES_SCHOOL2"),
    set_prior("normal(-20,10)", class = "b", coef = "SES_SCHOOL3")
  )

Model_math_pred <- brm(
  math ~ 1 + Girl + SES_SCHOOL + (1|IDSCHOOL),
  data = Student_FL19,
  family = "gaussian",
  prior = priorMath_predmodel,
  cores = 4,
  seed = 2021
)

Let’s have a look at our output.

Show the code
load(file = "Model_math_pred.R")
summary(Model_math_pred)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: math ~ 1 + Girl + SES_SCHOOL + (1 | IDSCHOOL) 
   Data: Student_FL19 (Number of observations: 4257) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~IDSCHOOL (Number of levels: 133) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)    20.69      1.77    17.43    24.33 1.00     1620     2500

Population-Level Effects: 
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     548.89      2.59   543.77   553.81 1.00     2085     2319
Girl          -12.74      1.83   -16.31    -9.13 1.00    10671     2770
SES_SCHOOL2   -16.43      4.83   -25.76    -6.99 1.00     2241     2734
SES_SCHOOL3   -40.40      4.89   -49.81   -30.55 1.00     2238     2323

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    62.17      0.70    60.84    63.56 1.00     8486     2809

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

From the output we learn that the earlier decline in the mathematics gender gap in Flanders (see the results of TIMSS 2011 and TIMSS 2015) has stopped and that the difference between boys and girls has increased again. While the difference in the previous TIMSS wave was around 6 points, our model shows that we are 95% confident that in 2019 the difference is somewhere between 9 and 16 points. To interpret the output for the effect of SES_SCHOOL we will plot the marginal posterior distribution of the average scores for each of the three categories of schools, making use of the posterior_samples() function and mutate() to calculate the average scores implied by each of the 4000 sampled sets of parameters.

Show the code
names <- c("More Affluent",
           "Neither more Affluent nor more Disadvantaged",
           "More Disadvantaged")

posterior_samples(Model_math_pred) %>%
  mutate(Affluent = b_Intercept, 
         Neutral = (b_Intercept + b_SES_SCHOOL2) , 
         Disadvantaged = (b_Intercept + b_SES_SCHOOL3)) %>%
  select(Affluent, Neutral, Disadvantaged) %>%
  set_names(names) %>%
  pivot_longer(everything()) %>%
  mutate(name = factor(name, 
                       levels = c("More Disadvantaged",
                                  "Neither more Affluent nor more Disadvantaged", 
                                  "More Affluent"))) %>%
  ggplot(aes(x = value, y = 0)) +
  stat_dotsinterval(.width = .89, 
                    normalize = "panels", 
                    quantiles = 100) +
  scale_y_continuous(NULL, breaks = NULL) +
  xlab("marginal posterior") +
  facet_wrap(~name, scales = "free") 

Marginal posterior distributions for the average math scores according to School Composition by Socioeconomic Background

Notice that the three 89% credible intervals do not overlap, indicating that we may be quite sure that there are differences in students’ mathematics scores according to the School Composition by Socioeconomic Background of the school they attend.
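If you want to attach a number to such a statement, brms also provides the hypothesis() function. As a sketch (this particular comparison is my own addition, not part of the original analysis), we can ask for the posterior evidence that the deficit of the ‘More Disadvantaged’ schools is larger than that of the middle category:

# Posterior evidence that SES_SCHOOL3 is more negative than SES_SCHOOL2:
hypothesis(Model_math_pred, "SES_SCHOOL3 < SES_SCHOOL2")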

Comparing models

A final thing that we can do is compare models. In a frequentist framework researchers rely on a likelihood ratio test and/or compare information criteria like the AIC and the BIC. For Bayesian analyses we can rely on leave-one-out cross-validation methods. Readers interested in how these indices are calculated, and in a deeper dive into the philosophy behind these measures, are advised to read Lambert (2018).

To do this we can make use of the loo() and loo_compare() functions. For each model we apply the loo() function and store the results in a dedicated object. Then we use the loo_compare() function and print the result. The code looks like this:

Show the code
loo_1 <- loo(Model_math_naive)
loo_2 <- loo(Model_math_null)
loo_3 <- loo(Model_math_pred)

loo_comparison <- loo_compare(loo_1,loo_2,loo_3)

To save some time in compiling this blog post, I saved the results beforehand and simply load and print them here.

Show the code
load(file = "loo_comparison.R")
print(loo_comparison, simplify = F)
                 elpd_diff se_diff  elpd_loo se_elpd_loo p_loo    se_p_loo
Model_math_pred       0.0       0.0 -23673.7     44.5       103.3      2.4
Model_math_null     -27.4       7.4 -23701.1     44.4       113.4      2.6
Model_math_naive   -311.0      23.9 -23984.6     44.6         1.9      0.1
                 looic    se_looic
Model_math_pred   47347.3     89.1
Model_math_null   47402.2     88.8
Model_math_naive  47969.3     89.2

We can see that the model including the two predictors has the highest expected log pointwise predictive density (\(elpd\) = -23673.7), which corresponds to the lowest looic, indicating that this model fits best. The difference in \(elpd\) with the next best model is 27.4. This difference is clearly bigger than the accompanying standard error (3.7 times bigger, in fact), so we can conclude that this model is - I’m sorry guys - significantly better than the null model.
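For the record, that ratio is simply the elpd difference divided by its standard error, both taken from the loo_compare output above:

# elpd difference relative to its standard error:
27.4 / 7.4  # the difference is ~3.7 standard errors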

To conclude

That’s all folks…

I hope that this blog post gives you some guidance and stimulation to make the transition towards a Bayesian workflow!

References

Bürkner, Paul-Christian. 2017. “Brms: An r Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software, Articles 80 (1): 1–28. https://doi.org/10.18637/jss.v080.i01.
———. 2018. “Advanced Bayesian Multilevel Modeling with the R Package brms.” The R Journal 10 (1): 395–411. https://doi.org/10.32614/RJ-2018-017.
Lambert, Ben. 2018. A Student’s Guide to Bayesian Statistics. Sage, Thousand Oaks.
Rutkowski, Leslie, Eugenio Gonzalez, Marc Joncas, and Matthias Von Davier. 2010. “International Large-Scale Assessment Data.” Educational Researcher 39 (2): 142–51. https://doi.org/10.3102/0013189x10363170.
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2021. GGally: Extension to ’Ggplot2’. https://CRAN.R-project.org/package=GGally.
Stan Development Team. 2020. Stan Modeling Language Users Guide and Reference Manual. https://mc-stan.org.
Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2021. “Rank-Normalization, Folding, and Localization: An Improved \(\widehat{R}\) for Assessing Convergence of MCMC.” Bayesian Analysis. https://doi.org/10.1214/20-BA1221.
Wickham, Hadley, and Evan Miller. 2020. Haven: Import and Export ’SPSS’, ’Stata’ and ’SAS’ Files. https://CRAN.R-project.org/package=haven.