How strong is the influence of the priors I have chosen? A short introduction to the priorsense package.

R
Bayesian inference
Statistics
brms
priorsense

Defining your prior beliefs about parameters and translating them into probability distributions is key in Bayesian analyses. That’s also one of the aspects that typically receives a lot of critique. Sometimes you hear statements like ‘your results are driven by the priors you have set rather than by your data’. As a Bayesian you can actually check whether this is true and build a counterargument to this type of critique: a prior sensitivity analysis. In this post I demonstrate the use of a recently developed package that helps you in this endeavour: the priorsense package!

Author

Sven De Maeyer

Published

June 14, 2024

In Bayesian analyses we have the difficult duty as a researcher to express our prior knowledge (or belief) about the parameters in our model. More specifically, for each parameter we define a probability distribution that matches how we think about that parameter. Often, this is one of the main sources of critique on Bayesian analyses:

“You introduce subjectivity in your analyses! Your results might depend on your priors.”

In essence this can be the case. Bayesian analysis is about updating your priors with the information coming from your data.

The following meme nicely illustrates how the results of a Bayesian analysis come from combining data and priors:

So, if you have limited data, your priors can drive your results more strongly than if you have lots of data.
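To see why, consider the textbook case of a normal likelihood with known standard deviation and a normal prior on the mean: the posterior mean is a precision-weighted average of the prior mean and the sample mean, so the prior’s pull shrinks as the sample size grows. A small R sketch, with illustrative numbers loosely inspired by the fluency example later in this post:

Show the code
# Conjugate normal-normal case: the posterior mean is a precision-weighted
# average of the prior mean and the sample mean (sigma assumed known).
posterior_mean <- function(n, ybar, sigma = 7, prior_mean = 100, prior_sd = 3.5) {
  w_data  <- n / sigma^2      # precision contributed by the data
  w_prior <- 1 / prior_sd^2   # precision contributed by the prior
  (w_prior * prior_mean + w_data * ybar) / (w_prior + w_data)
}

posterior_mean(n = 5,   ybar = 110)   # ~105.5: still pulled towards the prior
posterior_mean(n = 500, ybar = 110)   # ~110.0: dominated by the data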

The potential impact of the chosen priors is not limited to the case of limited data. Complex models (i.e., models with a large number of parameters) imply setting a lot of priors that might interact. This “cocktail of priors” can result in a situation in which one or more parameters are only marginally updated by the data.

Given this potential impact of your choice of priors, scholars have emphasized the importance of actually checking the impact of priors, a practice called prior sensitivity analysis (Depaoli 2020).

The idea is compellingly simple: re-run your model with different priors and monitor the impact of this choice of priors on the results of the analyses. In practice, this might be an impossible endeavour. Which alternative priors to set, and for which (combination of) parameters? And what if the model is complex and already takes a long time to estimate?

A recently published article and the accompanying package priorsense have paved the way for an alternative way of checking how sensitive the results are to the chosen priors: Kallioinen et al. (2023).

In this post we will demonstrate the use of this package and how to interpret the results. If you want to skip the more conceptual explanation on how this package works, you can jump immediately to the section The Friends Case.

Before we go into power-scaling, let’s make sure we have the necessary packages loaded:
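Judging from the functions used in the code chunks below, a minimal set of packages would be something like this:

Show the code
library(brms)        # fitting the Bayesian models
library(priorsense)  # power-scaling sensitivity analysis
library(bayesplot)   # mcmc_areas_ridges() and color_scheme_set()
library(ggplot2)     # plotting
library(ggdist)      # stat_halfeye()
library(dplyr)       # select(), mutate(), case_when(), the pipe
library(tidyr)       # pivot_longer()
library(posterior)   # as_draws_df()
library(here)        # locating the data file
library(parameters)  # model_parameters()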

Power-scaling

Kallioinen et al. (2023) summarize the strengths of the proposed methodology as follows:

  • it is computationally efficient

  • it is applicable to a wide range of models

  • it provides automated diagnostics

  • it requires minimal change to the workflow and model code

At the core of this methodology is the idea of power-scaling probability distributions. Before we explain the rest of the methodology, it is good to get an idea of what is meant by power-scaling a probability distribution and its effect on the shape of that distribution. Let’s re-use the example from the paper. Imagine that you have used \(N(0,1)\) (a normal distribution with mean 0 and st.dev. 1) as the prior probability distribution for a certain parameter. In power-scaling we raise this probability density to a certain power \(\alpha\), changing the distribution to \(N(0,1)^\alpha\). The effect is that our prior probability distribution changes to \(N(0,\alpha^{-\frac{1}{2}} \cdot 1)\). What this implies is best inspected with a visualisation:

Show the code
library(purrr)
library(tibble)
library(ggplot2)

theme_set(theme_linedraw() +
            theme(panel.grid = element_blank()))

x <- seq(-2.5, 2.5, by = 0.001)

# power-scaling N(0,1) by alpha gives N(0, alpha^(-1/2)):
# alpha = 0.5 widens the distribution (sd sqrt(2)), alpha = 2 narrows it (sd sqrt(0.5))
my_data <- data.frame(
  mean  = c(0, 0, 0),
  stdev = c(1, sqrt(2) * 1, sqrt(0.5) * 1),
  test  = c("alpha = 1", "alpha = 0.5", "alpha = 2"),
  stringsAsFactors = FALSE)

tall_df2 <- pmap_df(
  my_data, 
  ~ tibble(
    x = x, 
    test = ..3, 
    density = dnorm(x, ..1, ..2)
    )
  )

p <- ggplot(
  tall_df2, 
  aes(color = factor(test), 
      x = x, 
      y = density)
  ) + 
  geom_line() +
  labs(
    title = "Power-scaling N(0,1)",
    x = "parameter value",
    y = "",
    color = ""
  )

print(p)

From this visualization we can learn that by applying power-scaling we increase or decrease the uncertainty in our prior probability distribution (see how the width of the distribution is affected by the power-scaling). The same can be done with distributions other than the normal. For instance, sometimes a t-distribution, a beta distribution or a gamma distribution is used to express our prior belief about a parameter. Power-scaling modifies the shape of these distributions in a way that they “result in distributions that likely represent similar implied assumptions to those of the base distribution” (Kallioinen et al. 2023). Power-scaling can also be applied to the likelihood of the data: raising the likelihood to a power higher or lower than 1 conceptually means giving the data more or less weight, respectively.
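As a quick numerical sanity check (not part of the priorsense workflow itself), we can power-scale the standard normal density, renormalise it, and verify that the resulting standard deviation is indeed \(\alpha^{-1/2}\):

Show the code
# power-scale dnorm(x, 0, 1) by alpha, renormalise, and compute the sd
alpha  <- 2
dx     <- 0.001
x      <- seq(-10, 10, by = dx)
scaled <- dnorm(x, 0, 1)^alpha
scaled <- scaled / sum(scaled * dx)   # renormalise to a proper density
sqrt(sum(x^2 * scaled * dx))          # approx. 0.707 = 2^(-1/2)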

Now we can focus on how power-scaling is actually used. In the proposed methodology, the draws from the MCMC estimation (aka the results of our model) are used as a starting point. These MCMC draws (the possible values of the different parameters in the model) are given a weight based on changing the shape of the prior probability distribution (i.e., power-scaling the prior) or the weight of the likelihood (i.e., power-scaling the likelihood). These weighted MCMC draws are then used to reconstruct a perturbed posterior probability distribution. Finally, the original posterior probability distribution is compared to the perturbed posterior probability distribution to evaluate the effect of changing the shape of the prior or the weight of the data.
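To make this re-weighting idea a bit more concrete, here is a conceptual sketch (not the actual priorsense implementation, which additionally relies on Pareto-smoothed importance sampling). When the prior is power-scaled by \(\alpha\), the perturbed posterior is proportional to \(p(\theta)^{\alpha}\,p(y \mid \theta)\), so every existing draw gets an importance weight proportional to \(p(\theta)^{\alpha - 1}\); the helper function and the prior in the commented example are hypothetical:

Show the code
# Conceptual sketch: re-weight existing posterior draws so they approximate
# the posterior under the power-scaled prior p(theta)^alpha. The likelihood
# cancels out of the weight, leaving p(theta)^(alpha - 1).
powerscale_prior_weights <- function(theta, log_prior, alpha) {
  log_w <- (alpha - 1) * log_prior(theta)   # log importance weights
  w <- exp(log_w - max(log_w))              # subtract max for numerical stability
  w / sum(w)                                # normalise to sum to 1
}

# Hypothetical usage with draws of one parameter and a N(100, 3.5) prior:
# theta <- as_draws_df(M1)$b_Occ3_FL
# w     <- powerscale_prior_weights(theta,
#                                   function(t) dnorm(t, 100, 3.5, log = TRUE),
#                                   alpha = 0.5)
# weighted.mean(theta, w)   # mean of the perturbed posterior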

Of course, the way I present the core idea of this methodology is an over-simplification. If you want to know how this all works in detail, head over to the original paper of Kallioinen et al. (2023).

The Friends Case

Although the maths behind the power-scaling used in the priorsense package is quite complex, applying a prior sensitivity analysis with this package is rather straightforward. In this section I will demonstrate some of the features and results of the package by applying them to an example: the friends study.

This is a simulated dataset based on an existing study by Frumuselu et al. (2015). In that study, the key question was whether subtitles help in foreign language acquisition. Spanish students (n = 36) watched episodes of the popular tv-show “Friends” for half an hour each week, during 26 weeks. The students were assigned to 3 conditions:

  • English subtitled (condition “FL”, being Foreign Language)

  • Spanish subtitled (condition “MT”, being Mother Tongue)

  • No subtitles (condition “NoSub”, being No Subtitles)

At 3 occasions students got a Fluency test:

  • Before the 26 weeks started (Occ1)

  • After 12 weeks (Occ2)

  • At the end of the semester (Occ3)

The dependent variable is a measure based on the number of words used in a scripted spontaneous interview with a test taker. The data is structured in a long format, as follows:

Show the code
load(
  file = here("posts", "2024-02-PriorSense", "Subtitles.RData")
)

head(Subtitles, 9)
  student occasion condition fluency Occ1_NoSub Occ1_FL Occ1_MT Occ2_NoSub
1       1     Occ1        FL  101.25          0       1       0          0
2       1     Occ2        FL  103.76          0       0       0          0
3       1     Occ3        FL  117.39          0       0       0          0
4       2     Occ1        MT   98.79          0       0       1          0
5       2     Occ2        MT  106.75          0       0       0          0
6       2     Occ3        MT  110.54          0       0       0          0
7       3     Occ1     NoSub  104.83          1       0       0          0
8       3     Occ2     NoSub  102.04          0       0       0          1
9       3     Occ3     NoSub  100.63          0       0       0          0
  Occ2_FL Occ2_MT Occ3_NoSub Occ3_FL Occ3_MT
1       0       0          0       0       0
2       1       0          0       0       0
3       0       0          0       1       0
4       0       0          0       0       0
5       0       1          0       0       0
6       0       0          0       0       1
7       0       0          0       0       0
8       0       0          0       0       0
9       0       0          1       0       0

The model we will use

We start by defining a first model for our data. In Model 1 we estimate a population average for each combination of measurement occasion and condition, resulting in a model with 9 population averages.
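Written out, and assuming the default Gaussian likelihood in brms with a single residual standard deviation \(\sigma\), the model looks roughly as follows:

\[
\text{fluency}_{i} \sim N(\mu_i, \sigma), \qquad
\mu_i = \sum_{o=1}^{3} \sum_{c \,\in\, \{\text{NoSub},\, \text{FL},\, \text{MT}\}} \beta_{o,c}\, D_{o,c,i}
\]

where \(D_{o,c,i}\) equals 1 if observation \(i\) was taken at occasion \(o\) in condition \(c\), and 0 otherwise, so that each \(\beta_{o,c}\) is the population average for that cell.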

To implement this model in brms we can use the 9 dummy variables in the dataset that indicate whether an observation belongs to a certain combination of occasion and condition. For instance, the dummy variable Occ1_NoSub has a value of 1 if the observation in that row comes from the No Subtitles condition at the first occasion. All other observations get a value of 0 for that dummy variable.
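As an aside, if such dummies were not already in the dataset, they could be created from the occasion and condition columns, for instance like this (a sketch; the Subtitles data used below already contains them):

Show the code
library(dplyr)
library(tidyr)

# build one 0/1 column per occasion-condition combination (e.g. "Occ1_NoSub")
Subtitles_with_dummies <- Subtitles %>%
  select(student, occasion, condition, fluency) %>%
  mutate(cell = paste(occasion, condition, sep = "_"),
         one  = 1L) %>%
  pivot_wider(names_from = cell, values_from = one, values_fill = 0L)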

Priors version 1

In our first model we will, for the sake of the example, make use of some prior knowledge about our fluency measure. Let’s assume we have information that in a similar population students score 100 on average for this fluency test and that the estimated population standard deviation is 7. If we want to take this information into account in our prior probability distribution, we could, for instance, use a normal distribution with mean 100 and standard deviation 3.5 for all the betas in our model (\(N(100,3.5)\)). So, we assume that for each combination of occasion and condition, the average is most likely (around) 100. The standard deviation of 3.5 for our prior distribution implies that we still give some probability to medium effect sizes, as 3.5 corresponds to a medium effect (0.5 standard deviations) in the Cohen’s d framework. To implement this in brms we can set our own prior for all the b parameters. Next we can run our model.

Show the code
Custom_prior1 <- c(
  set_prior(
    "normal(100,3.5)",   # N(100, 3.5) for all population averages
    class = "b"
  )
)

M1 <- brm(
    fluency ~ -1 + Occ1_NoSub + Occ1_FL + Occ1_MT +
    Occ2_NoSub + Occ2_FL + Occ2_MT +
    Occ3_NoSub + Occ3_FL + Occ3_MT,  
  data = Subtitles,
  cores = 4,
  backend = "cmdstanr",
  seed = 1975,
  prior = Custom_prior1,
  sample_prior = "yes"   # also store prior draws (used in the diagnostics and plots below)
)

Before we dive into the prior sensitivity analyses, let’s have a quick look at the results of the model:

Show the code
parameters::model_parameters(M1)
# Fixed Effects

Parameter  | Median |           95% CI |   pd |  Rhat |     ESS
---------------------------------------------------------------
Occ1_NoSub | 102.31 | [100.17, 104.45] | 100% | 1.000 | 8338.00
Occ1_FL    | 100.19 | [ 98.06, 102.24] | 100% | 0.999 | 7635.00
Occ1_MT    | 101.09 | [ 98.97, 103.19] | 100% | 0.999 | 6331.00
Occ2_NoSub | 102.05 | [ 99.95, 104.19] | 100% | 0.999 | 8015.00
Occ2_FL    | 106.27 | [104.04, 108.48] | 100% | 0.999 | 8767.00
Occ2_MT    | 108.02 | [105.70, 110.19] | 100% | 0.999 | 7341.00
Occ3_NoSub | 104.57 | [102.40, 106.75] | 100% | 0.999 | 7976.00
Occ3_FL    | 117.81 | [115.52, 119.98] | 100% | 1.000 | 7982.00
Occ3_MT    | 111.28 | [108.97, 113.39] | 100% | 0.999 | 7414.00

# Sigma

Parameter | Median |       95% CI |   pd |  Rhat |     ESS
----------------------------------------------------------
sigma     |   4.05 | [3.52, 4.72] | 100% | 1.000 | 5268.00

or visually:

Show the code
color_scheme_set(scheme = "red")

mcmc_areas_ridges(
  M1,
  pars = c(
  "b_Occ1_NoSub", "b_Occ1_FL", "b_Occ1_MT", 
  "b_Occ2_NoSub", "b_Occ2_FL", "b_Occ2_MT", 
  "b_Occ3_NoSub", "b_Occ3_FL", "b_Occ3_MT"
  ),
  prob = .89
)

The parameter estimation seems to have converged quite well (see the R-hat values and ESS). From the model we learn that at the start of the study students in the three conditions score similarly (see how the top three probability distributions overlap). At the second occasion the fluency of students in the FL and MT conditions has most probably improved, which is not the case for students in the NoSub condition. At the end of the semester students in the FL condition seem to outperform the students in the MT condition.

The key question is of course, how sensitive are the analyses to changes in our priors? Time to unpack the powers of the priorsense package.

A first thing we could do is visualize the impact of the different power-scalings on the posterior probability distribution. This can be done with the powerscale_plot_dens() function, as follows (note how we apply it to a subset of parameters with the variable argument):

Show the code
powerscale_plot_dens(
  powerscale_sequence(
    M1
  ),
  variable = c(
    "b_Occ1_NoSub",
    "b_Occ2_NoSub",
    "b_Occ3_NoSub"
  )
)

The top row of this visualisation shows how the posterior probability distribution for each of the selected parameters changes when the prior is power-scaled. As we can see, the posterior probability density functions shift somewhat when the alpha changes (changing the alpha means using that value as the power in the power-scaling). This pattern is most apparent for the b_Occ2_NoSub and b_Occ3_NoSub parameters: lower alpha values (bluer lines) result in a posterior that is shifted to the right. This is an indication of prior sensitivity for those parameters (although not very pronounced). Looking at the bottom row, we see how changing the weight of the likelihood (aka giving more or less weight to our data) changes the peakedness of the posterior probability distribution. This indicates likelihood sensitivity, which is what you (most of the time) hope to see: results are determined by the data.

We can apply the same visualizations for the other parameters as well. First for the FL condition.

Show the code
powerscale_plot_dens(
  powerscale_sequence(
    M1
  ),
  variable = c(
    "b_Occ1_FL",
    "b_Occ2_FL",
    "b_Occ3_FL"
  )
)

Next for the MT condition as well.

Show the code
powerscale_plot_dens(
  powerscale_sequence(
    M1
  ),
  variable = c(
    "b_Occ1_MT",
    "b_Occ2_MT",
    "b_Occ3_MT"
  )
)

These visualizations show a similar pattern to the one we saw earlier for the NoSub condition: changing the prior distribution for the parameters related to occasions 2 and 3 also changes the posterior probability distribution. This is most pronounced for the FL condition at occasion 3. We also see that all these parameters are sensitive to changes in the likelihood as well.

Until now, we have relied on visual inspection, which is of course prone to subjective interpretation and bias. In their paper, Kallioinen et al. (2023) also introduce a diagnostic measure that summarises the distance between the actual posterior probability distribution based on our priors and the perturbed posterior probability distributions obtained by power-scaling: \(D_{CJS}\). They also provide a guideline threshold for interpreting this diagnostic: a value of \(D_{CJS} \ge 0.05\) for the prior or for the likelihood flags prior or likelihood sensitivity, respectively.

If we use this threshold to decide on prior and/or likelihood sensitivity, we can identify 4 situations that are shown in the next table:

                        Prior sensitivity
Likelihood sensitivity  no                      yes
no                      -                       likelihood noninformativity
yes                     likelihood domination   prior-data conflict

Source: Kallioinen et al. (2023), page 7

What most researchers will hope for is likelihood domination: the results are not sensitive to changes in the prior and are mainly determined by the likelihood (our data).

To get a diagnostic summary of the power-scaling methodology, based on this \(D_{CJS}\) metric, we can use the powerscale_sensitivity() function, passing it the object that contains our model results. Let’s apply it here.

Show the code
powerscale_sensitivity(M1)

Sensitivity based on cjs_dist:
# A tibble: 12 × 4
   variable      prior likelihood diagnosis          
   <chr>         <dbl>      <dbl> <chr>              
 1 b_Occ1_NoSub 0.0427     0.110  -                  
 2 b_Occ1_FL    0.0163     0.104  -                  
 3 b_Occ1_MT    0.0182     0.0902 -                  
 4 b_Occ2_NoSub 0.0328     0.112  -                  
 5 b_Occ2_FL    0.0887     0.167  prior-data conflict
 6 b_Occ2_MT    0.116      0.207  prior-data conflict
 7 b_Occ3_NoSub 0.0771     0.149  prior-data conflict
 8 b_Occ3_FL    0.253      0.356  prior-data conflict
 9 b_Occ3_MT    0.195      0.299  prior-data conflict
10 sigma        0.114      0.316  prior-data conflict
11 prior_b      0.0190     0.0202 -                  
12 prior_sigma  0.0618     0.0943 prior-data conflict

The output shows a number of prior-data conflicts in our results. In our case, this is most probably due to the fact that we set a rather narrow prior probability distribution for the means, located at a different place than the actual posterior probability distribution. This is most apparent for the parameter b_Occ3_FL. The following visualization shows how far our prior is away from the actual posterior probability distribution, with essentially no overlap between the two.

Show the code
as_draws_df(M1) %>%
  select("prior_b", "b_Occ3_FL") %>%
  pivot_longer(
    everything()
  ) %>%
  mutate(
    probability_dist = case_when(
      name == "prior_b" ~ "prior probability",
      name == "b_Occ3_FL" ~ "posterior probability"
    )
  ) %>%
  ggplot(
    aes(
    x = value,
    fill = probability_dist
    )
  ) +
  stat_halfeye(
    alpha = .45
  ) +
  scale_fill_discrete('') + 
  scale_y_continuous(name = "", breaks = NULL)

Priors version 2

Ok, time to set some more sensible priors and check how strongly these priors influence the results.

Let’s say that we know from previous research that students score around 100 on average for this type of test. So, we can use a normal distribution with mean 100 as prior probability distribution for the 3 averages at the pretest. We set the prior somewhat wide by applying a standard deviation of 5.7 (remember that the standard deviation of the scores themselves was estimated at around 7 points). We also assume that students will progress over time. So, for the averages at the second measurement occasion we use a normal distribution with a mean of 105, and for the averages at the third measurement occasion a normal distribution with a mean of 110. For all these priors we incorporate some amount of uncertainty, reflected in a standard deviation of 5.7.

The following code shows how to set these priors.

Show the code
Custom_prior2 <- c(
  set_prior(
    "normal(100,5.7)",
    class = "b",
    coef = "Occ1_NoSub"
  ),
  set_prior(
    "normal(100,5.7)",
    class = "b",
    coef = "Occ1_MT"
  ),
  set_prior(
    "normal(100,5.7)",
    class = "b",
    coef = "Occ1_FL"
  ),
  set_prior(
    "normal(105,5.7)",
    class = "b",
    coef = "Occ2_NoSub"
  ),
  set_prior(
    "normal(105,5.7)",
    class = "b",
    coef = "Occ2_MT"
  ),
  set_prior(
    "normal(105,5.7)",
    class = "b",
    coef = "Occ2_FL"
  ),
  set_prior(
    "normal(110,5.7)",
    class = "b",
    coef = "Occ3_NoSub"
  ),
  set_prior(
    "normal(110,5.7)",
    class = "b",
    coef = "Occ3_MT"
  ),
  set_prior(
    "normal(110,5.7)",
    class = "b",
    coef = "Occ3_FL"
  )
)

Now we are ready to estimate the model again, making use of these priors.

Show the code
M2 <- brm(
    fluency ~ -1 + Occ1_NoSub + Occ1_FL + Occ1_MT +
    Occ2_NoSub + Occ2_FL + Occ2_MT +
    Occ3_NoSub + Occ3_FL + Occ3_MT,  
  data = Subtitles,
  cores = 4,
  backend = "cmdstanr",
  seed = 1975,
  prior = Custom_prior2,
  sample_prior = "yes"
)

Time to check the prior sensitivity of our results.

Show the code
powerscale_sensitivity(M2)

Sensitivity based on cjs_dist:
# A tibble: 10 × 4
   variable       prior likelihood diagnosis
   <chr>          <dbl>      <dbl> <chr>    
 1 b_Occ1_NoSub 0.0173      0.0952 -        
 2 b_Occ1_FL    0.00800     0.0938 -        
 3 b_Occ1_MT    0.00791     0.0901 -        
 4 b_Occ2_NoSub 0.0146      0.0913 -        
 5 b_Occ2_FL    0.0115      0.0901 -        
 6 b_Occ2_MT    0.0170      0.103  -        
 7 b_Occ3_NoSub 0.0245      0.103  -        
 8 b_Occ3_FL    0.0457      0.131  -        
 9 b_Occ3_MT    0.0150      0.0848 -        
10 sigma        0.00572     0.187  -        

A miracle! As we can see, no issues anymore: our results are not overly dependent on the priors. Let’s visualize this for the MT condition, which contained some of the more troublesome parameters in our first iteration of the model.

Show the code
powerscale_plot_dens(
  powerscale_sequence(
    M2
  ),
  variable = c(
    "b_Occ1_MT",
    "b_Occ2_MT",
    "b_Occ3_MT"
  )
)

See how this time only power-scaling the likelihood has an impact on the posterior probability distribution.

Wrap-up

Priors are a key concept in Bayesian modelling and they can be “a pain in the ass”. Reviewers might reject your work, arguing that Bayesian models are too dependent on the prior knowledge incorporated in the analyses. With the priorsense package you can evaluate this claim, not only for the sake of the reviewers but mainly for the sake of good research. I hope this post shows that critically evaluating prior sensitivity is not rocket science for the applied researcher. Have fun!

References

Depaoli, Sarah, Rens van de Schoot, Duco Veen, Laurent Smeets, and Sonja D. Winter. 2020. “A Tutorial on Using the WAMBS Checklist to Avoid the Misuse of Bayesian Statistics.” In Small Sample Size Solutions: A Guide for Applied Researchers and Practitioners. Routledge.
Frumuselu, Anca Daniela, Sven De Maeyer, Vincent Donche, and María del Mar Gutiérrez Colon Plana. 2015. “Television Series Inside the EFL Classroom: Bridging the Gap Between Teaching and Learning Informal Language Through Subtitles.” Linguistics and Education 32 (December): 107–17. https://doi.org/10.1016/j.linged.2015.10.001.
Kallioinen, Noa, Topi Paananen, Paul-Christian Bürkner, and Aki Vehtari. 2023. “Detecting and Diagnosing Prior and Likelihood Sensitivity with Power-Scaling.” Statistics and Computing 34 (1): 57. https://doi.org/10.1007/s11222-023-10366-5.