September 12, 2019
When researchers apply statistical hypothesis testing they commonly rely on the famous p-value. A p-value has the following meaning: how likely is it that I would encounter data similar to mine, given that the null hypothesis is true? So these statistical routines actually inform us about the null hypothesis (which in most cases is simply the hypothesis that there is no effect or no difference at all). Many scholars have started to criticize this p-value oriented approach, discussing its risks and flaws. It has even brought the American Statistical Association to the point of publishing a statement on p-values (see article).
One critique that is expressed refers to the goal of “good science”. According to K. P. Burnham, Anderson, and Huyvaert (2011), good scientists
“make a major effort to think hard about the science question and then define several plausible alternative hypotheses”.
When we rely only on p-values we are actually defining two hypotheses: the null hypothesis (which in many cases doesn’t make sense anyway) and the alternative hypothesis. A better approach would be to formulate a set of alternative ‘working hypotheses’ and evaluate them.
To criticize things is one thing; providing solutions is something different. One of the upcoming alternatives is the idea of model-based inference or model selection. The basic idea behind this approach is that a researcher translates alternative hypotheses into different statistical models. Based on measures coming from information theory, we can then express the probability of each of the alternative models. The evidence coming from the different models can also be combined to make inferences.
In this post we give a short demo of how such an approach can be applied, using it to inform us on who is right in a discussion that got out of hand…
The discussion was about basketball and it happened at the kitchen table of “a certain family”. More specifically, the discussion was about whether Kobe Bryant was one of the most consistent shooters ever in the NBA. One of the arguments was that his shot success was hardly influenced by any extraneous factor. But which factors did have an influence on Kobe Bryant’s shot success? Let’s find out. Here we go with our statistical story. We do our demonstration in R, making use of the AICcmodavg package.
Data analysis is big business in the NBA. Each team has at least one data analyst gathering data during games to make predictions. One of the things documented in these datasets is the position from which a player takes a shot and whether the shot was a success or not…
For our example we have data on each shot that Kobe Bryant took in his NBA career. The data looks like this (and you can download a .csv file to amuse yourself further with the data).
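If you want to play along, loading the data could look something like this (a minimal sketch; the file name KobeBryant.csv is hypothetical, so point it to wherever you saved the download):

```r
# Read the shot-by-shot data (hypothetical file name; adjust the path
# to wherever you saved the downloaded .csv file)
KobeBryant <- read.csv("KobeBryant.csv")

# Peek at the first rows
head(KobeBryant)
```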
The variable ‘Scored’ registers whether the shot was successful or not. A simple table shows the numbers…
# Scored (0 = missed, 1 = scored)
table(KobeBryant$Scored)
Besides that variable, we also have the following variables:
The structure of the data looks like this:
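A quick way to get this overview in R is str() (assuming the data frame is called KobeBryant, as above):

```r
# Compact overview of all variables and their types
str(KobeBryant)
```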
Which of these variables can predict shot accuracy?
We have 4 candidate explanatory variables.
How do we decide which variables to include in our model?
Well, let’s consult the consultants (or, according to themselves, the experts): the two kids of the family and their mother.
As one can expect, their opinions are somewhat divergent:
These expert hypotheses can be translated into the following statistical models:
| Expert | Model |
|---|---|
| N. | \(Logit(Scored = 1) = \beta_{0} + \beta_{1}*D\) |
| T. | \(Logit(Scored = 1) = \beta_{0} + \beta_{1}*D + \beta_{2}*SC + \beta_{3}*Q2 + \beta_{4}*Q3 + \beta_{5}*Q4 + \beta_{6}*Extra\) |
| W. | \(Logit(Scored = 1) = \beta_{0} + \beta_{1}*D + \beta_{2}*LM\) |

D = Distance_in_m; SC = Shot_clock; LM = Last_minutes
These models are so-called logistic regression models, in which we predict shot success (a categorical variable with two categories).
Let’s estimate each candidate model with glm().
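A minimal sketch of these three models is given below. The formulas follow the table above; the quarter dummies Q2-Q4 and Extra are assumed to come from a factor called Quarter, so adjust the variable names to your own version of the data.

```r
# Model N: only distance to the ring
Model_N <- glm(Scored ~ Distance_in_m,
               family = binomial, data = KobeBryant)

# Model T: distance, shot clock and the quarter dummies
# (Quarter is assumed to be a factor with levels Q1-Q4 and Extra)
Model_T <- glm(Scored ~ Distance_in_m + Shot_clock + Quarter,
               family = binomial, data = KobeBryant)

# Model W: distance and whether the shot was taken in the last minutes
Model_W <- glm(Scored ~ Distance_in_m + Last_minutes,
               family = binomial, data = KobeBryant)
```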
Now we can introduce the idea of model-based inference. At the core of this method lies the idea that there is no such thing as the perfect model, but some models are better approximations of reality than others. So instead of searching for the one single best model, we want to evaluate the alternative models on their performance.
To evaluate the models we can make use of different measures, but the most frequently used one is the AIC, and for our demo we rely on it. Be aware, though, that in certain situations AIC can be a sub-optimal choice (see Symonds and Moussalli (2011)).
The AIC can best be defined as:
“… a numerical value by which to rank competing models in terms of information loss in approximating the unknowable truth” (Symonds and Moussalli (2011))
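In formula form, for a model with \(K\) estimated parameters and maximized log-likelihood \(\ln(\hat{L})\):

\[AIC = -2\ln(\hat{L}) + 2K\]

The first term rewards goodness of fit, the second penalizes complexity; this is the correction for the number of parameters we come back to below.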
In R we can make use of the aictab() function from the AICcmodavg package. The input for this function is a list of estimated models. Let’s apply it:
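A sketch, building on the models estimated above (setting second.ord = FALSE reports plain AIC rather than the small-sample corrected AICc):

```r
library(AICcmodavg)

# Collect the candidate models in a list, together with readable names
Candidate_models <- list(Model_N, Model_T, Model_W)
Model_names <- c("Model_N", "Model_T", "Model_W")

# Rank the candidate models from best to worst
aictab(cand.set = Candidate_models, modnames = Model_names,
       second.ord = FALSE)
```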
The table ranks the models from best to worst approximations of “the truth”: the lower the AIC, the better the model. So in our case the model of W (effect of Distance_in_m and Last_minutes) is ranked as the best model. The second best model is the model of T (effect of Distance_in_m and the dummy variables Q2 - Q4 & Extra), a model that contains more predictors (see column K in our table, which prints the number of parameters including the intercept). The model of N is ranked as the poorest model.
The column Delta_AIC gives, for each model, the difference in AIC between that model and the best ranked model. So Model T’s AIC is 3.19 higher than Model W’s, and the model of N results in an AIC that is 14.88 higher than the best ranked model (Model W).
The column LL gives the log-likelihood of each model. Model T (the more complex model) has a higher LL than the model of W, which would favor the model of T. But as Model W is more parsimonious it gets a lower AIC (remember that AIC penalizes a model for the number of parameters it uses).
Another piece of information in our table is the Akaike weight (column AICwt). A simple interpretation of this weight is as follows. Model W, the best ranked model, has an AICwt of 0.83. This indicates that Model W has an 83% probability of being the best approximating model, given the set of candidate models. Model T has a lower probability of being the best approximating model: only 16.8% (AICwt = 0.168). The model of N has a very low probability of being the best approximating model (only 0.04%).
If we take all the information in our table into account, we could conclude that Model W and Model T are the two best models in our set, with Model W being the best approximating model. The evidence for a model stating that only the distance to the ring is an important indicator of shot success (Model N) is very low, so we can almost certainly conclude that distance isn’t all that matters.
But can we really reject this model? Are there any guidelines? How do we decide which models to use further in our analyses to interpret the results? Should we choose one single model and continue our inferences based on that single model? There is no single answer to these questions, but, as Symonds and Moussalli (2011) state:
“as a coarse guide, models with \(\Delta_i\) values less than 2 are considered to be essentially as good as the best model, and models with \(\Delta_i\) up to 6 should probably not be discounted (Richards (2005)). Above this, model rejection might be considered, and certainly models with \(\Delta_i\) values greater than 10 are sufficiently poorer than the best AIC model as to be considered implausible (K. Burnham and Anderson (2002)).”
In this quote \(\Delta_i\) refers to the column Delta_AIC in our table. Applying this coarse guide, we can conclude that we can only reject the model of N and that Model T should not be discounted.
Now that we know who’s the expert at the kitchen table, we still do not know how, for example, distance influenced the shot success of Kobe Bryant. It is great that we have two plausible models, but from which model do we need to extract our information to learn more about the influence of distance to the ring?
In the model-based inference approach this is tackled by taking into account the relevant information from the different models. This is done by averaging the relevant parameter estimates, after weighting them according to the Akaike weight of each model. Sounds a bit puzzling? In practice it is not that difficult. Here is the command we can use in R:
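A minimal sketch with the modavg() function from AICcmodavg, assuming the model list and names defined above; the argument parm names the coefficient we want to average (in formula form, the averaged estimate is \(\hat{\bar{\beta}} = \sum_{i} w_{i}\hat{\beta}_{i}\), with \(w_{i}\) the Akaike weight of model \(i\)):

```r
# Model-averaged estimate of the effect of distance, weighted by the
# Akaike weights of the candidate models (second.ord = FALSE for AIC)
modavg(cand.set = Candidate_models, parm = "Distance_in_m",
       modnames = Model_names, second.ord = FALSE)
```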
The column Estimate contains the estimate of the effect of Distance_in_m in each model in which this variable is included. In each model the estimate is similar (-0.14). The Model-averaged estimate is also -0.14. To calculate this averaged estimate, Model_W is given the highest weight (0.83) and Model_N the lowest (\(\approx 0.00\)). The 95% confidence interval is also a ‘weighted version’, indicating in our case that we are pretty sure that the distance to the ring had a negative impact on the probability of successfully throwing the ball through the ring, even for a top player like Kobe Bryant.
To get an idea of how shot success is influenced by distance, we can plot our estimated model, making use of the plot_model command from the sjPlot package.
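For instance, plotting the predicted probability of scoring against distance for the best ranked model could look like this (a sketch; Model_W and the column name Distance_in_m are taken from the code above):

```r
library(sjPlot)

# Predicted probability of scoring as a function of distance to the ring,
# based on the best ranked model (Model_W)
plot_model(Model_W, type = "pred", terms = "Distance_in_m")
```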
Seems that Kobe Bryant’s shots were quite accurate!
The typical null-hypothesis significance testing approach is under pressure. One of the upcoming alternatives is model-based inference. We hope you got a rough idea of how this approach can be applied. More in-depth reading is of course recommended; the references in this post can help. A great read as well is Chapter 6 in THE book you must read on statistics: McElreath (2018). This book is also nicely ‘recoded’ using modern R packages in Statistical Rethinking Recoded.
Have fun while leaRning!