September 12, 2019
When researchers apply statistical hypothesis testing they commonly rely on the famous p-value. A p-value has the following meaning: how likely is it that I would encounter data similar to mine, given that the null hypothesis is true? So these statistical routines actually inform us about the null hypothesis (which in most cases is simply the hypothesis that there is no effect or no difference at all). Many scholars have started to criticize this p-value oriented approach, discussing its risks and flaws. It has even brought the American Statistical Association to the point of publishing a statement on p-values (see article).
One critique that is expressed refers to the goal of “good science”. According to K. P. Burnham, Anderson, and Huyvaert (2011), good scientists
“make a major effort to think hard about the science question and then define several plausible alternative hypotheses”.
When we rely only on p-values we are actually defining two hypotheses: the null hypothesis (which in many cases doesn’t make sense anyway) and the alternative hypothesis. A better approach would be to formulate a set of alternative ‘working hypotheses’ and evaluate them.
To criticize things is one thing; providing solutions is something different. One of the upcoming alternatives is the idea of model-based inference or model selection. The basic idea behind this approach is that a researcher translates alternative hypotheses into different statistical models. Based on measures coming from information theory, we can then express the probability of each of the alternative models. The evidence coming from the different models can also be combined to make inferences.
In this post we give a short demo of how such an approach can be applied, using it to inform us on who is right in a discussion that got out of hand…
The discussion was about basketball and it happened at the kitchen table of “a certain family”. More specifically, the discussion was about whether Kobe Bryant was one of the most consistent shooters ever in the NBA. One of the arguments was that his shot success was hardly influenced by any extraneous factor. But which factors did have an influence on Kobe Bryant’s shot success? Let’s find out. Here we go with our statistical story. We do our demonstration in R, making use of the AICcmodavg package.
Data analysis is big business in the NBA. Each team has at least one data analyst gathering data during games to make predictions. One of the things documented in these datasets is the position from which a player takes a shot and whether the shot was a success or not…
For our example we have data on each shot that Kobe Bryant took in his NBA career. The data looks like this (and you can download a .csv file to amuse yourself further with the data).
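If you want to play along, loading the data could look something like this (a minimal sketch; the file name KobeBryant.csv is hypothetical, so point it to wherever you saved the download):

```r
# Read the shot-by-shot data (hypothetical file name; adjust the path
# to wherever you saved the downloaded .csv file)
KobeBryant <- read.csv("KobeBryant.csv")

# Peek at the first rows
head(KobeBryant)
```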
The variable ‘Scored’ registers whether the shot was successful or not. A simple table shows the numbers…
# Scored (0 = missed, 1 = scored)
table(KobeBryant$Scored)
Besides that variable, we also have the following variables:
The structure of the data looks like this:
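A quick way to get this overview in R is str() (assuming the data frame is called KobeBryant, as above):

```r
# Compact overview of all variables and their types
str(KobeBryant)
```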
Which of these variables can predict shot accuracy?
We have 4 candidate explanatory variables.
How do we decide which variables to include in our model?
Well, let’s consult the consultants (or, according to themselves, the experts): the two kids of the family and their mother.
As one can expect, their opinions are somewhat divergent:
These expert hypotheses can be translated into the following statistical models:
| Expert | Model |
|---|---|
| N. | \(Logit(Scored = 1) = \beta_{0} + \beta_{1}*D\) |
| T. | \(Logit(Scored = 1) = \beta_{0} + \beta_{1}*D + \beta_{2}*SC + \beta_{3}*Q2 + \beta_{4}*Q3 + \beta_{5}*Q4 + \beta_{6}*Extra\) |
| W. | \(Logit(Scored = 1) = \beta_{0} + \beta_{1}*D + \beta_{2}*LM\) |

D = Distance_in_m; SC = Shot_clock; LM = Last_minutes
These models are so-called logistic regression models, in which we predict shot success (a categorical variable with two categories).
Let’s estimate each candidate model with glm().
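A minimal sketch of these three models is given below. The formulas follow the table above; the quarter dummies Q2-Q4 and Extra are assumed to come from a factor called Quarter, so adjust the variable names to your own version of the data.

```r
# Model N: only distance to the ring
Model_N <- glm(Scored ~ Distance_in_m,
               family = binomial, data = KobeBryant)

# Model T: distance, shot clock and the quarter dummies
# (Quarter is assumed to be a factor with levels Q1-Q4 and Extra)
Model_T <- glm(Scored ~ Distance_in_m + Shot_clock + Quarter,
               family = binomial, data = KobeBryant)

# Model W: distance and whether the shot was taken in the last minutes
Model_W <- glm(Scored ~ Distance_in_m + Last_minutes,
               family = binomial, data = KobeBryant)
```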
Now we can introduce the idea of model-based inference. At the core of this method lies the idea that there is no such thing as the perfect model, but some models are better approximations of reality than others. So instead of searching for the one single best model, we want to evaluate the alternative models on their performance.
To evaluate the models we can make use of different measures, but the most frequently used one is the AIC, and for our demo we rely on it. Be aware, though, that in certain situations AIC can be a sub-optimal choice (see Symonds and Moussalli (2011)).
The AIC can best be defined as:
“… a numerical value by which to rank competing models in terms of information loss in approximating the unknowable truth” (Symonds and Moussalli (2011))
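In formula form, for a model with \(K\) estimated parameters and maximized log-likelihood \(\ln(\hat{L})\):

\[AIC = -2\ln(\hat{L}) + 2K\]

The first term rewards goodness of fit, the second penalizes complexity; this is the correction for the number of parameters we come back to below.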
In R we can make use of the aictab() function from the AICcmodavg package. The input for this function is a list of estimated models. Let’s apply it:
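A sketch, building on the models estimated above (setting second.ord = FALSE reports plain AIC rather than the small-sample corrected AICc):

```r
library(AICcmodavg)

# Collect the candidate models in a list, together with readable names
Candidate_models <- list(Model_N, Model_T, Model_W)
Model_names <- c("Model_N", "Model_T", "Model_W")

# Rank the candidate models from best to worst
aictab(cand.set = Candidate_models, modnames = Model_names,
       second.ord = FALSE)
```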
The table ranks the models from best to worst approximations of “the truth”: the lower the AIC, the better the model. So in our case the model of W (effect of Distance_in_m and Last_minutes) is ranked as the best model. The second best model is the model of T (effect of Distance_in_m and the dummy variables Q2 - Q4 & Extra), a model that contains more predictors (see column K in our table, which prints the number of parameters including the intercept). The model of N is ranked as the poorest model.
The column Delta_AIC gives, for each model, the difference in AIC between that model and the best ranked model. So Model T’s AIC is 3.19 higher than Model W’s, and the model of N results in an AIC that is 14.88 higher than the best ranked model (Model W).
The column LL gives the log-likelihood of each model. Model T (the more complex model) has a higher LL than the model of W, which would favor the model of T. But as Model W is more parsimonious it gets a lower AIC (remember that AIC penalizes a model for the number of parameters it uses).
Another piece of information in our table is the Akaike weight (column AICwt). A simple interpretation of this weight is as follows. Model W, the best ranked model, has an AICwt of 0.83. This indicates that Model W has an 83% probability of being the best approximating model, given the set of candidate models. Model T has a lower probability of being the best approximating model: only 16.8% (AICwt = 0.168). The model of N has a very low probability of being the best approximating model (only 0.04%).
If we take all the information in our table into account, we could conclude that Model W and Model T are the two best models in our set, with Model W being the best approximating model. The evidence for a model stating that only the distance to the ring is an important indicator of shot success (Model N) is very low, so we can almost certainly conclude that distance isn’t all that matters.
But can we really reject this model? Are there any guidelines? How do we decide which models to use further in our analyses to interpret the results? Should we choose one single model and continue our inferences based on that single model? There is no single answer to these questions, but, as Symonds and Moussalli (2011) state:
“as a coarse guide, models with \(\Delta_i\) values less than 2 are considered to be essentially as good as the best model, and models with \(\Delta_i\) up to 6 should probably not be discounted (Richards (2005)). Above this, model rejection might be considered, and certainly models with \(\Delta_i\) values greater than 10 are sufficiently poorer than the best AIC model as to be considered implausible (K. Burnham and Anderson (2002)).”
In this quote \(\Delta_i\) refers to the column Delta_AIC in our table. Applying this coarse guide, we can conclude that we can only reject the model of N and that Model T should not be discounted.
Now that we know who’s the expert at the kitchen table, we still do not know how, for example, distance influenced the shot success of Kobe Bryant. It is great that we have two plausible models, but from which model do we need to extract our information to learn more about the influence of distance to the ring?
In the model-based inference approach this is tackled by taking into account the relevant information from the different models. This is done by averaging the relevant parameter estimates, after weighting them according to the Akaike weight of each model. Sounds a bit puzzling? In practice it is not that difficult. Here is the command we can use in R:
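A minimal sketch with the modavg() function from AICcmodavg, assuming the model list and names defined above; the argument parm names the coefficient we want to average (in formula form, the averaged estimate is \(\hat{\bar{\beta}} = \sum_{i} w_{i}\hat{\beta}_{i}\), with \(w_{i}\) the Akaike weight of model \(i\)):

```r
# Model-averaged estimate of the effect of distance, weighted by the
# Akaike weights of the candidate models (second.ord = FALSE for AIC)
modavg(cand.set = Candidate_models, parm = "Distance_in_m",
       modnames = Model_names, second.ord = FALSE)
```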
The column Estimate contains the estimate of the effect of Distance_in_m in each model in which this variable is included. In each model the estimate is similar (-0.14). The Model-averaged estimate is also -0.14. To calculate this averaged estimate, Model_W is given the highest weight (0.83) and Model_N the lowest (\(\approx 0.00\)). The 95% confidence interval is also a ‘weighted version’, indicating in our case that we are pretty sure that the distance to the ring had a negative impact on the probability of successfully throwing the ball through the ring, even for a top player like Kobe Bryant.
To get an idea of how shot success is influenced by distance, we can plot our estimated model, making use of the plot_model command from the sjPlot package.
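For instance, plotting the predicted probability of scoring against distance for the best ranked model could look like this (a sketch; Model_W and the column name Distance_in_m are taken from the code above):

```r
library(sjPlot)

# Predicted probability of scoring as a function of distance to the ring,
# based on the best ranked model (Model_W)
plot_model(Model_W, type = "pred", terms = "Distance_in_m")
```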
Seems that Kobe Bryant’s shots were quite accurate!
The typical null-hypothesis significance testing approach is under pressure. One of the upcoming alternatives is model-based inference. We hope you got a rough idea of how this approach can be applied. More in-depth reading is of course recommended; the references in this post can help. A great read as well is Chapter 6 in THE book you must read on statistics: McElreath (2018). This book is also nicely ‘recoded’ using modern R packages in Statistical Rethinking Recoded.
Have fun while leaRning!