The past month or so, I’ve become increasingly intrigued by the tidymodels framework for doing modeling in R, especially after hearing an interview with Julia Silge on the Not So Standard Deviations podcast with Roger Peng and Hilary Parker.
I envision myself writing more posts on this framework and applications to case studies from linguistics, but since this is the first post, I’ll just share some thoughts I currently have regarding tidymodels.
Tidymodels: first thoughts
- Just like with the tidyverse, I love the modularity and the relatively clear sequence of steps that can guide you through “any” analysis, although we’re not quite there yet in terms of straightforward applicability.
- I love the tutorials and the “tidytuesday” series on YouTube. Those case studies really provide excellent showcases for how to do common things.
- I like the idea of making the interfaces to different models more uniform so you can work more efficiently.
Let’s talk metaphor for a second:
- The conceptual metaphor of DATA IS MONEY, expressed in phrasings like “You need to spend your data budget wisely”, is helpful. I generally don’t see studies in linguistics that first split the data into a training and a testing data set, although we probably should be doing that.
- The conceptual metaphor of PREPPING DATA IS FOLLOWING A RECIPE (the recipes package) is highly amusing and makes sense.
- The evocative names of the other packages in the core tidymodels set are good too:
rsample for resampling, parsnip for modeling (this is a parsnip; in Dutch it’s called pastinaak), workflows for defining workflows, tune for tuning, yardstick for measuring the performance of your models, and broom for cleaning the output of models…
What I don’t like so much about tidymodels is that, as of yet, it’s sometimes very hard to see what the tidymodels equivalents of some very common modeling operations are.
- For instance, I don’t really know how to do a Multiple Correspondence Analysis (MCA) within tidymodels, even though Principal Components Analysis (PCA) is very well represented in the tutorials and examples. I’ve tried some recipes::step_dummy() to get to results similar to what I would have gotten with FactoMineR or ca, but I find the results not similar enough to surrender my dimension reductions for qualitative variables completely to tidymodels. (Julia Silge and friends, if you are reading this, please figure this out for me.)
- I kind of hate that most analyses just stop at the “Oh I collected the metrics, time to turn off my computer” moment. I don’t think the analysis is finished after calculating a model’s performance; this interpretation step is often lacking, and I don’t know if that’s a general quirk of data scientists or just specific to the showcases of these packages, but I wish it didn’t stop there.
- Finally, we come to the topic of this post: how do I decide whether or not to keep interaction effects? In “normal” modeling, this is quite straightforward:
Step 1. Make a model with multiple predictors, no interaction.
Step 2. Make another model with the supposed interaction.
Step 3. anova(model1, model2).
If the p value is significant, the more complex model (the one with the interaction) is what you want (a sketch follows below).
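With made-up names (dat, y, x1, x2), that sketch could look like this; the concrete version with the iris data follows later in this post:
model1 <- lm(y ~ x1 + x2, data = dat) # no interaction
model2 <- lm(y ~ x1 * x2, data = dat) # with interaction
anova(model1, model2) # a significant p value favours the more complex model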
But what about the tidymodels approach to anovas?
Googling “tidymodels anova” will bring you to a lot of pages. Unfortunately, they are not super useful in answering our question:
RQ. How do I do a simple comparison of models to decide if the interaction is valid?
This page takes you to a broom function; this one to the excellent Tidy Modeling with R book, more specifically the chapter on comparing models with resampling. And I agree, resampling seems like a very good way to compare models, and I’ve tried to apply that page to the case study I will present below (nothing too crazy, just the iris dataset), but if I follow the guides there, I end up with no difference between the model with and without interactions, which is not what our anova will say. So yeah. 🤷️
Let’s load packages
library(tidymodels) # general tidymodels packages
library(skimr) # fancy way of inspecting data, not necessary
library(ggplot2)
Let’s look at the data
iris %>% head()
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
skimr::skim(iris)
Name | iris |
Number of rows | 150 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Sepal.Length | 0 | 1 | 5.84 | 0.83 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 | ▆▇▇▅▂ |
Sepal.Width | 0 | 1 | 3.06 | 0.44 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 | ▁▆▇▂▁ |
Petal.Length | 0 | 1 | 3.76 | 1.77 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 | ▇▁▆▇▂ |
Petal.Width | 0 | 1 | 1.20 | 0.76 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 | ▇▁▇▅▃ |
What I will be investigating is whether Petal.Length can be predicted by Petal.Width and Species. But we also want to know if there is an interaction between Petal.Width and Species. In other words, our main model will look something like this:
Petal.Length ~ Petal.Width + (or *) Species
Let’s plot the data:
iris %>%
  ggplot(aes(Petal.Width, Petal.Length)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = lm, color = "orange",
              formula = 'y~x') +
  geom_smooth(method = lm, aes(color = Species),
              formula = 'y~x') +
  theme_minimal()
We can see three nice groups of Species. Our general linear model smooth (orange) doesn’t seem too bad, but it doesn’t take into account that there may be an interaction between the Species and the Petal.Width, as evidenced by the three different slopes per Species.
The “normal” way of testing this
The code will be quite short:
mod1 <- lm(Petal.Length ~ Petal.Width * Species, data = iris)
# summary(mod1)
mod2 <- lm(Petal.Length ~ Petal.Width + Species, data = iris)
# summary(mod2)

anova(mod1, mod2) %>% tidy()
# A tibble: 2 × 7
term df.residual rss df sumsq statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Petal.Length ~ Petal.Width *… 144 18.8 NA NA NA NA
2 Petal.Length ~ Petal.Width +… 146 20.8 -2 -2.02 7.72 6.53e-4
We see a very small p value (0.00065), so we know that there is a difference between mod1 (with interaction) and mod2 (without interaction), and we have to choose the one with the interaction.
Done! Or not?
Tidymodels
Tidymodels is a lot more verbose than just these three lines of code, but that’s actually a good thing: you are more conscious of the steps involved, more explicit, and you can easily add more models. However, here it will feel a bit redundant, but bear with me.
Data budget: rsample
set.seed(1234)
iris_split <- initial_split(iris, strata = Species, prop = 0.8)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)
This is kind of the big difference compared to the normal modeling. We split the data (and you can even make folds, but that’s not for today) so that we can gauge the model’s effectiveness later on, before reporting the model itself.
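The folds mentioned above would also come from rsample; a quick sketch with a made-up object name (iris_folds), not used further in this post:
set.seed(1234)
iris_folds <- vfold_cv(iris_train, v = 5, strata = Species) # 5-fold CV within the training set
iris_folds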
Preprocessing: recipes
# the model without interaction
rec_normal <-
  recipe(Petal.Length ~ Petal.Width + Species,
         data = iris_train) %>% # we train on the training set
  step_dummy(all_nominal_predictors()) %>%
  step_center(all_numeric_predictors())

# the model with interaction
rec_interaction <-
  rec_normal %>%
  step_interact(~ Petal.Width:starts_with("Species"))
Notice that we manually create dummy variables (step_dummy()) for all categorical predictors. In this case that’s just Species. We also center the numeric predictors because that’s generally a good idea. For the second model, we just have to add one extra step: declaring the interactions. You can check the results of a recipe with the prep() function, followed by bake(new_data = NULL), to see it in action.
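A small sketch of that check, using the interaction recipe defined above:
rec_interaction %>%
  prep() %>% # estimate the preprocessing steps on the training data
  bake(new_data = NULL) %>% # return the preprocessed training data
  head()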
Model selection: parsnip
iris_model <-
  linear_reg() %>%
  set_engine("lm") %>% # if you want different engines, this is where you would do that
  set_mode("regression")
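As a sketch of what swapping engines could look like (not used below; "glm" is one of the alternative engines parsnip offers for linear_reg(), and iris_model_glm is just a name I made up for this sketch):
iris_model_glm <-
  linear_reg() %>%
  set_engine("glm") %>% # same model specification, different engine
  set_mode("regression")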
Workflows: workflows
# normal workflow
iris_wf <-
  workflow() %>%
  add_model(iris_model) %>%
  add_recipe(rec_normal)

# interaction workflow
iris_wf_interaction <-
  iris_wf %>%
  update_recipe(rec_interaction)
Once again, we can easily recycle workflows. In workflows we bring together the recipe we made for preprocessing and the model we selected for the analysis. Note that we haven’t run anything yet.
Fitting
Here I’m making use of last_fit() on the split object. This makes sure the model is trained on the training dataset and evaluated on the test dataset. But you can of course also fit() on the training or test set separately.
iris_normal_lf <-
  last_fit(iris_wf,
           split = iris_split)

iris_inter_lf <-
  last_fit(iris_wf_interaction,
           split = iris_split)
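For completeness, the fit()-on-the-training-data alternative mentioned above would look something like this (a sketch; iris_fit_train is a made-up name and isn’t used below):
iris_fit_train <- fit(iris_wf_interaction, data = iris_train)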
How to anova?
This is where I was stuck for the longest time. The answer is actually surprisingly simple: we just use the normal anova() function, but we need to extract the linear model first. We can do that with extract_fit_engine().
normalmodel <- iris_normal_lf %>% extract_fit_engine()
intermodel <- iris_inter_lf %>% extract_fit_engine()

anova(normalmodel, intermodel) %>% tidy()
# A tibble: 2 × 7
term df.residual rss df sumsq statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ..y ~ Petal.Width + Species_… 116 17.8 NA NA NA NA
2 ..y ~ Petal.Width + Species_… 114 16.1 2 1.72 6.10 0.00304
Bam! Once again, p is significant.
But, why go through all this trouble? Keep reading to get metrics and reasons.
Get metrics: yardstick
Now that we know that the interaction model is the better one, we can also quickly get some metrics for that model. The normal model is now irrelevant.
iris_inter_lf %>% collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 0.318 Preprocessor1_Model1
2 rsq standard 0.968 Preprocessor1_Model1
Could we have found this with “normal” modeling? I guess so, but now we have also already tested it against “new data”, i.e., the test data set we set aside in the beginning. So we know the model can predict reasonably well and does not overfit. There is no data leakage.
mod2 %>% glance()
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.955 0.954 0.378 1036. 3.70e-98 3 -64.8 140. 155.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
What’s in my model?
This is the step that most tutorials seem to neglect, because it’s not really in tidymodels; it’s general data analysis. But what we usually report is not the root mean square error (rmse; okay, this just looks like a bunch of nouns strung together, but see here) or the R² (rsq), but the whole model.
intermodel
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) Petal.Width
3.7894 1.0309
Species_versicolor Species_virginica
1.8314 3.0006
Petal.Width_x_Species_versicolor Petal.Width_x_Species_virginica
1.0536 -0.2426
intermodel %>% tidy()
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 3.79 0.197 19.3 1.11e-37
2 Petal.Width 1.03 0.230 4.49 1.73e- 5
3 Species_versicolor 1.83 0.557 3.29 1.34e- 3
4 Species_virginica 3.00 0.585 5.13 1.22e- 6
5 Petal.Width_x_Species_versicolor 1.05 0.652 1.62 1.09e- 1
6 Petal.Width_x_Species_virginica -0.243 0.621 -0.391 6.97e- 1
intermodel %>% glance()
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.958 0.956 0.375 516. 1.55e-76 5 -49.6 113. 133.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
And finally, we can plot the model’s predictions against the observed values. If we draw an R² plot we can see that it fits pretty well.
iris_inter_lf %>%
  collect_predictions() %>%
ggplot(aes(.pred, Petal.Length)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "orange") +
labs(x = "Predicted Petal.Length",
y = "Observed Petal.Length",
title = "R² plot") +
theme_minimal()
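The same R² can also be computed directly from the collected predictions with yardstick; a quick sketch:
iris_inter_lf %>%
  collect_predictions() %>%
  rsq(truth = Petal.Length, estimate = .pred)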
So, tidymodels is definitely a longer style of analysis but you can get much more out of your data. And isn’t that what we ultimately want? We have:
- shown that the interaction is indeed there: the petal width can predict the petal length, but its effect differs by species
- made a model that is protected against overfitting
- tested said model against data that was set aside at the start (so we know it’s more robust)
- observed the fit of the model (R²)
Of course, these are “just plants” (sorry, I’m biased: I love languages, and I couldn’t get my mint to sprout, so I may still be vengeful about that). But now there’s a tutorial on how to do a simple anova within a tidymodels framework for deciding whether or not to keep an interaction.
Disclaimer on the iris data set
In recent years, it has become more public knowledge that the ubiquitous iris
data set was first published in the Annals of Eugenics in 1936 by Ronald Fisher. As this tweet and this post point out, it’s perhaps not the best thing that this data set is so readily used in data examples. Some proposals for other data sets can be found here.
Obviously, eugenics is bad – think of how important this issue was to the projected future in Star Trek with Khan and friends – and I agree that iris
is kind of boring. But, just like the Annals of Eugenics rebranded itself as the Annals of Human Genetics, distancing itself from that terrible phase in science (and we still see the beast rear its head once in a while), I can’t help but think that this silly description of flowers is quite innocent. Perhaps death of the scholar does exist? After all, if we have to throw away iris
because of Fisher’s bad personal views (once again, not good), do we also have to throw out the stats techniques he developed? I know that I’ve used the Fisher-Yates Exact Test to calculate mutual attraction.
And let’s also talk about Karl Pearson, who was the first editor of the Annals of Eugenics. Reading up on his wiki bio was not pleasant either. So should we throw out the Pearson correlation? Or even worse: the p-value (which was first formally introduced with Pearson’s chi-squared test)? Gone with Principal Components Analysis! Histograms? History you mean!
The point is that it is necessary to treat the data and the work as separate from the scientists’ personal lives. That means that I agree with the efforts to rename buildings that were named after Fisher or Pearson at UCL, but that at the same time we should still be okay with using iris
or statistical techniques developed by these people. The best two arguments for not using iris
are that it’s boring and that there exists a penguins
dataset (found here).