Why use multilevel models?
http://elevanth.org/blog/2017/08/24/multilevel-regression-as-default/
What’s the algebra?
http://elevanth.org/blog/2017/09/07/metamorphosis-multilevel-model/
What diagnostics are there for making sure your model fit went alright?
http://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html
Multilevel models are about the partial pooling of information. They’re useful whenever data are organized into clusters or groups, such as women within districts or pupils within schools, the two examples we work through below.
library("rstan")
library("tidyverse")
library("bayesplot")
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
theme_set(theme_minimal())
knitr::opts_chunk$set(fig.align = "center")
We start with a simple binary outcome model with a link function:
\[y_i \sim \text{Bernoulli}(p_i)\] \[p_i = \text{logit}^{-1}(\alpha_{unit[i]})\]
What makes it multilevel is that those \(\alpha\) values come from a distribution:
\[\alpha_j \sim \text{Normal}(a, \tau)\]
We can put some fairly generic weakly informative priors on the parameters of that distribution:
\[a \sim \text{Normal}(0, 10)\] \[\tau \sim \text{Cauchy}^+(0, 1)\]
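To get a feel for what these priors imply, we can simulate hyperparameters from them and push the resulting intercepts through the inverse logit. This is a quick prior predictive sketch, not part of the model code; the half-Cauchy draw is taken as the absolute value of a Cauchy draw.

```r
# prior predictive sketch: simulate a, tau, alpha, then implied probabilities
set.seed(123)
n_sims <- 1000
a_sim   <- rnorm(n_sims, 0, 10)        # a ~ Normal(0, 10)
tau_sim <- abs(rcauchy(n_sims, 0, 1))  # tau ~ Cauchy+(0, 1)

# one varying intercept per simulation, drawn from Normal(a, tau)
alpha_sim <- rnorm(n_sims, a_sim, tau_sim)

# plogis() is the inverse logit in base R
p_sim <- plogis(alpha_sim)
summary(p_sim)
```

With a prior scale of 10 on \(a\), most of the implied probabilities pile up near 0 and 1, which is worth knowing before you fit anything.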
This model uses data from Richard McElreath’s rethinking package and implements the model from the second blog post above. Install the package if you need to.
# devtools::install_github("rmcelreath/rethinking")
data("bangladesh", package = "rethinking")
# subset the data
set.seed(123)
bangladesh <-
as_tibble(bangladesh) %>%
sample_n(100)
# look at the data
glimpse(bangladesh)
## Observations: 100
## Variables: 6
## $ woman <int> 557, 1524, 791, 1706, 1816, 88, 1019, 1720, ...
## $ district <int> 15, 46, 25, 52, 57, 1, 30, 52, 31, 27, 58, 2...
## $ use.contraception <int> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0,...
## $ living.children <int> 4, 1, 4, 4, 4, 4, 4, 2, 1, 3, 1, 1, 4, 3, 3,...
## $ age.centered <dbl> 14.4400, -12.5600, 4.4400, 3.4400, 6.4400, 0...
## $ urban <int> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
The data are on women’s contraceptive use in Bangladesh. The outcome use.contraception is binary, and each woman lives in a district. We won’t use the other variables.
We’re subsetting the data because the computational problem we want to demonstrate isn’t as obvious with the full data. Subsampling created a problem of its own, though: the districts that remain are no longer guaranteed to be numbered consecutively from 1 to J, which is what our Stan indexing will assume.
unique(bangladesh$district)
## [1] 15 46 25 52 57 1 30 31 27 58 40 33 5 53 14 18 41 37 61 39 43 34 7
## [24] 28 45 12 17 6 20 50 44 21 35 38 24 51 8 36 16 9
How can we fix that? One way is by playing with data types: converting the labels to a factor and back to integers yields consecutive codes from 1 to J:
unique(as.integer(as.factor(bangladesh$district)))
## [1] 9 33 16 36 38 1 19 20 17 39 28 21 2 37 8 12 29 25 40 27 30 22 4
## [24] 18 32 7 11 3 13 34 31 14 23 26 15 35 5 24 10 6
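We can convince ourselves that this trick always produces consecutive indices: sorting the unique recoded values should give exactly 1 through J, while the grouping structure is preserved. A small check on toy labels (the helper name recode_index is just for illustration):

```r
# recode arbitrary group labels to consecutive integers 1..J
recode_index <- function(x) as.integer(as.factor(x))

# toy example with gappy, unordered labels
labels <- c(15, 46, 25, 46, 15, 1)
idx <- recode_index(labels)

idx               # same grouping structure, new codes
sort(unique(idx)) # consecutive 1..J with no gaps
```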
With that in mind, let’s format the data for Stan:
d <- list(
y = bangladesh$use.contraception,
district = as.integer(as.factor(bangladesh$district))
)
# Indices
d$N <- length(d$y)
d$J <- length(unique(d$district))
Let’s have a closer look at what’s going on with all those indices:
data {
...
// number of districts
int<lower = 1> J;
// district index
int<lower = 1, upper = J> district[N];
...
}
...
transformed parameters {
vector<lower = 0, upper = 1>[N] p;
for (i in 1:N) {
p[i] = inv_logit(alpha[district[i]]);
}
}
...
Note: this isn’t the most efficient way to fit a logit model in Stan! There’s a shortcut function, bernoulli_logit, which applies the inverse-logit link for us and is more numerically stable:
model {
...
// likelihood
for (i in 1:N) {
y[i] ~ bernoulli_logit(alpha[district[i]]);
}
}
fit_cp <- stan("stan/multilevel_logit_cp.stan", data = d, chains = 2)
## Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
print(fit_cp, pars = "tau")
## Inference for Stan model: multilevel_logit_cp.
## 2 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## tau 1.22 0.1 0.58 0.32 0.81 1.15 1.52 2.65 33 1.07
##
## Samples were drawn using NUTS(diag_e) at Thu May 16 11:28:50 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
mcmc_trace(as.array(fit_cp), pars = "tau")
The low n_eff and the sticky trace for tau suggest that HMC is struggling with this centered parameterization. A standard fix is to switch to a non-centered parameterization. Rewrite the link part like this:
\[\text{logit}(p_i) = a + \tau \cdot z_{unit[i]}\]
Then the distribution for the random effects becomes this:
\[z_j \sim \text{Normal}(0, 1)\]
Because the \(z_j\) are standard normal regardless of \(\tau\), this can be easier to sample from using HMC. In Stan, we’d change the model to this:
...
parameters {
vector[J] z;
real a;
real<lower = 0> tau;
}
transformed parameters {
vector<lower = 0, upper = 1>[N] p;
for (i in 1:N) {
p[i] = inv_logit(a + tau * z[district[i]]);
}
}
model {
// priors
z ~ normal(0, 1);
...
}
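The two parameterizations describe the same distribution: if \(z_j \sim \text{Normal}(0, 1)\), then \(a + \tau z_j \sim \text{Normal}(a, \tau)\). A quick simulation check in R (the values of a and tau here are arbitrary, chosen only for illustration):

```r
set.seed(123)
a <- -0.5
tau <- 1.2

# non-centered draws: standard normals, then shift and scale
z <- rnorm(1e5)
alpha_ncp <- a + tau * z

# these should be close to a and tau, i.e. Normal(a, tau)
round(c(mean = mean(alpha_ncp), sd = sd(alpha_ncp)), 2)
```

Only the geometry the sampler sees changes, not the model itself.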
fit_ncp <- stan("stan/multilevel_logit_ncp.stan", data = d, chains = 2)
print(fit_ncp, pars = "tau")
## Inference for Stan model: multilevel_logit_ncp.
## 2 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## tau 1.14 0.03 0.54 0.18 0.78 1.1 1.45 2.37 388 1.01
##
## Samples were drawn using NUTS(diag_e) at Thu May 16 11:28:52 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
mcmc_trace(as.array(fit_ncp), pars = "tau")
nlschools is a data set of test scores of 8th-grade pupils from different schools in the Netherlands (see ?MASS::nlschools for more information).
data("nlschools", package = "MASS")
# sample a subset
set.seed(123)
nlschools <-
as_tibble(nlschools) %>%
sample_n(200)
# look at the data
glimpse(nlschools)
## Observations: 200
## Variables: 6
## $ lang <int> 48, 26, 46, 49, 45, 48, 47, 51, 54, 53, 54, 41, 33, 52, ...
## $ IQ <dbl> 12.0, 9.0, 11.0, 14.0, 11.0, 13.0, 13.0, 17.5, 12.5, 13....
## $ class <fct> 8080, 19980, 11282, 22480, 24080, 2180, 14880, 22780, 15...
## $ GS <int> 28, 37, 28, 13, 26, 19, 32, 30, 30, 24, 37, 24, 24, 17, ...
## $ SES <int> 40, 10, 33, 33, 30, 23, 23, 27, 40, 23, 50, 30, 25, 50, ...
## $ COMB <fct> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,...
We’ll model language test scores lang as a function of class size GS and family socio-economic status SES.
# format the data into a list for Stan
d_nls <- list()
# vector of scaled outcome
d_nls$y <- as.vector(scale(nlschools$lang))
# matrix of scaled covariates
d_nls$X <-
nlschools %>%
select(GS, SES) %>%
scale() %>%
as.matrix()
# numeric index from 1..J for the groups
d_nls$group <- as.integer(as.factor(as.integer(nlschools$class)))
All the indices!
d_nls$N <- length(d_nls$y)
d_nls$K <- ncol(d_nls$X)
d_nls$J <- length(unique(d_nls$group))
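Why the conversion chain as.integer(as.factor(as.integer(...))) for class? After subsetting, a factor keeps all of its original levels, so as.integer() on the subsetted factor can still return non-consecutive codes; going through an integer vector and re-factoring produces consecutive codes again. A small illustration on toy data (not from nlschools):

```r
# a factor subsetted from a larger data set keeps its unused levels
f <- factor(c("a", "b", "c", "d"))
sub <- f[c(2, 4)]                       # only "b" and "d" remain

as.integer(sub)                         # codes still refer to the old levels
as.integer(as.factor(as.integer(sub))) # consecutive 1..J again
```

droplevels() followed by as.integer() would work just as well here.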
Have a look at multilevel_regression_cp.stan and multilevel_regression_ncp.stan in the stan folder.
First, fit the centered version:
# fit the centered version of the regression
# check model summaries and trace plots for model coefficients
Then, fit the noncentered version:
# fit the noncentered version of the regression
# check model summaries and trace plots for model coefficients