Goal

  • Understand the concept and uses of multilevel models
  • Fit multilevel models in Stan and understand the syntax
  • Address computational challenges of MCMC by adjusting the models

Relevant readings

Why use multilevel models?

http://elevanth.org/blog/2017/08/24/multilevel-regression-as-default/

What’s the algebra?

http://elevanth.org/blog/2017/09/07/metamorphosis-multilevel-model/

What diagnostics are there for making sure your model fit well?

http://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html

Introduction

Multilevel models are about the partial pooling of information. They’re useful whenever data are organized into clusters or groups. For instance:

  • multiple observations per individual (in medicine, psychology, sports, panel data…)
  • classrooms, schools, districts in education data
  • geographic regions
  • related demographic categories, especially in political or opinion polling—multilevel regression and poststratification (MRP or “Mister P”) is quite common
  • studies or polls in meta-analyses

Setup

library("rstan")
library("tidyverse")
library("bayesplot")

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
theme_set(theme_minimal())
knitr::opts_chunk$set(fig.align = "center")

Model 1: Binary outcome, no covariates

We start with a simple binary outcome model with a link function:

\[y_i \sim \text{Bernoulli}(p_i)\] \[p_i = \text{logit}^{-1}(\alpha_{unit[i]})\]

What makes it multilevel is that those \(\alpha\) values come from a distribution:

\[\alpha_j \sim \text{Normal}(a, \tau)\]

We can put some fairly generic weakly informative priors on the parameters of that distribution:

\[a \sim \text{Normal}(0, 10)\] \[\tau \sim \text{Cauchy}^+(0, 1)\]
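Putting those pieces together, the centered model in Stan might look something like this. This is a sketch of what multilevel_logit_cp.stan (fit below) plausibly contains; the file in the course repo may differ in its details:

```stan
data {
  // number of observations
  int<lower = 1> N;
  // number of districts
  int<lower = 1> J;
  // binary outcome
  int<lower = 0, upper = 1> y[N];
  // district index
  int<lower = 1, upper = J> district[N];
}
parameters {
  vector[J] alpha;
  real a;
  real<lower = 0> tau;
}
transformed parameters {
  vector<lower = 0, upper = 1>[N] p;
  for (i in 1:N) {
    p[i] = inv_logit(alpha[district[i]]);
  }
}
model {
  // priors
  a ~ normal(0, 10);
  tau ~ cauchy(0, 1);  // half-Cauchy via the <lower = 0> constraint
  // the multilevel part: district intercepts share a distribution
  alpha ~ normal(a, tau);
  // likelihood
  y ~ bernoulli(p);
}
```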

The data: contraception in Bangladesh

This model uses data from Richard McElreath’s rethinking package, and implements the model from the second blog post above. Install the package if you need to.

# devtools::install_github("rmcelreath/rethinking")
data("bangladesh", package = "rethinking")

# subset the data
set.seed(123)
bangladesh <- 
  as_tibble(bangladesh) %>% 
  sample_n(100)

# look at the data
glimpse(bangladesh)
## Observations: 100
## Variables: 6
## $ woman             <int> 557, 1524, 791, 1706, 1816, 88, 1019, 1720, ...
## $ district          <int> 15, 46, 25, 52, 57, 1, 30, 52, 31, 27, 58, 2...
## $ use.contraception <int> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0,...
## $ living.children   <int> 4, 1, 4, 4, 4, 4, 4, 2, 1, 3, 1, 1, 4, 3, 3,...
## $ age.centered      <dbl> 14.4400, -12.5600, 4.4400, 3.4400, 6.4400, 0...
## $ urban             <int> 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...

The data are on women’s contraceptive use in Bangladesh. The outcome use.contraception is binary, and each woman lives in a district. We won’t use the other variables.

We’re subsetting the data because the computational problem we want to demonstrate isn’t so obvious with the full data.

Sampling created a problem of its own, though: the districts that remain are no longer numbered consecutively from 1 to J, which is what the Stan index variable will assume.

unique(bangladesh$district)
##  [1] 15 46 25 52 57  1 30 31 27 58 40 33  5 53 14 18 41 37 61 39 43 34  7
## [24] 28 45 12 17  6 20 50 44 21 35 38 24 51  8 36 16  9

How can we fix that? One way is by playing with data types:

unique(as.integer(as.factor(bangladesh$district)))
##  [1]  9 33 16 36 38  1 19 20 17 39 28 21  2 37  8 12 29 25 40 27 30 22  4
## [24] 18 32  7 11  3 13 34 31 14 23 26 15 35  5 24 10  6

With that in mind, let’s format the data for Stan:

d <- list(
  y = bangladesh$use.contraception,
  district = as.integer(as.factor(bangladesh$district))
)

# Indices
d$N <- length(d$y)
d$J <- length(unique(d$district))
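Before fitting, it’s worth a quick sanity check that the re-coded index really does run from 1 to J with no gaps. A minimal check in base R, using the list d built above:

```r
# district indices should cover 1..J exactly, and match the outcome length
stopifnot(setequal(d$district, seq_len(d$J)))
stopifnot(length(d$district) == d$N)
```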

Multilevel model syntax

Let’s have a closer look at what’s going on with all those indices:

data {
  ...
  // number of districts
  int<lower = 1> J;
  // district index
  int<lower = 1, upper = J> district[N];
  ...
}
...
transformed parameters {
  vector<lower = 0, upper = 1>[N] p;
  for (i in 1:N) {
    p[i] = inv_logit(alpha[district[i]]);
  }
}
...

Note: this isn’t the most efficient way to fit a logit model in Stan! The shortcut function bernoulli_logit combines the inverse-logit transform and the Bernoulli likelihood in one numerically stable step:

model {
  ...
  // likelihood
  for (i in 1:N) {
    y[i] ~ bernoulli_logit(alpha[district[i]]);
  }
}
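bernoulli_logit also vectorizes: Stan lets you index a vector with an integer array, so the loop can disappear entirely.

```stan
model {
  ...
  // vectorized likelihood: alpha[district] is a length-N vector
  y ~ bernoulli_logit(alpha[district]);
}
```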

The centered version of the model

fit_cp <- stan("stan/multilevel_logit_cp.stan", data = d, chains = 2)
## Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
## http://mc-stan.org/misc/warnings.html#bfmi-low
## Warning: Examine the pairs() plot to diagnose sampling problems
print(fit_cp, pars = "tau")
## Inference for Stan model: multilevel_logit_cp.
## 2 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=2000.
## 
##     mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
## tau 1.22     0.1 0.58 0.32 0.81 1.15 1.52  2.65    33 1.07
## 
## Samples were drawn using NUTS(diag_e) at Thu May 16 11:28:50 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
mcmc_trace(as.array(fit_cp), pars = "tau")

The noncentered version

Rewrite the link part like this:

\[\text{logit}(p_i) = a + \tau \cdot z_{unit[i]}\]

Then the distribution for the random effects becomes this:

\[z_j \sim \text{Normal}(0, 1)\]

This noncentered parameterization is often easier to sample with HMC, especially when the group-level standard deviation \(\tau\) is only weakly identified by the data. In Stan, we’d change the model to this:

...
parameters {
  vector[J] z;
  real a;
  real<lower = 0> tau;
}
transformed parameters {
  vector<lower = 0, upper = 1>[N] p;
  for (i in 1:N) {
    p[i] = inv_logit(a + tau * z[district[i]]);
  }
}
model {
  // priors
  z ~ normal(0, 1);
  ...
}
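Nothing in the noncentered model stores the intercepts on the original scale. If you want them, one option (a sketch, assuming the parameter names above) is to reconstruct \(\alpha_j = a + \tau z_j\) in a generated quantities block:

```stan
generated quantities {
  // district intercepts on the centered scale
  vector[J] alpha = a + tau * z;
}
```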
fit_ncp <- stan("stan/multilevel_logit_ncp.stan", data = d, chains = 2)
print(fit_ncp, pars = "tau")
## Inference for Stan model: multilevel_logit_ncp.
## 2 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=2000.
## 
##     mean se_mean   sd 2.5%  25% 50%  75% 97.5% n_eff Rhat
## tau 1.14    0.03 0.54 0.18 0.78 1.1 1.45  2.37   388 1.01
## 
## Samples were drawn using NUTS(diag_e) at Thu May 16 11:28:52 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
mcmc_trace(as.array(fit_ncp), pars = "tau")
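Trace plots only tell part of the story. A pairs plot of \(\log \tau\) against a group-level parameter will often show the funnel shape that trips up the centered sampler, while the noncentered version shouldn’t show it. A sketch with bayesplot — the parameter names alpha[1] and z[1] are assumptions about the two models above:

```r
# centered fit: look for a funnel between log(tau) and a district intercept
mcmc_pairs(as.array(fit_cp), pars = c("tau", "alpha[1]"),
           transformations = list(tau = "log"))

# noncentered fit: the standardized z's should decorrelate from tau
mcmc_pairs(as.array(fit_ncp), pars = c("tau", "z[1]"),
           transformations = list(tau = "log"))
```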

Model 2: Continuous outcome, with covariates

Data

nlschools is a data set of test scores of 8th-grade pupils from different schools in the Netherlands (see ?MASS::nlschools for more information).

data("nlschools", package = "MASS")

# sample a subset
set.seed(123)
nlschools <- 
  as_tibble(nlschools) %>% 
  sample_n(200)

# look at the data
glimpse(nlschools)
## Observations: 200
## Variables: 6
## $ lang  <int> 48, 26, 46, 49, 45, 48, 47, 51, 54, 53, 54, 41, 33, 52, ...
## $ IQ    <dbl> 12.0, 9.0, 11.0, 14.0, 11.0, 13.0, 13.0, 17.5, 12.5, 13....
## $ class <fct> 8080, 19980, 11282, 22480, 24080, 2180, 14880, 22780, 15...
## $ GS    <int> 28, 37, 28, 13, 26, 19, 32, 30, 30, 24, 37, 24, 24, 17, ...
## $ SES   <int> 40, 10, 33, 33, 30, 23, 23, 27, 40, 23, 50, 30, 25, 50, ...
## $ COMB  <fct> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,...

We’ll model language test scores lang as a function of class size GS and family socio-economic status SES.

# format the data into a list for Stan
d_nls <- list()

# vector of scaled outcome
d_nls$y <- as.vector(scale(nlschools$lang))

# matrix of scaled covariates
d_nls$X <- 
  nlschools %>%
  select(GS, SES) %>%
  scale() %>%
  as.matrix()

# numeric index from 1..J for the groups
d_nls$group <- as.integer(as.factor(as.integer(nlschools$class)))

All the indices!

d_nls$N <- length(d_nls$y)
d_nls$K <- ncol(d_nls$X)
d_nls$J <- length(unique(d_nls$group))

Fit the model

Have a look at multilevel_regression_cp.stan and multilevel_regression_ncp.stan in the stan folder.
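To check your reading of those files, here is a rough sketch of what the centered version plausibly contains. The actual file may differ, and the parameter names here are assumptions:

```stan
data {
  int<lower = 1> N;                   // observations
  int<lower = 1> K;                   // covariates
  int<lower = 1> J;                   // groups (classes)
  vector[N] y;                        // scaled outcome
  matrix[N, K] X;                     // scaled covariate matrix
  int<lower = 1, upper = J> group[N]; // group index
}
parameters {
  vector[J] alpha;        // group intercepts
  vector[K] beta;         // coefficients
  real a;                 // mean of the group intercepts
  real<lower = 0> tau;    // sd of the group intercepts
  real<lower = 0> sigma;  // residual sd
}
model {
  // priors
  a ~ normal(0, 10);
  tau ~ cauchy(0, 1);
  sigma ~ cauchy(0, 1);
  beta ~ normal(0, 1);
  // centered multilevel part
  alpha ~ normal(a, tau);
  // likelihood
  y ~ normal(alpha[group] + X * beta, sigma);
}
```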

First, fit the centered version:

# fit the centered version of the regression
fit_nls_cp <- stan("stan/multilevel_regression_cp.stan", data = d_nls, chains = 2)
print(fit_nls_cp, pars = "tau")
mcmc_trace(as.array(fit_nls_cp), pars = "tau")

Then, fit the noncentered version:

# fit the noncentered version of the regression
fit_nls_ncp <- stan("stan/multilevel_regression_ncp.stan", data = d_nls, chains = 2)
print(fit_nls_ncp, pars = "tau")
mcmc_trace(as.array(fit_nls_ncp), pars = "tau")