Goals

  • Understand how to write and fit generalized linear models for count data
  • Compare two models graphically against different aspects of the original data
  • Introduce measures of model fit for model comparison

References

This lab is adapted from two Stan-related vignettes:

See also Kruschke Ch 15 (overview of GLMs) and Ch 24 (count outcomes)

Kruschke doesn’t have much on information criteria, but Ch 6 of McElreath is a good introduction. His lecture 8 on model comparison here is also good: https://github.com/rmcelreath/statrethinking_winter2019

Setup

We’ll use all of the packages from last week, plus a new one: loo.

library("rstan")
library("tidyverse")
library("bayesplot")
library("loo")

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
theme_set(theme_minimal())
knitr::opts_chunk$set(fig.align = "center")

set.seed(123)

Data

The data set we’ll use is a particularly charming one on pest treatment for roaches in urban apartments, from Gelman & Hill (2007).

# the data are also in the data/ folder if you can't load from the package
data("roaches", package = "rstanarm")
roaches <- as_tibble(roaches)
glimpse(roaches)
## Observations: 262
## Variables: 5
## $ y         <int> 153, 127, 7, 7, 0, 0, 73, 24, 2, 2, 0, 21, 0, 179, 1...
## $ roach1    <dbl> 308.00, 331.25, 1.67, 3.00, 2.00, 0.00, 70.00, 64.56...
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ senior    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ exposure2 <dbl> 0.8000000, 0.6000000, 1.0000000, 1.0000000, 1.142857...

Set up the data for Stan. A couple of things to note:

  • In one vignette, they choose to scale the pretreatment roach count roach1 by dividing by 100. I’ve followed that approach so our coefficients match.
  • treatment and senior are binary indicators, so we don’t have to scale them. (There’s debate on this, though!)
  • offset is the log of the amount of time each trap was laid out for. It’s like a variable with a fixed coefficient of 1.

Here’s the math for that:

log(mu/exposure) = a + xb
log(mu) - log(exposure) = a + xb
log(mu) = a + xb + log(exposure)
mu = exp(a + xb + log(exposure))

covariates <-
  roaches %>%
  mutate(roach1 = roach1/100) %>% 
  # mutate(roach1 = scale(roach1)[, 1]) %>% # this is an option too!
  dplyr::select(roach1, treatment, senior) 

d <- list(
  y = roaches$y,
  X = as.matrix(covariates),
  offset = log(roaches$exposure2)
)

d$N <- length(d$y)
d$K <- ncol(d$X)

Models

GLMs use link functions to connect non-continuous outcomes to linear predictors. We saw one last week, a hierarchical logit model for binary outcomes. This week, we’ll consider (unbounded) count outcomes: 0, 1, 2, 3 …

Poisson

Here’s a Poisson model for count data:

y ~ Poisson(lambda)  // likelihood
lambda = exp(a + xb + log(exposure))  // link
b ~ Normal(0, 2.5)  // priors
a ~ Normal(0, 10)

Let’s look at the Stan implementation of this model, fit it, and examine the results.
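
Here’s a minimal sketch of what the Stan file could look like, in case you don’t have the course files handy. (This is my reconstruction rather than the exact contents of stan/poisson.stan, but the variable names match the data list d above.)

data {
  int<lower=1> N;          // number of observations
  int<lower=1> K;          // number of covariates
  int<lower=0> y[N];       // roach counts
  matrix[N, K] X;          // covariate matrix
  vector[N] offset;        // log exposure time (d$offset)
}
parameters {
  real a;
  vector[K] b;
}
transformed parameters {
  vector[N] lambda = exp(a + X * b + offset);
}
model {
  a ~ normal(0, 10);
  b ~ normal(0, 2.5);
  y ~ poisson(lambda);
}
generated quantities {
  int y_rep[N];
  vector[N] log_lik;
  for (i in 1:N) {
    y_rep[i] = poisson_rng(lambda[i]);
    log_lik[i] = poisson_lpmf(y[i] | lambda[i]);
  }
}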

fit_poisson <- stan("stan/poisson.stan", data = d)
print(fit_poisson, pars = c("a", "b", "lp__"))
## Inference for Stan model: poisson.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##          mean se_mean   sd     2.5%      25%      50%      75%    97.5%
## a        3.09    0.00 0.02     3.05     3.08     3.09     3.10     3.13
## b[1]     0.70    0.00 0.01     0.68     0.69     0.70     0.70     0.72
## b[2]    -0.52    0.00 0.02    -0.57    -0.53    -0.52    -0.50    -0.47
## b[3]    -0.38    0.00 0.03    -0.45    -0.40    -0.38    -0.36    -0.31
## lp__ 17603.81    0.04 1.48 17599.98 17603.10 17604.17 17604.89 17605.60
##      n_eff Rhat
## a     2032    1
## b[1]  2474    1
## b[2]  2854    1
## b[3]  2983    1
## lp__  1510    1
## 
## Samples were drawn using NUTS(diag_e) at Thu May 23 12:59:35 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Negative binomial

The variance of a Poisson model is equal to its mean, lambda. The negative binomial model lets us relax this assumption and allow for overdispersion.

There are two main ways of parameterizing the negative binomial.

First, with a shape and an inverse scale parameter:

y ~ NegBinomial(alpha, beta)

Second, with a mean/location and an overdispersion parameter:

y ~ NegBinomial(mu, phi)

In this case, Var(y) = mu + mu^2/phi.

The second parameterization (neg_binomial_2 in Stan) is more useful because it’s more clearly linked to the Poisson distribution, and we can link the location parameter to covariates.

What prior should we put on phi? As phi -> infinity, the negative binomial approaches the Poisson. To encourage the model to “shrink” toward the Poisson, we put the prior on the reciprocal rather than on phi itself: an Exponential(1) prior concentrates mass near 0, and a reciprocal near 0 corresponds to a large phi.

1/sqrt(phi) ~ Exponential(1)

(It turns out to be even better to use the square root of the reciprocal, as in the line above.)
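
To see what this implies about phi itself, here’s a quick prior simulation in R (just a sketch of the Exponential(1) prior above):

# draw from the prior on 1/sqrt(phi) and transform to phi
inv_sqrt_phi <- rexp(1e4, rate = 1)
phi_draws <- 1 / inv_sqrt_phi^2
quantile(phi_draws, probs = c(0.1, 0.5, 0.9))
# the implied prior on phi has a heavy right tail (large phi = near-Poisson),
# while still leaving plenty of mass on small phi (strong overdispersion)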

More on the prior for the over-dispersion parameter:

https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations#story-when-the-generic-prior-fails-the-case-of-the-negative-binomial

https://statmodeling.stat.columbia.edu/2018/04/03/justify-my-love/ (Note: what Dan Simpson calls phi here is what we’re calling 1/phi!)

Now, let’s look at the Stan model, fit it, and examine the results.
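
As before, here’s a rough sketch of what the Stan file might contain (again my reconstruction; stan/neg_binomial.stan may differ in details, and the priors on a and b are assumptions carried over from the Poisson model):

data {
  int<lower=1> N;
  int<lower=1> K;
  int<lower=0> y[N];
  matrix[N, K] X;
  vector[N] offset;                 // log exposure time (d$offset)
}
parameters {
  real a;
  vector[K] b;
  real<lower=0> inv_sqrt_phi;       // 1/sqrt(phi)
}
transformed parameters {
  vector[N] mu = exp(a + X * b + offset);
  real phi = 1 / square(inv_sqrt_phi);
}
model {
  a ~ normal(0, 10);                // assumed, matching the Poisson model
  b ~ normal(0, 2.5);
  inv_sqrt_phi ~ exponential(1);    // the prior discussed above
  y ~ neg_binomial_2(mu, phi);
}
generated quantities {
  int y_rep[N];
  vector[N] log_lik;
  for (i in 1:N) {
    y_rep[i] = neg_binomial_2_rng(mu[i], phi);
    log_lik[i] = neg_binomial_2_lpmf(y[i] | mu[i], phi);
  }
}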

fit_neg_binomial <- stan("stan/neg_binomial.stan", data = d)
print(fit_neg_binomial, pars = c("a", "b", "phi"))
## Inference for Stan model: neg_binomial.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##       mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## a     2.85       0 0.23  2.41  2.69  2.83  3.00  3.33  2765    1
## b[1]  1.31       0 0.25  0.86  1.14  1.30  1.48  1.85  3187    1
## b[2] -0.77       0 0.25 -1.28 -0.94 -0.77 -0.60 -0.28  3066    1
## b[3] -0.32       0 0.26 -0.83 -0.50 -0.32 -0.14  0.20  3706    1
## phi   0.27       0 0.03  0.22  0.25  0.27  0.29  0.33  3413    1
## 
## Samples were drawn using NUTS(diag_e) at Thu May 23 12:59:43 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

Posterior predictive checks

See this vignette for extracting information from a stanfit object:

http://mc-stan.org/rstan/articles/stanfit_objects.html

Densities

yrep_pois <- rstan::extract(fit_poisson, pars = "y_rep")$y_rep

# yrep_pois_alt <- as.matrix(fit_poisson, pars = "y_rep")
# extract() permutes the order of the draws, 
# so these two matrices aren't in the same order

ppc_dens_overlay(y = d$y, yrep = yrep_pois[1:50, ]) + xlim(0, 100)
## Warning: Removed 323 rows containing non-finite values (stat_density).
## Warning: Removed 23 rows containing non-finite values (stat_density).

# changing xlim to ignore the long tail
yrep_nb <- rstan::extract(fit_neg_binomial, pars = "y_rep")$y_rep

ppc_dens_overlay(y = d$y, yrep = yrep_nb[1:50, ]) + xlim(0, 100)
## Warning: Removed 844 rows containing non-finite values (stat_density).
## Warning: Removed 23 rows containing non-finite values (stat_density).

# changing xlim to ignore the VERY long tail

Stats

ppc_stat and ppc_stat_2d compute a summary statistic for each simulated dataset in yrep and compare the distribution of those statistics to the same statistic computed from the observed data.

ppc_stat(d$y, yrep_pois, stat = "mean")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ppc_stat(d$y, yrep_nb, stat = "mean")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Other statistics to try out (one of them is sketched below):

  • standard deviation or variance
  • proportion of zeros
  • max
# try some out here!
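
For example, here’s a sketch of checking the proportion of zeros; ppc_stat’s stat argument accepts a function (or the name of one) as well as built-ins:

prop_zero <- function(x) mean(x == 0)
ppc_stat(d$y, yrep_pois, stat = "prop_zero")
ppc_stat(d$y, yrep_nb, stat = "prop_zero")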

Model comparison

We’ll introduce two formal ways of comparing models. The basic idea is that while more complex models capture more information about the data and avoid underfitting, we want to penalize model complexity to avoid overfitting. To do the latter, we come up with some way of approximating out-of-sample fit on the data we have.

Like with the PPCs, model comparison involves adding something to the generated quantities block. Here’s what we need:

generated quantities {
  // log-likelihood posterior
  vector[N] log_lik;
  for (i in 1:N) {
    log_lik[i] = poisson_lpmf(y[i] | lambda[i]);
  }
}

(In the Stan file, we use the same for-loop for both the y_rep draws and the log-likelihood values.)

lpmf stands for log probability mass function. It calculates the pointwise log-likelihood—for each data point, for each MCMC draw.

If it’s a continuous distribution, like the normal distribution, the suffix is lpdf (log probability density function) instead.

https://mc-stan.org/loo/articles/loo2-with-rstan.html

(rng, by the way, stands for random number generator.)

We’ll start our comparisons by pulling out the pointwise log likelihoods from each model. loo has a convenience function for this, but it’s basically the same as rstan::extract.

ll_poisson <- extract_log_lik(fit_poisson, merge_chains = FALSE)
ll_neg_binomial <- extract_log_lik(fit_neg_binomial, merge_chains = FALSE)

Information criteria: WAIC

You might be familiar with information criteria like AIC and BIC. There’s also DIC, which is like AIC but incorporates priors. Most of these are approximations of the out-of-sample deviance that penalize the number of parameters (~ model complexity) somehow. (Deviance is -2 * the log likelihood. It’s a sum, not an average.)

The most useful IC for us is called WAIC (which stands for widely-applicable information criterion or Watanabe-Akaike information criterion). It has two components:

  • the log pointwise predictive density, lppd. For each observation y_i, average the likelihood over the posterior draws, take the log, and sum over observations: lppd = sum_i log(mean_draws Pr(y_i)).
  • the effective number of parameters, p_waic. This is less intuitive: for each observation y_i, compute the variance of its log likelihood across draws, then sum those variances over observations.

WAIC is -2 * (lppd - p_waic).
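
To make the formula concrete, here’s a rough by-hand version for the negative binomial model, using a merged-chains log-likelihood matrix (draws in rows, observations in columns). The waic() function below is what you should actually use; this is just to show where the numbers come from:

ll <- extract_log_lik(fit_neg_binomial)  # merge_chains = TRUE by default
lppd <- sum(log(colMeans(exp(ll))))      # log pointwise predictive density
p_waic <- sum(apply(ll, 2, var))         # effective number of parameters
-2 * (lppd - p_waic)                     # should be close to waic() below
# note: loo's waic() uses a numerically stable log-sum-exp; the naive exp()
# here can underflow when the fit is very poor (as for the Poisson model)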

waic_poisson <- waic(ll_poisson)
## Warning: 50 (19.1%) p_waic estimates greater than 0.4. We recommend trying
## loo instead.
waic_poisson
## 
## Computed from 4000 by 262 log-likelihood matrix
## 
##           Estimate     SE
## elpd_waic  -6304.7  743.7
## p_waic       354.3  113.2
## waic       12609.4 1487.4
## Warning: 50 (19.1%) p_waic estimates greater than 0.4. We recommend trying
## loo instead.
waic_neg_binomial <- waic(ll_neg_binomial)
## Warning: 2 (0.8%) p_waic estimates greater than 0.4. We recommend trying
## loo instead.
waic_neg_binomial
## 
## Computed from 4000 by 262 log-likelihood matrix
## 
##           Estimate   SE
## elpd_waic   -895.5 37.6
## p_waic         6.4  2.3
## waic        1790.9 75.3
## Warning: 2 (0.8%) p_waic estimates greater than 0.4. We recommend trying
## loo instead.

There’s a function for comparing model fits, which also computes the standard error of the difference.

loo_compare(waic_poisson, waic_neg_binomial)
##        elpd_diff se_diff
## model2     0.0       0.0
## model1 -5409.3     725.2
# want the models to have names when you print them out? use a named list()
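
Following that comment, a sketch (the names here are just labels we choose):

loo_compare(list(poisson = waic_poisson, neg_binomial = waic_neg_binomial))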

The comparison indicates that model 2 (the negative binomial model) is better.

You can compare as many models this way as you’d like, so long as they’re predicting the same data.

LOO-CV

A second approach—and a better one, if you can get it to work—is cross-validation. For many kinds of data, leave-one-out cross-validation is the ideal.

Actually refitting a model N times, leaving out one observation each time, would be … very inefficient. Instead, there’s a method called PSIS, Pareto smoothed importance sampling, which can be used to estimate what you’d get from LOO-CV. PSIS-LOO is what the loo package uses.

Since this is an estimate, we need to look at the diagnostics (the estimated Pareto shape parameters, called k) to see whether we think it worked.

loo has methods for stanfit objects or for log-likelihoods extracted from those objects.

loo_poisson <- loo(fit_poisson)
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
## Warning in log(z): NaNs produced

# the numbers are almost identical using the log-likelihood method
# using r_eff to account for the fact that these are MCMC draws
# keeps the diagnostics from being overconfident
# loo_poisson <- loo(ll_poisson, r_eff = relative_eff(ll_poisson))

loo_poisson
## 
## Computed from 4000 by 262 log-likelihood matrix
## 
##          Estimate     SE
## elpd_loo  -6258.0  730.3
## p_loo       307.5   80.0
## looic     12515.9 1460.6
## ------
## Monte Carlo SE of elpd_loo is NA.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     239   91.2%   327       
##  (0.5, 0.7]   (ok)         7    2.7%   90        
##    (0.7, 1]   (bad)        5    1.9%   25        
##    (1, Inf)   (very bad)  11    4.2%   1         
## See help('pareto-k-diagnostic') for details.
plot(loo_poisson)

loo_neg_binomial <- loo(fit_neg_binomial)
## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
loo_neg_binomial
## 
## Computed from 4000 by 262 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo   -895.8 37.7
## p_loo         6.8  2.6
## looic      1791.7 75.4
## ------
## Monte Carlo SE of elpd_loo is NA.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     260   99.2%   2446      
##  (0.5, 0.7]   (ok)         1    0.4%   241       
##    (0.7, 1]   (bad)        1    0.4%   60        
##    (1, Inf)   (very bad)   0    0.0%   <NA>      
## See help('pareto-k-diagnostic') for details.
plot(loo_neg_binomial)

What to do with those diagnostics? One possibility is to refit the model for each observation with a high k-value (above some threshold, e.g. k > 0.7) and calculate the elpd for those observations directly—rather than trying to estimate it. rstanarm can actually do this semi-automatically!
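
loo has helpers for identifying those observations. For example (a sketch; the threshold is the k > 0.7 rule of thumb mentioned above):

# indices of observations with estimated Pareto k above 0.7
high_k <- pareto_k_ids(loo_poisson, threshold = 0.7)
length(high_k)
roaches[high_k, ]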

Another option is to fall back to k-fold cross-validation: divide the data into e.g. 5 or 10 “folds” and perform cross-validation on each. loo has some helper functions for this.
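
For instance, loo’s kfold_split_random assigns observations to folds; you would then refit the Stan model K times, each time evaluating the log-likelihood of the held-out fold (a sketch of the first step only):

folds <- kfold_split_random(K = 10, N = d$N)
table(folds)  # roughly equal fold sizes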

Finally, we can compare the elpd_loo values as follows:

loo_compare(loo_poisson, loo_neg_binomial)
##        elpd_diff se_diff
## model2     0.0       0.0
## model1 -5362.1     711.6

Again, the comparison indicates that the negative binomial model is better. The fact that so many k-values are bad for the Poisson model is a strong signal that the model is poorly specified.

Note: loo also has a method for model stacking or averaging. This strikes me as experimental for now, but it could become important and widespread in the future.

Appendix: what would JAGS do?

For model comparisons, it seems like DIC is the most common criterion in JAGS.

Another approach JAGS takes is to nest all the models you want to compare into the same JAGS model, and use a categorical parameter to index them (M = 1 for model 1, M = 2 for model 2, etc). By estimating that categorical parameter, you get the relative probabilities for each model. Stan can’t do this, because HMC can’t handle latent discrete parameters.

sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2     loo_2.1.0          bayesplot_1.6.0   
##  [4] forcats_0.3.0      stringr_1.3.1      dplyr_0.7.5       
##  [7] purrr_0.2.5        readr_1.1.1        tidyr_0.8.1       
## [10] tibble_1.4.2       tidyverse_1.2.1    rstan_2.18.2      
## [13] StanHeaders_2.18.1 ggplot2_3.1.0     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0         lubridate_1.7.4    lattice_0.20-38   
##  [4] prettyunits_1.0.2  assertthat_0.2.0   rprojroot_1.3-2   
##  [7] digest_0.6.18      psych_1.8.4        R6_2.3.0          
## [10] cellranger_1.1.0   plyr_1.8.4         ggridges_0.5.0    
## [13] backports_1.1.2    stats4_3.5.3       evaluate_0.10.1   
## [16] httr_1.3.1         pillar_1.2.3       rlang_0.3.0.1     
## [19] lazyeval_0.2.1     readxl_1.1.0       rstudioapi_0.7    
## [22] callr_2.0.4        rmarkdown_1.10     labeling_0.3      
## [25] foreign_0.8-71     munsell_0.5.0      broom_0.4.4       
## [28] compiler_3.5.3     modelr_0.1.2       pkgconfig_2.0.1   
## [31] mnormt_1.5-5       pkgbuild_1.0.3     htmltools_0.3.6   
## [34] tidyselect_0.2.4   gridExtra_2.3      codetools_0.2-16  
## [37] matrixStats_0.53.1 crayon_1.3.4       withr_2.1.2       
## [40] grid_3.5.3         nlme_3.1-137       jsonlite_1.6      
## [43] gtable_0.2.0       magrittr_1.5       scales_1.0.0      
## [46] cli_1.0.0          stringi_1.2.3      reshape2_1.4.3    
## [49] xml2_1.2.0         tools_3.5.3        glue_1.3.1        
## [52] hms_0.4.2          processx_3.1.0     parallel_3.5.3    
## [55] yaml_2.1.19        inline_0.3.15      colorspace_1.4-0  
## [58] rvest_0.3.2        knitr_1.20         bindr_0.1.1       
## [61] haven_1.1.1