Goals

  • Understand how to incorporate covariates into a Bayesian model
  • Fit that model to a data set and explore the output
  • Simulate from the posterior distribution to explore model fit

Relevant readings:

  • Kruschke Chapter 17 (simple linear regression)
  • Kruschke Chapter 18 (multiple linear regression)

Extending the normal model

We’d like to extend a normal model of data to incorporate a single covariate.

This is the model we’re starting with:

data {
  int N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);
  sigma ~ exponential(1);
  y ~ normal(mu, sigma);
}

We’ll need a couple things:

  • a vector of x values
  • parameters for the intercept (alpha) and slope (beta)
  • the linear equation relating those to different values of mu
  • priors for our new parameters instead of for mu (we’ll use “weakly informative” priors)

Open stan/normal.stan and we’ll make the appropriate changes.

Data: centering and scaling

We’ll use data from the pscl package, which is a collection of statistical tools and data sets for political science.

install.packages("pscl")

First, load the data into R. We’ll tweak the format a little to be in line with contemporary best practices.

library(tidyverse)
library(pscl)

# data
df <- 
  as_tibble(unionDensity) %>%
  rownames_to_column(var = "country")

Second, center the data (to have a mean of 0) and scale it (to have a standard deviation of 1) using the scale() function.

Note that scale() returns a matrix, which we don’t want in this case! The [, 1] bit turns it into a vector (as.vector(union) would do the job as well).

# scale data 
df_scaled <- 
  df %>%
  mutate(
    union = scale(union)[, 1],
    left = scale(left)[, 1]
  )

summary(df_scaled)
##    country              union               left              size       
##  Length:20          Min.   :-1.57656   Min.   :-1.1154   Min.   : 4.394  
##  Class :character   1st Qu.:-0.90333   1st Qu.:-0.9087   1st Qu.: 7.567  
##  Mode  :character   Median : 0.05786   Median :-0.0930   Median : 8.196  
##                     Mean   : 0.00000   Mean   : 0.0000   Mean   : 8.391  
##                     3rd Qu.: 0.86707   3rd Qu.: 0.6460   3rd Qu.: 9.713  
##                     Max.   : 1.51097   Max.   : 2.1955   Max.   :11.439  
##      concen     
##  Min.   :0.860  
##  1st Qu.:1.125  
##  Median :1.520  
##  Mean   :1.402  
##  3rd Qu.:1.595  
##  Max.   :2.060

Then, we format the data for Stan:

d <- list(
  N = length(df_scaled$union), 
  y = df_scaled$union, 
  x = df_scaled$left
)

Fitting the model

library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
# fit model

# print model fit

Quick comparison to OLS/MLE:

fit_ols <- lm(union ~ left, data = df_scaled)
summary(fit_ols)
## 
## Call:
## lm(formula = union ~ left, data = df_scaled)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8204 -0.5476 -0.1897  0.5764  1.5046 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -8.500e-17  1.689e-01   0.000  1.00000   
## left         6.780e-01  1.733e-01   3.913  0.00102 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7552 on 18 degrees of freedom
## Multiple R-squared:  0.4597, Adjusted R-squared:  0.4296 
## F-statistic: 15.31 on 1 and 18 DF,  p-value: 0.001019

Diagnostics

# trace plots

Also look at and interpret Rhat, ESS (n_eff), and MCSE (se_mean) in the model output above. You can plot these too, but that’s more useful when there are more parameters.

Coefficients

Look at the coefficient output above and plot the coefficients below.

# coefficient plots

# pairs plot

Posterior predictive distribution

We don’t just care about coefficients—we want to assess how well our model fits our data. To do this, we can see what values of y would be generated using the posterior distribution of the parameters. In other words, these simulated y values take into account and carry forward the uncertainty in our parameter estimates.

You could do this in R using the posterior draws from all the coefficients… but we’re already stepping through our data and sampling parameter values, so why not generate some y values along the way?

So, the most efficient way to generate posterior predictive values is from inside Stan, using a new code block called generated quantities:

generated quantities {
  // simulate data from the posterior
  vector[N] y_rep;
  for (i in 1:N) {
    y_rep[i] = normal_rng(mu[i], sigma);
  }
}

Add this block to stan/normal.stan. When we’re interested in calculating model fit statistics, we’ll use this same block to calculate log-likelihood values.

# recompile and refit the model

Posterior predictive checks

One of the features of the bayesplot package is a family of functions for visual posterior predictive checks, ppc_*. These let you plot some feature or statistic of the simulated y values against the actual data.

In some cases, we’ll be able to use all the simulated values to calculate a summary statistic. In other cases, we’ll want to plot only a subset of the y_rep values.

library(bayesplot)

# extract y_rep draws from the stanfit object

# plot posterior predictive density overlay

You can read more about posterior predictive checks in bayesplot here:

http://mc-stan.org/bayesplot/articles/graphical-ppcs.html

On your own: multiple covariates

How would you extend the regression model to handle multiple covariates?

You could do it one-by-one: vector[N] x1;, vector[N] x2;… but that’s not very flexible. Instead, we’ll use a design matrix:

# scale the rest of the data  
df_scaled <- 
  df_scaled %>%
  mutate(size = scale(size)[, 1], 
         concen = scale(concen)[, 1])

# it's a convention to use an upper-case X for a matrix of xs
X <- 
  df_scaled %>%
  select(left, size, concen) %>%
  as.matrix()

# alternatively (and this is better when you want to create dummy variables!)
# you'll want to drop the intercept using `[, -1]`
X <- model.matrix(union ~ left + size + concen, data = df_scaled)[, -1]

You’ll need to update the Stan model with these things:

  • K for the number of covariates
  • an N x K matrix of X values
  • a K length vector of beta parameters to estimate

As long as the x values are on the same scale, you can use the same prior for each beta.

Give it a try yourself, and see if you can fit the Stan model using more than one covariate!

# fit model

You can read about matrices in Stan here: https://mc-stan.org/docs/2_19/reference-manual/vector-and-matrix-data-types.html

After you’ve tinkered with the model a bit, you can find an example here: https://mc-stan.org/docs/2_19/stan-users-guide/linear-regression.html