Goals

  • Understand shrinkage/regularization of coefficients as an alternative/complement to variable selection
  • Use three specific approaches to regularization—ridge, lasso, hierarchical shrinkage—which regularize through different priors, with different consequences for sparsity
  • Recognize connections between Bayesian regularized regression and regularized regression in a machine learning / penalized MLE framework

References

  • James et al, Introduction to Statistical Learning, Ch 6
  • Hastie et al, Elements of Statistical Learning, Ch 3
  • Murphy, Machine Learning
    • Ch 7.5 (Ridge regression)
    • Ch 13 (Sparse linear models)

Experiments in sparsity through different priors:

https://betanalpha.github.io/assets/case_studies/bayes_sparse_regression.html

Introduction

What do you do when you have a bunch of potential predictor variables or covariates? Do you experiment with different combinations, or throw them all into the same model?

If you do the latter, but impose some skepticism (i.e. regularization), you can wind up with better predictive models. One way of thinking about this is that you’re deliberately adding bias in order to reduce variance. Regularization helps avoid overfitting to your data, even as you include many variables.

You can think about regularized models from a penalized MLE approach or a Bayesian approach. The first perspective is common in machine learning. Instead of picking the coefficient values that minimize the residual sum of squares (RSS), you add a penalty for large coefficients:

\[\text{argmin}[ RSS + \lambda \cdot (\text{summary of coefficient size}) ]\]

where \(\lambda \geq 0\) is the penalty parameter.

From a Bayesian perspective, this turns out to be the same as putting a prior that encourages coefficients to shrink toward zero:

\[\beta_k \sim \text{Distribution}(0, \text{scale})\]

Because Bayesians work with the whole posterior distribution rather than a single point estimate, models that are mathematically equivalent at the level of point estimates don’t always have the same practical consequences (see the lasso section below). Keep in mind that penalized MLE values correspond to posterior modes, a.k.a. MAP estimates, under the corresponding prior.
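
To see where the equivalence comes from: for a linear model with independent priors on the coefficients, the negative log posterior is, up to constants,

\[-\log p(\beta \mid y) = \frac{RSS(\beta)}{2\sigma^2} - \sum_k \log p(\beta_k) + \text{const}\]

so finding the posterior mode (the MAP estimate) is a penalized least-squares problem, with the shape of the penalty determined by the prior on the coefficients.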

Setup

Throughout, we’ll compare the Bayesian approach to regularization to a machine learning approach from the glmnet package. Install that package.
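
If you don’t have it yet, the usual CRAN install will do:

# one-time install of glmnet from CRAN
install.packages("glmnet")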

library("rstan")
library("glmnet")
library("tidyverse")

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
theme_set(theme_minimal())
knitr::opts_chunk$set(fig.align = "center")

set.seed(123)

Data

We’ll use data on prostate cancer with several clinical measures. This is a conventional example, but in practice these approaches are useful for even more variables—tens, or even hundreds.

These data are available in the lasso2 package and in the data/ folder.

data("Prostate", package = "lasso2")
Prostate <- as_tibble(Prostate)

# write the model as a formula
# the "-1" gets rid of the intercept
f <- lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45 - 1

# format for Stan
prostate_data <- list(
  y = Prostate$lpsa,
  X = model.matrix(f, data = Prostate)
)

# indices and scaling
prostate_data$N <- nrow(prostate_data$X)
prostate_data$K <- ncol(prostate_data$X)
prostate_data$X <- scale(prostate_data$X)

Notice that we’re not scaling y. Look at the models in the stan/ folder, and see if you can figure out why we don’t need to, and what the models are doing instead.

Ridge regression

In penalized maximum likelihood, “ridge regression” means using a penalty related to the squares of the coefficients:

\[\text{argmin}[ RSS + \lambda \sum \beta_k^2 ]\]

In machine learning, \(\sqrt{\sum \beta_k^2}\) is the l2-norm of the coefficient vector, so the ridge penalty is the squared l2-norm.

In Bayesian ridge regression, the equivalent is putting a normal prior on the coefficients:

\[\beta_k \sim \text{Normal}(0, \tau)\]

The global scale parameter tau is inversely related to lambda; with one common scaling of the penalized objective (and treating the error variance as fixed), the correspondence is

\[\tau = \frac{1}{\sqrt{2 \lambda}}\]

glmnet uses a series of lambda values, and cv.glmnet tries to pick the best lambda using cross-validation.

fit_ridge_glmnet <- cv.glmnet(x = prostate_data$X, y = prostate_data$y, 
                              alpha = 0)
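
If you want to inspect the cross-validated fit, lambda.min is the value that minimizes cross-validated error (lambda.1se is a more conservative alternative), and coef() gives the ridge coefficients at that value:

# the cross-validated choice of lambda, and the ridge coefficients there
fit_ridge_glmnet$lambda.min
coef(fit_ridge_glmnet, s = "lambda.min")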

In the first version of the Bayesian model, tau is fixed. Structurally this is the same as the Bayesian linear regressions we’ve seen before with weakly informative priors on the coefficients; the difference is that we now treat the prior scale as a knob controlling the amount of shrinkage.

mod_ridge1 <- stan_model("stan/ridge_regression_1.stan")

Just as glmnet uses a sequence of lambdas, we can use different fixed values of tau to achieve different amounts of shrinkage toward zero.

tau_values <- c(4, 2, 1, .5, .04)
fit_ridge1_tau1 <- sampling(mod_ridge1, data = c(prostate_data, 
                                                 tau = tau_values[1]))
fit_ridge1_tau2 <- sampling(mod_ridge1, data = c(prostate_data, 
                                                 tau = tau_values[5]))

Compare the coefficients as tau gets smaller:

print(fit_ridge1_tau1, pars = c("a", "b", "sigma"))
## Inference for Stan model: ridge_regression_1.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## a      2.48       0 0.07  2.34  2.43  2.48  2.52  2.62  4263    1
## b[1]   0.69       0 0.11  0.48  0.62  0.69  0.76  0.90  3796    1
## b[2]   0.23       0 0.09  0.06  0.16  0.23  0.28  0.39  4316    1
## b[3]  -0.15       0 0.09 -0.32 -0.20 -0.15 -0.09  0.02  4182    1
## b[4]   0.16       0 0.09 -0.02  0.10  0.16  0.21  0.33  4305    1
## b[5]   0.32       0 0.10  0.11  0.25  0.32  0.39  0.52  3971    1
## b[6]  -0.15       0 0.13 -0.41 -0.24 -0.15 -0.06  0.11  4162    1
## b[7]   0.03       0 0.12 -0.19 -0.05  0.03  0.11  0.27  3573    1
## b[8]   0.13       0 0.13 -0.13  0.04  0.13  0.21  0.37  3383    1
## sigma  0.72       0 0.06  0.62  0.68  0.71  0.75  0.84  3279    1
## 
## Samples were drawn using NUTS(diag_e) at Wed May 29 16:36:23 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
plot(fit_ridge1_tau1, pars = c("a", "b", "sigma"))
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

print(fit_ridge1_tau2, pars = c("a", "b", "sigma"))
## Inference for Stan model: ridge_regression_1.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##       mean se_mean   sd  2.5%   25%  50%  75% 97.5% n_eff Rhat
## a     2.48       0 0.10  2.28  2.41 2.48 2.54  2.67  7938    1
## b[1]  0.10       0 0.04  0.02  0.07 0.10 0.13  0.18  7645    1
## b[2]  0.05       0 0.04 -0.03  0.02 0.05 0.08  0.13  7145    1
## b[3]  0.02       0 0.04 -0.06 -0.01 0.02 0.04  0.09  7777    1
## b[4]  0.02       0 0.04 -0.05  0.00 0.02 0.05  0.10  6554    1
## b[5]  0.07       0 0.04  0.00  0.05 0.07 0.10  0.15  7065    1
## b[6]  0.06       0 0.04 -0.01  0.04 0.06 0.09  0.14  8121    1
## b[7]  0.04       0 0.04 -0.04  0.01 0.04 0.06  0.11  8944    1
## b[8]  0.05       0 0.04 -0.03  0.02 0.05 0.07  0.12  7451    1
## sigma 0.98       0 0.08  0.83  0.92 0.98 1.04  1.16  6026    1
## 
## Samples were drawn using NUTS(diag_e) at Wed May 29 16:36:24 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
plot(fit_ridge1_tau2, pars = c("a", "b", "sigma"))
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

In the second version, we do something new, using an idea you’ve seen before: we treat tau as a model parameter to estimate and give it a prior of its own. For instance (Student-t priors with more degrees of freedom, restricted to be positive, are also reasonable):

\[\tau \sim \text{Cauchy}^{+}(0, 1)\]

Compile and fit the second model:

mod_ridge2 <- stan_model("stan/ridge_regression_2.stan")
fit_ridge2 <- sampling(mod_ridge2, data = prostate_data)

Examine the results:

print(fit_ridge2, pars = c("a", "b", "sigma", "tau"))
## Inference for Stan model: ridge_regression_2.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## a      2.48       0 0.07  2.33  2.43  2.48  2.53  2.62  5506    1
## b[1]   0.62       0 0.10  0.41  0.55  0.62  0.69  0.82  4126    1
## b[2]   0.22       0 0.08  0.07  0.17  0.22  0.28  0.38  4466    1
## b[3]  -0.12       0 0.08 -0.28 -0.17 -0.12 -0.07  0.04  4135    1
## b[4]   0.14       0 0.08 -0.02  0.08  0.14  0.20  0.30  4294    1
## b[5]   0.29       0 0.10  0.10  0.22  0.29  0.35  0.48  4320    1
## b[6]  -0.07       0 0.12 -0.31 -0.15 -0.07  0.01  0.17  3574    1
## b[7]   0.04       0 0.10 -0.16 -0.03  0.04  0.12  0.25  3765    1
## b[8]   0.10       0 0.12 -0.12  0.03  0.10  0.17  0.33  3495    1
## sigma  0.72       0 0.05  0.62  0.68  0.71  0.75  0.83  4121    1
## tau    0.33       0 0.11  0.18  0.26  0.31  0.38  0.61  2814    1
## 
## Samples were drawn using NUTS(diag_e) at Wed May 29 16:36:28 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
plot(fit_ridge2, pars = c("a", "b", "sigma", "tau"))
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
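
As a rough comparison (the prior and the penalty aren’t exactly matched, so don’t expect identical numbers), you can line up the posterior means from the estimated-tau ridge model against the glmnet ridge coefficients at the cross-validated lambda:

# posterior means of b from the Bayesian ridge vs. glmnet ridge at lambda.min
bayes_ridge  <- colMeans(as.matrix(fit_ridge2, pars = "b"))
glmnet_ridge <- as.vector(coef(fit_ridge_glmnet, s = "lambda.min"))[-1]  # drop intercept
round(cbind(bayes = bayes_ridge, glmnet = glmnet_ridge), 2)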

Lasso regression

“Lasso regression” means using a penalty based on the sum of the absolute values of the coefficients (the l1-norm):

\[\text{argmin}[ RSS + \lambda \sum | \beta_k | ]\]

In the Bayesian lasso, we replace the normal priors on the coefficients with Laplace priors:

\[\beta_k \sim \text{Laplace}(0, \tau)\]
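
The connection to the l1 penalty is direct: the Laplace(0, \(\tau\)) density is \(\frac{1}{2\tau} \exp(-|\beta_k| / \tau)\), so its negative log is

\[-\log p(\beta_k \mid \tau) = \frac{|\beta_k|}{\tau} + \text{const}\]

and the MAP estimate solves an l1-penalized least-squares problem, with the penalty weight again inversely related to \(\tau\).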

Here’s the glmnet version of lasso:

fit_lasso_glmnet <- cv.glmnet(x = prostate_data$X, y = prostate_data$y, 
                              alpha = 1)
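
It’s worth looking at the cross-validation curve itself; the plot method marks both lambda.min and lambda.1se:

# cross-validated error as a function of log(lambda)
plot(fit_lasso_glmnet)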

Again, the first Bayesian model uses a fixed tau:

mod_lasso1 <- stan_model("stan/lasso_regression_1.stan")
tau_values <- c(4, 2, 1, .5, .04)
fit_lasso1_tau1 <- sampling(mod_lasso1, data = c(prostate_data, 
                                                 tau = tau_values[1]))
fit_lasso1_tau2 <- sampling(mod_lasso1, data = c(prostate_data, 
                                                 tau = tau_values[5]))
print(fit_lasso1_tau1, pars = c("a", "b", "sigma"))
## Inference for Stan model: lasso_regression_1.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## a      2.48       0 0.07  2.34  2.43  2.48  2.52  2.61  5272    1
## b[1]   0.69       0 0.11  0.48  0.62  0.69  0.76  0.90  4979    1
## b[2]   0.22       0 0.09  0.05  0.16  0.22  0.28  0.40  4302    1
## b[3]  -0.14       0 0.09 -0.31 -0.20 -0.14 -0.08  0.03  4500    1
## b[4]   0.15       0 0.09 -0.02  0.10  0.15  0.21  0.33  4290    1
## b[5]   0.31       0 0.10  0.11  0.25  0.31  0.38  0.52  4590    1
## b[6]  -0.14       0 0.13 -0.39 -0.23 -0.14 -0.05  0.11  3759    1
## b[7]   0.03       0 0.11 -0.19 -0.04  0.03  0.11  0.25  4279    1
## b[8]   0.12       0 0.12 -0.12  0.04  0.12  0.20  0.38  3937    1
## sigma  0.72       0 0.05  0.62  0.68  0.71  0.75  0.83  4721    1
## 
## Samples were drawn using NUTS(diag_e) at Wed May 29 16:36:31 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
plot(fit_lasso1_tau1, pars = c("a", "b", "sigma"))
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

print(fit_lasso1_tau2, pars = c("a", "b", "sigma"))
## Inference for Stan model: lasso_regression_1.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##        mean se_mean   sd  2.5%   25%  50%  75% 97.5% n_eff Rhat
## a      2.48       0 0.08  2.33  2.43 2.48 2.53  2.63  3573    1
## b[1]   0.53       0 0.10  0.33  0.46 0.53 0.59  0.72  2978    1
## b[2]   0.12       0 0.07 -0.01  0.07 0.12 0.17  0.27  2958    1
## b[3]  -0.01       0 0.04 -0.10 -0.03 0.00 0.02  0.08  2225    1
## b[4]   0.05       0 0.05 -0.04  0.01 0.04 0.08  0.17  2818    1
## b[5]   0.16       0 0.09  0.00  0.10 0.16 0.22  0.34  2936    1
## b[6]   0.04       0 0.06 -0.05  0.00 0.03 0.07  0.18  2405    1
## b[7]   0.03       0 0.05 -0.05  0.00 0.02 0.06  0.14  2479    1
## b[8]   0.04       0 0.05 -0.04  0.00 0.03 0.07  0.17  2113    1
## sigma  0.75       0 0.06  0.64  0.71 0.75 0.79  0.88  2979    1
## 
## Samples were drawn using NUTS(diag_e) at Wed May 29 16:36:32 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
plot(fit_lasso1_tau2, pars = c("a", "b", "sigma"))
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

Once again, rather than picking arbitrary fixed values for tau, the more Bayesian approach is to estimate it by putting a prior on it. That’s what this second model does.

mod_lasso2 <- stan_model("stan/lasso_regression_2.stan")
fit_lasso2 <- sampling(mod_lasso2, data = prostate_data)
print(fit_lasso2, pars = c("a", "b", "sigma", "tau"))
## Inference for Stan model: lasso_regression_2.
## 4 chains, each with iter=2000; warmup=1000; thin=1; 
## post-warmup draws per chain=1000, total post-warmup draws=4000.
## 
##        mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## a      2.48       0 0.07  2.33  2.43  2.48  2.53  2.62  4355    1
## b[1]   0.64       0 0.10  0.44  0.57  0.64  0.71  0.84  3775    1
## b[2]   0.21       0 0.08  0.04  0.15  0.20  0.26  0.37  4047    1
## b[3]  -0.09       0 0.08 -0.26 -0.15 -0.09 -0.04  0.05  3738    1
## b[4]   0.13       0 0.08 -0.03  0.07  0.12  0.18  0.29  3933    1
## b[5]   0.27       0 0.10  0.07  0.20  0.27  0.34  0.47  3647    1
## b[6]  -0.05       0 0.11 -0.28 -0.12 -0.04  0.02  0.15  3322    1
## b[7]   0.03       0 0.09 -0.15 -0.02  0.03  0.09  0.22  3076    1
## b[8]   0.09       0 0.10 -0.10  0.01  0.08  0.15  0.30  2747    1
## sigma  0.72       0 0.06  0.62  0.68  0.72  0.75  0.84  3970    1
## tau    0.26       0 0.12  0.11  0.18  0.24  0.31  0.57  2739    1
## 
## Samples were drawn using NUTS(diag_e) at Wed May 29 16:36:36 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).
plot(fit_lasso2, pars = c("a", "b", "sigma", "tau"))
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

Regularization paths

What happens as tau decreases? (Or as lambda, in glmnet terms, increases?) All of the coefficients are shrunk toward 0, but at different rates.

In the penalized-MLE lasso, some coefficients are set exactly to 0 once lambda is large enough, so the paths look very different for ridge and for lasso. This doesn’t happen with the posterior distributions in the Bayesian lasso: the coefficients are shrunk toward zero but never exactly to zero, as you can confirm from the coefficient plots above.
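
One quick way to see the difference (a rough check, using the more conservative lambda.1se for glmnet):

# glmnet's lasso typically sets some coefficients exactly to zero...
sum(as.vector(coef(fit_lasso_glmnet, s = "lambda.1se"))[-1] == 0)
# ...while the Bayesian lasso's posterior means are shrunk but not exactly zero
sum(colMeans(as.matrix(fit_lasso2, pars = "b")) == 0)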

p_ridge <- glmnet(x = prostate_data$X, y = prostate_data$y, alpha = 0)
p_lasso <- glmnet(x = prostate_data$X, y = prostate_data$y, alpha = 1)
plot(p_ridge, xvar = "lambda")

plot(p_lasso, xvar = "lambda")
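
If you want to see where the cross-validated choice falls on the lasso path, you can replot it and add a vertical line (plot.glmnet puts log(lambda) on the x-axis when xvar = "lambda"):

# replot the lasso path and mark the cross-validated lambda.min
plot(p_lasso, xvar = "lambda")
abline(v = log(fit_lasso_glmnet$lambda.min), lty = 2)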

Hierarchical shrinkage

The Bayesian lasso doesn’t get us sparsity, but can we get there? What kinds of prior shapes would encourage sparsity? (That is, a few relatively large coefficients, and many coefficients very close to zero.)

We can use different global-local scale mixtures of normal distributions as our priors to encourage more sparsity. (You’ve seen that the Student-t distribution is one of these scale mixtures, and the Laplace distribution behind the lasso is one too.)

We combine the global scale for the coefficient priors, tau, with a coefficient-specific local scale lambda_k. (Sorry, there aren’t enough Greek letters to go around; this lambda is unrelated to glmnet’s penalty parameter.)

\[\beta_{k} \sim \text{Normal}(0, \lambda_k \tau)\]

(Written Stan-style, so \(\lambda_k \tau\) is the standard deviation.)

We draw those lambdas from some distribution.

\[\lambda_k \sim \text{Cauchy}^{+}(0, 1)\]

If we use a half-Cauchy for the local scales, this is the “horseshoe” prior, a special case of the hierarchical shrinkage prior (hs() in rstanarm).
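
To build intuition for why this combination encourages sparsity, it can help to simulate from the prior itself. This is just a prior sketch with an arbitrary fixed global scale, separate from the model fits below:

# draw coefficients from the horseshoe prior: half-Cauchy local scales
# times a fixed global scale. Most draws land very near zero, but the
# heavy tails allow the occasional large coefficient
n_sims       <- 1e4
lambda_local <- abs(rcauchy(n_sims))   # half-Cauchy(0, 1) local scales
tau_global   <- 0.1                    # arbitrary global scale, for illustration only
beta_sim     <- rnorm(n_sims, 0, lambda_local * tau_global)
quantile(beta_sim, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))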

As usual, we can fix tau or put a prior on it:

mod_hs1 <- stan_model("stan/hierarchical_shrinkage_1.stan")
mod_hs2 <- stan_model("stan/hierarchical_shrinkage_2.stan")

You can see that these models are harder to sample from, as signaled by divergent transitions even at a very high adapt_delta:

fit_hs1 <- sampling(mod_hs1, data = c(prostate_data, tau = tau_values[4]), 
                    control = list(adapt_delta = .999, 
                                   max_treedepth = 15))
## Warning: There were 8 divergent transitions after warmup. Increasing adapt_delta above 0.999 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems

Increasing adapt_delta makes sampling slower, and even increasing it a great deal doesn’t prevent all divergent transitions in this case. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup for more on this warning.

plot(fit_hs1, pars = "b")
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

plot(fit_hs1, pars = "lambda")
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)

fit_hs2 <- sampling(mod_hs2, data = prostate_data, 
                    control = list(adapt_delta = .999, 
                                   max_treedepth = 15))
## Warning: There were 6 divergent transitions after warmup. Increasing adapt_delta above 0.999 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: Examine the pairs() plot to diagnose sampling problems
plot(fit_hs2, pars = c("lambda", "tau"))
## ci_level: 0.8 (80% intervals)
## outer_level: 0.95 (95% intervals)
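
Finally, assuming the fits above all ran (and that the hierarchical shrinkage model names its coefficients b like the others), a compact way to compare the three estimated-tau models is to line up the posterior means of the coefficients and see how differently the priors treat the smaller effects:

# posterior means of the coefficients under the three estimated-tau priors
round(cbind(
  ridge = colMeans(as.matrix(fit_ridge2, pars = "b")),
  lasso = colMeans(as.matrix(fit_lasso2, pars = "b")),
  hs    = colMeans(as.matrix(fit_hs2, pars = "b"))
), 2)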