library(tidyverse)

# set the ggplot theme
theme_set(theme_minimal())

# center-align figures
knitr::opts_chunk$set(fig.align = "center")

# set a random seed and the number of simulations
set.seed(123)
nsims <- 10000

# install this package if you don't have it: 
# install.packages("LaplacesDemon")

Simulating by sampling

Why is simulation important?

If you have a known distribution (a prior, or a likelihood) you can simulate data from it to understand how it will behave.
If you don’t know a (posterior) distribution, you can still approximate it by sampling from it.
To check the behavior of your model, you can simulate more data from the (posterior predictive) distribution using sampled parameter values.

Discrete distribution: Poisson

The Poisson distribution is useful for modeling count data; it’s a discrete distribution that only takes on integer values >= 0.

First, we can use dpois to analytically calculate the density for a Poisson distribution with \(\lambda = 5\). (We stop at an arbitrary large value of x.)

xs <- 0:15
densities <- dpois(x = xs, lambda = 5)

df <- data_frame(x = xs, densities = densities) 
ggplot(df, aes(x = x, y = densities)) + geom_col()

Then, if we draw samples from the same distribution, we should get something close to the theoretical densities.

poisson_samples <- rpois(n = nsims, lambda = 5)

# first, plot counts
data_frame(samples = poisson_samples) %>%
  ggplot(aes(x = samples)) + 
  geom_bar()

# then, divide counts by number of samples
# to get probability masses
data_frame(samples = poisson_samples) %>%
  count(samples) %>%
  mutate(frac = n/nsims) %>%
  ggplot(aes(x = samples, y = frac)) + 
  geom_col()

Continuous distribution: Gamma

The gamma distribution is a continuous probability distribution. We can plot an approximation of the shape from the true densities using a grid of values.

We’ll use \(Gamma(2, 0.1)\)—a shape parameter of 2, and a rate (or inverse scale) parameter of 0.1. Try changing these values and see how the distribution changes.

(Why’d I pick this shape and rate? It’s a recommended prior choice for something in Stan.)

# analytic densities
xs <- seq(0, 120, by = 2)

gamma_densities <- dgamma(x = xs, shape = 2, rate = 0.1)

data_frame(x = xs, densities = gamma_densities) %>%
  ggplot(aes(x = x, y = densities)) +
  geom_point(size = .5) + 
  geom_line()

Since this is a continuous distribution, we’ll use a kernel density (rather than a histogram) to plot the samples:

# densities from samples
gamma_samples <- rgamma(n = nsims, shape = 2, rate = 0.1)

data_frame(samples = gamma_samples) %>%
  ggplot(aes(x = samples)) + 
  geom_density()

If you know the theoretical distribution, another way to compare the samples to it is with a quantile-quantile plot:

data_frame(samples = gamma_samples) %>%
  ggplot(aes(sample = samples)) + 
  geom_qq(distribution = qgamma, 
          dparams = list(shape = 2, rate = 0.1)) + 
  geom_qq_line(distribution = qgamma,
               dparams = list(shape = 2, rate = 0.1))

Key thing to notice here: the approximation is less good in the long tail.

Summarizing

Visually inspecting distributions is always a good idea! But quantitative summaries are also useful for describing distributions.

Single values

There are multiple ways to summarize the central tendency of a distribution with a single value, including the mean, median, and mode (the value with the highest probability density). For some distributions (e.g. the normal distribution), these values are the same … but that’s not always true.

If you’re giving a point estimate for a parameter, one of these values would be what you’d report.

We can calculate those values from the Poisson samples:

mean(poisson_samples)

## [1] 4.9746

median(poisson_samples)

## [1] 5

# mode (there's no built-in R function)
data_frame(samples = poisson_samples) %>%
  count(samples) %>%
  filter(n == max(n))

## # A tibble: 1 x 2
##   samples     n
##     <int> <int>
## 1       4  1812

# Or, from the LaplacesDemon package:
LaplacesDemon::Mode(poisson_samples)

## [1] 4

And from the gamma samples:

mean(gamma_samples)

## [1] 19.83828

median(gamma_samples)

## [1] 16.60305

LaplacesDemon::Mode(gamma_samples)

## [1] 10.93485

Intervals

We can also use intervals to summarize distributions. If these intervals are for estimates of a parameter in a posterior distribution, we call them credible intervals. There are two common intervals used to summarize distributions: percentiles and HDIs (highest density intervals).

For the Poisson distribution, the 50% credible intervals:

# percentiles
summary(poisson_samples)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   5.000   4.975   6.000  16.000

quantile(poisson_samples, probs = c(.25, .75))

## 25% 75% 
##   3   6

# HDIs
# the coda package can calculate HDIs
# but only if we pretend our samples came from MCMC
coda::HPDinterval(coda::as.mcmc(poisson_samples), 
                  prob = .5)

##      lower upper
## var1     4     6
## attr(,"Probability")
## [1] 0.5

For the gamma distribution:

quantile(gamma_samples, probs = c(.25, .75))

##       25%       75% 
##  9.591237 26.491549

coda::HPDinterval(coda::as.mcmc(gamma_samples), 
                  prob = .5)

##         lower   upper
## var1 3.738561 18.3761
## attr(,"Probability")
## [1] 0.5

Note that these are fairly different!

We can plot all of these possibilities:

df <- data_frame(samples = gamma_samples)
p <- ggplot(df, aes(x = samples)) + geom_density()
 
p + 
  geom_vline(xintercept = mean(gamma_samples), color = "blue") +
  labs(title = "Mean")

p +   
  geom_vline(xintercept = median(gamma_samples), color = "purple") + 
  geom_vline(xintercept = quantile(gamma_samples, probs = c(.25, .75)), 
             color = "purple", linetype = "dashed") +
  labs(title = "Median and 25%-75% percentile")

hdi <- coda::HPDinterval(coda::as.mcmc(gamma_samples), prob = .5)

p + 
  geom_vline(xintercept = LaplacesDemon::Mode(gamma_samples), color = "red") +
  geom_vline(xintercept = c(hdi[1], hdi[2]),  
             color = "red", linetype = "dashed") +
  labs(title = "Mode and 50% HDI")

What should you use? It depends! The mean is commonly used in Bayesian statistics, though rstan and bayesplot use the median and percentiles by default. The mode is less common, but it has a nice correspondence to the maximum likelihood estimate. In a Bayesian context, this is called maximum a posteriori estimation.

The normal distribution

Monte Carlo simulation

Approximating probability masses and densities through sampling is called Monte Carlo simulation. For continuous distributions, we’re approximating an integral. You can do this for arbitrary parts of the distribution:

normal_samples <- rnorm(nsims, mean = 0, sd = 1)

# how much probability is between -1 and +1 sd in a normal distribution?
sum(normal_samples >= -1 & normal_samples <= 1) / nsims

## [1] 0.679

# how accurate is this? 
pnorm(1) - pnorm(-1)

## [1] 0.6826895

# what about 0 and .5 sd?
sum(normal_samples >= 0 & normal_samples <= .5) / nsims

## [1] 0.1842

pnorm(.5) - pnorm(0)

## [1] 0.1914625

Try changing nsims and see how the approximation improves.

Sampling and information

Information content (also called surprisal) is a way of characterizing the amount of information gained from sampling. It’s related to probability: \(-log(p)\). \(p\) comes from the distribution you choose to model the sample, so how informative or surprising you think each observation is depends on what model you choose for it.

The expected information content of a sample is the entropy, which is \(- \sum p_i log(p_i)\) for a discrete variable. For continuous distributions, there are often analytic solutions for entropy: https://en.wikipedia.org/wiki/Differential_entropy#Differential_entropies_for_various_distributions

# calculate surprisal for our normal samples
df <- 
  data_frame(samples = normal_samples) %>%
  rowid_to_column(var = "n") %>%               # index for each observation
  mutate(p = dnorm(samples, mean = 0, sd = 1), # probability of each observation
         surprisal = -log(p),                  # information content of each obs
         cumulative_sum = cumsum(surprisal),   # sample surprisal up to n
         cumulative_mean = cummean(surprisal)) # average sample surprisal to n
  

# calculate entropy
ent_normal <- 0.5 * log(2 * pi * exp(1) * 1^2)
ent_normal

## [1] 1.418939

# plot average surprisal 
ggplot(df, aes(x = n, y = cumulative_mean)) + 
  geom_line() + 
  labs(y = "Surprisal (cumulative average)") + 
  geom_hline(yintercept = ent_normal, linetype = "dashed", color = "blue")

If you choose the wrong parameters or wrong model, the average sample surprisal will converge to something higher than the actual entropy.

Practice

Discrete distribution: binomial. Simulate from the binomial distribution with some N and \(\pi\) and plot/summarize. Compare to the analytic density.
Continuous distribution: beta. Look up the beta distribution. Simulate from it with a couple different values of the two shape parameters, then plot/summarize. Compare to the analytic density.

Notice that the beta distribution is continuous and bounded between 0 and 1, which makes it convenient for representing probabilities. For next week, think about what we might be able to do if we made a grid of probabilities rather than Xs for the binomial distribution.

Appendix

Session Info

sessionInfo()

## R version 3.5.3 (2019-03-11)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bindrcpp_0.2.2  forcats_0.3.0   stringr_1.3.1   dplyr_0.7.5    
##  [5] purrr_0.2.5     readr_1.1.1     tidyr_0.8.1     tibble_1.4.2   
##  [9] ggplot2_3.1.0   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.4     reshape2_1.4.3       haven_1.1.1         
##  [4] lattice_0.20-38      colorspace_1.4-0     htmltools_0.3.6     
##  [7] yaml_2.1.19          utf8_1.1.4           rlang_0.3.0.1       
## [10] pillar_1.2.3         foreign_0.8-71       glue_1.3.1          
## [13] withr_2.1.2          modelr_0.1.2         readxl_1.1.0        
## [16] bindr_0.1.1          plyr_1.8.4           munsell_0.5.0       
## [19] gtable_0.2.0         cellranger_1.1.0     rvest_0.3.2         
## [22] coda_0.19-1          LaplacesDemon_16.1.0 psych_1.8.4         
## [25] evaluate_0.10.1      labeling_0.3         knitr_1.20          
## [28] parallel_3.5.3       broom_0.4.4          Rcpp_1.0.0          
## [31] scales_1.0.0         backports_1.1.2      jsonlite_1.6        
## [34] mnormt_1.5-5         hms_0.4.2            digest_0.6.18       
## [37] stringi_1.2.3        grid_3.5.3           rprojroot_1.3-2     
## [40] cli_1.0.0            tools_3.5.3          magrittr_1.5        
## [43] lazyeval_0.2.1       crayon_1.3.4         pkgconfig_2.0.1     
## [46] xml2_1.2.0           lubridate_1.7.4      assertthat_0.2.0    
## [49] rmarkdown_1.10       httr_1.3.1           rstudioapi_0.7      
## [52] R6_2.3.0             nlme_3.1-137         compiler_3.5.3

Simulating and summarizing distributions

Lab 2, CSSS 564

Connor Gilroy