library(tidyverse)
# set the ggplot theme
theme_set(theme_minimal())
# center-align figures
knitr::opts_chunk$set(fig.align = "center")
# set a random seed and the number of simulations
set.seed(123)
nsims <- 10000
# install this package if you don't have it:
# install.packages("LaplacesDemon")
Why is simulation important?
If you have a known distribution (a prior, or a likelihood) you can simulate data from it to understand how it will behave.
If you don’t know a (posterior) distribution, you can still approximate it by sampling from it.
To check the behavior of your model, you can simulate more data from the (posterior predictive) distribution using sampled parameter values.
The Poisson distribution is useful for modeling count data; it’s a discrete distribution that only takes on integer values >= 0.
First, we can use dpois
to analytically calculate the density for a Poisson distribution with \(\lambda = 5\). (We stop at an arbitrary large value of x.)
xs <- 0:15
densities <- dpois(x = xs, lambda = 5)
df <- data_frame(x = xs, densities = densities)
ggplot(df, aes(x = x, y = densities)) + geom_col()
Then, if we draw samples from the same distribution, we should get something close to the theoretical densities.
poisson_samples <- rpois(n = nsims, lambda = 5)
# first, plot counts
data_frame(samples = poisson_samples) %>%
ggplot(aes(x = samples)) +
geom_bar()
# then, divide counts by number of samples
# to get probability masses
data_frame(samples = poisson_samples) %>%
count(samples) %>%
mutate(frac = n/nsims) %>%
ggplot(aes(x = samples, y = frac)) +
geom_col()
The gamma distribution is a continuous probability distribution. We can plot an approximation of the shape from the true densities using a grid of values.
We’ll use \(Gamma(2, 0.1)\)—a shape parameter of 2, and a rate (or inverse scale) parameter of 0.1. Try changing these values and see how the distribution changes.
(Why’d I pick this shape and rate? It’s a recommended prior choice for something in Stan.)
# analytic densities
xs <- seq(0, 120, by = 2)
gamma_densities <- dgamma(x = xs, shape = 2, rate = 0.1)
data_frame(x = xs, densities = gamma_densities) %>%
ggplot(aes(x = x, y = densities)) +
geom_point(size = .5) +
geom_line()
Since this is a continuous distribution, we’ll use a kernel density (rather than a histogram) to plot the samples:
# densities from samples
gamma_samples <- rgamma(n = nsims, shape = 2, rate = 0.1)
data_frame(samples = gamma_samples) %>%
ggplot(aes(x = samples)) +
geom_density()
If you know the theoretical distribution, another way to compare the samples to it is with a quantile-quantile plot:
data_frame(samples = gamma_samples) %>%
ggplot(aes(sample = samples)) +
geom_qq(distribution = qgamma,
dparams = list(shape = 2, rate = 0.1)) +
geom_qq_line(distribution = qgamma,
dparams = list(shape = 2, rate = 0.1))
Key thing to notice here: the approximation is less good in the long tail.
Visually inspecting distributions is always a good idea! But quantitative summaries are also useful for describing distributions.
There are multiple ways to summarize the central tendency of a distribution with a single value, including the mean, median, and mode (the value with the highest probability density). For some distributions (e.g. the normal distribution), these values are the same … but that’s not always true.
If you’re giving a point estimate for a parameter, one of these values would be what you’d report.
We can calculate those values from the Poisson samples:
mean(poisson_samples)
## [1] 4.9746
median(poisson_samples)
## [1] 5
# mode (there's no built-in R function)
data_frame(samples = poisson_samples) %>%
count(samples) %>%
filter(n == max(n))
## # A tibble: 1 x 2
## samples n
## <int> <int>
## 1 4 1812
# Or, from the LaplacesDemon package:
LaplacesDemon::Mode(poisson_samples)
## [1] 4
And from the gamma samples:
mean(gamma_samples)
## [1] 19.83828
median(gamma_samples)
## [1] 16.60305
LaplacesDemon::Mode(gamma_samples)
## [1] 10.93485
We can also use intervals to summarize distributions. If these intervals are for estimates of a parameter in a posterior distribution, we call them credible intervals. There are two common intervals used to summarize distributions: percentiles and HDIs (highest density intervals).
For the Poisson distribution, the 50% credible intervals:
# percentiles
summary(poisson_samples)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 5.000 4.975 6.000 16.000
quantile(poisson_samples, probs = c(.25, .75))
## 25% 75%
## 3 6
# HDIs
# the coda package can calculate HDIs
# but only if we pretend our samples came from MCMC
coda::HPDinterval(coda::as.mcmc(poisson_samples),
prob = .5)
## lower upper
## var1 4 6
## attr(,"Probability")
## [1] 0.5
For the gamma distribution:
quantile(gamma_samples, probs = c(.25, .75))
## 25% 75%
## 9.591237 26.491549
coda::HPDinterval(coda::as.mcmc(gamma_samples),
prob = .5)
## lower upper
## var1 3.738561 18.3761
## attr(,"Probability")
## [1] 0.5
Note that these are fairly different!
We can plot all of these possibilities:
df <- data_frame(samples = gamma_samples)
p <- ggplot(df, aes(x = samples)) + geom_density()
p +
geom_vline(xintercept = mean(gamma_samples), color = "blue") +
labs(title = "Mean")
p +
geom_vline(xintercept = median(gamma_samples), color = "purple") +
geom_vline(xintercept = quantile(gamma_samples, probs = c(.25, .75)),
color = "purple", linetype = "dashed") +
labs(title = "Median and 25%-75% percentile")
hdi <- coda::HPDinterval(coda::as.mcmc(gamma_samples), prob = .5)
p +
geom_vline(xintercept = LaplacesDemon::Mode(gamma_samples), color = "red") +
geom_vline(xintercept = c(hdi[1], hdi[2]),
color = "red", linetype = "dashed") +
labs(title = "Mode and 50% HDI")
What should you use? It depends! The mean is commonly used in Bayesian statistics, though rstan
and bayesplot
use the median and percentiles by default. The mode is less common, but it has a nice correspondence to the maximum likelihood estimate. In a Bayesian context, this is called maximum a posteriori estimation.
Approximating probability masses and densities through sampling is called Monte Carlo simulation. For continuous distributions, we’re approximating an integral. You can do this for arbitrary parts of the distribution:
normal_samples <- rnorm(nsims, mean = 0, sd = 1)
# how much probability is between -1 and +1 sd in a normal distribution?
sum(normal_samples >= -1 & normal_samples <= 1) / nsims
## [1] 0.679
# how accurate is this?
pnorm(1) - pnorm(-1)
## [1] 0.6826895
# what about 0 and .5 sd?
sum(normal_samples >= 0 & normal_samples <= .5) / nsims
## [1] 0.1842
pnorm(.5) - pnorm(0)
## [1] 0.1914625
Try changing nsims
and see how the approximation improves.
Information content (also called surprisal) is a way of characterizing the amount of information gained from sampling. It’s related to probability: \(-log(p)\). \(p\) comes from the distribution you choose to model the sample, so how informative or surprising you think each observation is depends on what model you choose for it.
The expected information content of a sample is the entropy, which is \(- \sum p_i log(p_i)\) for a discrete variable. For continuous distributions, there are often analytic solutions for entropy: https://en.wikipedia.org/wiki/Differential_entropy#Differential_entropies_for_various_distributions
# calculate surprisal for our normal samples
df <-
data_frame(samples = normal_samples) %>%
rowid_to_column(var = "n") %>% # index for each observation
mutate(p = dnorm(samples, mean = 0, sd = 1), # probability of each observation
surprisal = -log(p), # information content of each obs
cumulative_sum = cumsum(surprisal), # sample surprisal up to n
cumulative_mean = cummean(surprisal)) # average sample surprisal to n
# calculate entropy
ent_normal <- 0.5 * log(2 * pi * exp(1) * 1^2)
ent_normal
## [1] 1.418939
# plot average surprisal
ggplot(df, aes(x = n, y = cumulative_mean)) +
geom_line() +
labs(y = "Surprisal (cumulative average)") +
geom_hline(yintercept = ent_normal, linetype = "dashed", color = "blue")
If you choose the wrong parameters or wrong model, the average sample surprisal will converge to something higher than the actual entropy.
Notice that the beta distribution is continuous and bounded between 0 and 1, which makes it convenient for representing probabilities. For next week, think about what we might be able to do if we made a grid of probabilities rather than Xs for the binomial distribution.
sessionInfo()
## R version 3.5.3 (2019-03-11)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bindrcpp_0.2.2 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.5
## [5] purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2
## [9] ggplot2_3.1.0 tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.4 reshape2_1.4.3 haven_1.1.1
## [4] lattice_0.20-38 colorspace_1.4-0 htmltools_0.3.6
## [7] yaml_2.1.19 utf8_1.1.4 rlang_0.3.0.1
## [10] pillar_1.2.3 foreign_0.8-71 glue_1.3.1
## [13] withr_2.1.2 modelr_0.1.2 readxl_1.1.0
## [16] bindr_0.1.1 plyr_1.8.4 munsell_0.5.0
## [19] gtable_0.2.0 cellranger_1.1.0 rvest_0.3.2
## [22] coda_0.19-1 LaplacesDemon_16.1.0 psych_1.8.4
## [25] evaluate_0.10.1 labeling_0.3 knitr_1.20
## [28] parallel_3.5.3 broom_0.4.4 Rcpp_1.0.0
## [31] scales_1.0.0 backports_1.1.2 jsonlite_1.6
## [34] mnormt_1.5-5 hms_0.4.2 digest_0.6.18
## [37] stringi_1.2.3 grid_3.5.3 rprojroot_1.3-2
## [40] cli_1.0.0 tools_3.5.3 magrittr_1.5
## [43] lazyeval_0.2.1 crayon_1.3.4 pkgconfig_2.0.1
## [46] xml2_1.2.0 lubridate_1.7.4 assertthat_0.2.0
## [49] rmarkdown_1.10 httr_1.3.1 rstudioapi_0.7
## [52] R6_2.3.0 nlme_3.1-137 compiler_3.5.3