---
title: "Discrete Distributions"
author: "Tom Fletcher"
date: "September 19, 2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = "")
```

In this example, we will see how to use the Binomial and Geometric distributions in R. Remember, if you ever want to read the help file for a command, type a question mark followed by the command name. For example, type this at the R command prompt:
```{r, eval=FALSE}
?pbinom
```

## Plotting the Binomial and Geometric pmf and cdf

The "seq" command creates a sequence of numbers. Here are the integers from 0 to 20.
```{r}
(x = seq(0, 20))
```

R has built in functions for the pmf or cdf for many different types of random variables. A prefix of 'd' is for the density (or pmf), and the prefix of 'p' is for the cdf. Here is an example of the binomial distribution pmf:
```{r}
dbinom(5, 10, 0.5)
```

The 5 is the 'k' variable (number of successes), the 10 is the 'n' variable (number of trials), and the 0.5 is the 'p' variable (probability of success).

Now lets plot the pmf for $\mathrm{Bin}(20, 0.25)$ like so:
```{r}
plot(x, dbinom(x, 20, 0.5))
```

I like to plot pmf's with vertical bars:
```{r}
plot(x, dbinom(x, 20, 0.25), type = 'h', lwd = 6, main = "Binom(20, 0.25) PMF")
```

Notice we are using the 'x' sequence we defined above, and plotting all values for $x = 0,1, \ldots, 20$.

Similarly we can plot the cdf:
```{r}
plot(x, pbinom(x, 20, 0.25))
```

Creating a step function will produce a better cdf plot. (Note that we have to concatenate a 0 at the begining of ` pbinom`.) Type ` ?plot.stepfun` to see the help on plotting step functions.
```{r}
plot(stepfun(x, c(0, pbinom(x, 20, 0.25))), lwd = 5, do.points = FALSE, main = "Binom(20, 0.25) CDF")
```

You can create functions to make your life easier. Here I'll set up pmf and cdf plots with all the options above, just so we don't have to type it all in again. Note the "..." in the parameter list means we will pass all remaining parameters through to the plot function.
```{r}
pmf.plot = function(x, y, ...) 
{ plot(x, y, type = 'h', lwd = 5, ylab = "p(x)", ...) }

cdf.plot = function(x, y, ...)
{ plot(stepfun(x, c(0, y)), lwd = 5, do.points = FALSE, ylab = "F(x)", ...) }
```

You might also try out the geometric distribution. Notice there is no longer an 'n' parameter.
```{r}
pmf.plot(x, dgeom(x, 0.5), main = "Geom(0.5) PMF")
cdf.plot(x, pgeom(x, 0.5), main = "Geom(0.5) CDF")
```

Try plotting these distributions with different values of 'p' and 'n'. How does the shape of the pmf change?


## Example Problems

### 1. If I roll a six-sided die 12 times, what is the probability of rolling a 6 exactly twice?

This is a Binomial distribution problem because we run exactly $n = 12$ trials of a Bernoulli experiment. Let $X \sim \mathrm{Bin}(n = 12, p = 1/6)$. Then we have:

$P(X = 2) =$
```{r}
dbinom(2, size = 12, prob = 1/6)
```

Notice R is just using the formula we learned in class. This is the same number:

$P(X = 2) = {{12}\choose{2}} (\frac{1}{6})^2 (\frac{5}{6})^{10} =$
```{r}
choose(n = 12, k = 2) * (1/6)^2 * (5/6)^10
```

### 2. Same 12 rolls, what is the probability of rolling _no more_ than 2 sixes?

Here we use the Binomial cdf to get the probability of _less than or equal to_ 2 sixes, that is, $P(X \leq 2)$:

$P(X \leq 2) =$
```{r}
pbinom(2, size = 12, prob = 1/6)
```

Again, we could have computed this ourselves by summing the pmf to get the cdf:

$P(X \leq 2) = \sum_{k=0}^2 {{12}\choose{k}} (\frac{1}{6})^k (\frac{5}{6})^{12 - k} =$
```{r}
kseq = seq(0, 2)
sum(choose(n = 12, k = kseq) * (1/6)^kseq * (5/6)^(12 - kseq))
```

But it is easier to use the built-in R function `pbinom` directly!

### 3. If I roll a die until I first get a six, what is the probability that this will take exactly 4 rolls?

This is a Geometric distribution problem because we are going to keep going until our first "success" (flipping a heads). So, let's define our random variable as $X \sim \mathrm{Geom}(p = 1/6)$.

$P(X = 4) =$
```{r}
dgeom(3, prob = 1/6)
```

___WARNING:___ R uses the "number of failures" for the value of $X$, where the book and the notes use "the first success". This means you need to pass $k - 1$ to `dgeom` if you want $P(X = k)$.

Again, we can check this versus the equation for the Geometric pmf:

$P(X = 4) = (\frac{5}{6})^3 (\frac{1}{6}) = `r (5/6)^3 * (1/6)`$

### 4. Again, say I roll a die until I get a six. What is the probability that this will take no more than 4 rolls?

Here, the key phrase is "no more than 4". So, we want a Geometric __cdf__ to get $P(X \leq 4)$. This is

$P(X \leq 4) = \sum_{k=1}^4 (\frac{5}{6})^k (\frac{1}{6}) =$
```{r}
pgeom(3, prob = 1/6)
```

Or, doing it the hard way:
```{r}
kseq = seq(0, 3)
sum((5/6)^kseq * (1/6))
```


### 5. Simulate the 12 rolls of a die, repeated 10,000 times. What is the estimated probability of getting exactly 2 sixes?

The command `rbinom` will generate a simulation from a Binomial random variable. You have to give it (1) the number of simulations, (2) `size` the number of trials (we called this $n$ in class), and (3) `prob` the probability of success (we called this $p$).

```{r}
numSims = 10000
x = rbinom(numSims, size = 12, prob = 1/6)
```

Just for fun, let's take a look at the first several of these simulated random values:
```{r}
x[1:50]
```

Now, the probability of getting exactly 2 sixes is approximately:
```{r}
sum(x == 2) / numSims
```

### 6. Simulate rolling a die until you get your first six, repeated 10,000 times. What is the estimated probability that this takes no more than 4 rolls?

Now we want to simulate a Geometric random variable. This looks like:
```{r}
x = rgeom(numSims, prob = 1/6)
sum(x <= 3) / numSims
```

You should confirm that the last two simulations are close to the true probabilities we computed above.

### Exercise for you: think about what other types of questions we could answer with these basic tools. Try playing with the inputs to the R functions used above to answer different problems.