---
title: "Hypothesis Tests for Gaussian Means"
author: "CS 3130 / ECE 3530"
date: "November 30, 2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment="")
```

## Comparing Gaussian and Student's $t$ Distributions

First, let's review the standardized mean statistics. Say we have a random
sample $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. If we knew the values for $\mu$ and $\sigma$, we could create the standardized $Z$ statistic:
\begin{equation*}
Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}.
\end{equation*}
As we learned before, this statistic has a standard normal distribution,
$Z \sim N(0, 1)$.

Now, if we didn't know the standard deviation, $\sigma$, we could replace this
with the estimated standard deviation, $S_n$. This gives us the $T$ statistic:
\begin{equation*}
T_n = \frac{\bar{X}_n - \mu}{S_n / \sqrt{n}}.
\end{equation*}
This statistic has a Student's $t$ distribution with $n - 1$ degrees of freedom,
$T_n \sim t(n - 1)$.

Now, let's take a look at some R code to see how this works. First, we simulate
Gaussian random variables $X_1, X_2, \ldots, X_n$ and compute the $Z_n$ and
$T_n$ statistics. We'll repeat this process several times to see what the
distributions of the $Z_n$ and $T_n$ look like (remember, they are random
variables!).

```{r}
## Number of times to repeat the experiment
numRepeats = 40000

## Sample size of each experiment (number of X_i Gaussians to simulate)
n = 10

## Mean and standard deviation of the X_i Gaussians
mu = 10
sigma = 2

## Set up a vector (aka, a 1D array) to hold our Z and T statistics
zStats = numeric(numRepeats)
tStats = numeric(numRepeats)

## Simulate!
for(i in 1:numRepeats) {
  x = rnorm(n, mean = mu, sd = sigma)
  ## Compare the code here to the Z and T formulas above
  zStats[i] = (mean(x) - mu) / (sigma / sqrt(n))
  tStats[i] = (mean(x) - mu) / (sd(x) / sqrt(n))
}

## Trim values outside (-6, 6) so they fit within the histogram breaks below
zStats = zStats[zStats > -6 & zStats < 6]
tStats = tStats[tStats > -6 & tStats < 6]
```

Let's plot the results and see how the $Z$ and $T$ distributions compare.
```{r}
## Plot a histogram of the Z statistics
hist(zStats, freq = FALSE, xlim = c(-6,6), ylim = c(0,0.4),
     breaks = seq(-6,6,0.2), main = "Normalized Mean: Gaussian Z",
     col = "lightblue")

## Compare this to a N(0, 1) density
t = seq(-6, 6, 0.02)
lines(t, dnorm(t), lwd = 3, col = 'red')

## Plot a histogram of the T statistics
hist(tStats, freq = FALSE, xlim = c(-6,6), ylim = c(0,0.4),
     breaks = seq(-6,6,0.2), main = "Normalized Mean: Student's T",
     col = "lightblue")

## Compare this to a t(n - 1) density
lines(t, dt(t, df = n - 1), lwd = 3, col = 'red')
```

Notice they look very similar! However, if we plot the $Z$ and $T$ densities on
top of each other, we can see that the $T$ distribution has "heavier tails",
i.e., more probability farther from zero.
```{r}
## Comparing Gaussian Z to Student T density
plot(t, dnorm(t), type = 'l', lwd = 3, col = 'red',
     xlim = c(-6,6), ylim = c(0,0.4),
     main = "Gaussian Z vs. Student's T")
lines(t, dt(t, df = n - 1), lwd = 3, col = 'blue')
legend("topright", c("Gaussian Z", "Student T"),
       col = c("red", "blue"), lwd = 3)
```
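We can also quantify the heavier tails numerically. As a quick sketch, here is the probability of landing more than 3 units from zero under each distribution (using the same `n` as above):

```{r}
## Two-sided tail probability beyond +/- 3 under each distribution
2 * pnorm(-3)           ## Gaussian: about 0.0027
2 * pt(-3, df = n - 1)  ## t(9): about 0.015
```

The $t(9)$ distribution puts roughly five times as much probability out in the tails.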

The critical values for $Z$ and $T$ are different; these are the $z_{\alpha/2}$ and $t_{\alpha/2}$ values we used in confidence intervals. In the next figure, the red area is the middle 95% of the Gaussian distribution and the blue area is the middle 95% of the Student's $t$ distribution.
```{r}
x = seq(-6,6,0.1)
y = dnorm(x)
plot(x, y, type = 'l', lwd = 3, col = 'red',
     main = "Critical Value of a Gaussian Z vs. Student's T")

criticalValue = qnorm(0.975)
x = seq(-criticalValue, criticalValue, 0.1)
y = dnorm(x)
polygon(c(-criticalValue, x, criticalValue), c(0, y, 0), col = "red")
text(-0.1, 0.15, format(0.95, digits = 3))

x = seq(-6,6,0.1)
y = dt(x, df = n - 1)
lines(x, y, type = 'l', lwd = 3, col = 'blue')

criticalValue = qt(0.975, df = n - 1)
x = seq(-criticalValue, criticalValue, 0.1)
y = dt(x, df = n - 1)
polygon(c(-criticalValue, x, criticalValue), c(0, y, 0), col = rgb(0,0,1,0.5))

legend("topright", c("Gaussian Z", "Student T"),
       col = c("red", "blue"), lwd = 3)
```
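The critical values themselves make the gap concrete. For a middle 95% area, the $t(n-1)$ cutoff is noticeably larger than the Gaussian one:

```{r}
qnorm(0.975)           ## Gaussian critical value, about 1.96
qt(0.975, df = n - 1)  ## t(9) critical value, about 2.26
```

This is why confidence intervals built from the $t$ distribution are wider than $z$-based ones, especially for small $n$.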

## Single Sample Gaussian Hypothesis Test

Say we have a random sample $X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)$ and we want to do a hypothesis test that the mean $\mu$ is greater than some value
$\mu_0$. Then we would set up the null and alternative hypotheses:

$$
\begin{aligned}
  H_0 &: \mu = \mu_0,\\
  H_1 &: \mu > \mu_0.
\end{aligned}
$$

To test this, we again compute the $T$ statistic. We assume we don't know the
true $\sigma$ and have to estimate it with $S_n$. But we are also assuming the
null hypothesis, so we take the value of $\mu$ to be $\mu_0$:

$$T_n = \frac{\bar{X}_n - \mu_0}{S_n / \sqrt{n}}.$$

This $T_n$ statistic follows a $t(n - 1)$ distribution under the $H_0$
assumption. Remember, the $p$-value is the probability of observing data at least as _extreme_ as ours under the null hypothesis. In this case, "more extreme" means a $T_n$ value larger than the one we observed, since our alternative hypothesis $H_1$ points in the positive direction. So, we can get our $p$-value like so:

$$p = P(T_n > t_n),$$
where $t_n$ is the observed value of the $T_n$ statistic computed from our data.

__Example:__ Let's test the hypothesis that the Michelson-Morley speed-of-light
measurements were on average higher than the true speed of light:

$$
\begin{aligned}
  H_0 &: \mu = \text{"true speed"},\\
  H_1 &: \mu > \text{"true speed"}.
\end{aligned}
$$

```{r}
## True speed of light (in km/s minus 299,000 km/s)
trueSpeed = 792.458

## Let's test the sample mean from just the 4th run (this was the
## closest to correct)
x = morley$Speed[morley$Expt == 4]
(sampleMean = mean(x))
sampleSigma = sd(x)
n = length(x)

## Plot this data vs. the true speed of light
boxplot(x, main = "Michelson-Morley Speed of Light Experiment", ylab = "speed - 299,000 km/s")
abline(h = trueSpeed, col = 'red')
```

```{r}
## Use a one-sided t test to get a p-value
(tStat = (sampleMean - trueSpeed) / (sampleSigma / sqrt(n)))
(pValue = 1 - pt(tStat, df = n - 1))

## We reject the null hypothesis if this p-value is < 0.05 (alpha)
(pValue < 0.05)

## All of these steps can be done with a single R command. (This is
## what you would use in practice!)
t.test(x - trueSpeed, alternative = "greater")
```
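Note that `t.test` can also take the null-hypothesis mean directly through its `mu` argument, rather than shifting the data by hand; this is the same test:

```{r}
t.test(x, mu = trueSpeed, alternative = "greater")
```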


## Two Sample Gaussian Hypothesis Test

__Example:__ Let's look at the iris data again and test for differences between
the species. Say we are interested in whether the _virginica_ and _versicolor_
species have different sepal lengths. Our null hypothesis, $H_0$, is "the average
sepal lengths of the two species are equal". Our alternative hypothesis, $H_1$,
is "the average sepal lengths are different".

First, we will visualize the data using boxplots. Boxplots are great for
comparing two different data sets quickly.
```{r}
## Let's reduce it down to just two of the three species
iris2 = iris[iris$Species == "virginica" | iris$Species == "versicolor",]
iris2$Species = factor(iris2$Species, levels = c("versicolor", "virginica"))

## First, let's do some boxplots of the data
boxplot(Sepal.Length ~ Species, data = iris2, ylab = "Sepal Length")
```

Looks like there may be a difference, but there is some overlap in the
boxplots. So, let's do the two sample $t$-test.
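For reference, the equal-variance two-sample $t$ statistic we are about to compute uses the pooled variance of the two samples (of sizes $n$ and $m$):
$$
S_p^2 = \frac{(n - 1)S_X^2 + (m - 1)S_Y^2}{n + m - 2},
\qquad
T = \frac{\bar{X}_n - \bar{Y}_m}{S_p \sqrt{1/n + 1/m}},
$$
which follows a $t(n + m - 2)$ distribution under $H_0$.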

```{r}
## Set up the two lists of data
vers = iris2$Sepal.Length[iris2$Species == "versicolor"]
virg = iris2$Sepal.Length[iris2$Species == "virginica"]
n = length(vers)
m = length(virg)

## Assuming equal variances, we compute the pooled variance times (1/n + 1/m),
## i.e., the squared standard error of the difference in means
(sp = ((n - 1) * var(vers) + (m - 1) * var(virg)) / (n + m - 2) * (1/n + 1/m))

## Now, we construct the t-statistic
(tStat = (mean(vers) - mean(virg)) / sqrt(sp))

## Finally, we get a p-value by looking up in the t distribution.
## Note this is a two-sided test because of our alternative hypothesis.
## In a two-sided test we have to calculate the probability of being "more
## extreme" to the right of |t| and to the left of -|t|
(p = (1 - pt(abs(tStat), df = m + n - 2)) + pt(-abs(tStat), df = m + n - 2))

## This is how to do this hypothesis test step-by-step.
## In practice it is easiest to use the built-in "t.test" function in R
t.test(vers, virg, var.equal = TRUE)
```

Now let's say that we had a hunch that versicolor would have shorter sepals
than virginica before we started the experiment. Then we might have a
one-sided alternative hypothesis,

$H_1$ : (versicolor mean) $<$ (virginica mean)

Now, our $p$-value changes because we only count a single side of the
distribution as "more extreme". Notice the $p$-value is exactly half the
two-sided one, because the observed statistic falls on the side predicted
by $H_1$.

```{r}
(p = pt(tStat, df = m + n - 2))

## Again, the simple t.test way looks like this:
t.test(vers, virg, var.equal = TRUE, alternative = "less")
```
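As a quick check, we can extract `p.value` from both `t.test` calls (using the `vers` and `virg` vectors defined above) and confirm the halving relationship:

```{r}
pTwo = t.test(vers, virg, var.equal = TRUE)$p.value
pOne = t.test(vers, virg, var.equal = TRUE, alternative = "less")$p.value
all.equal(pOne, pTwo / 2)
```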