---
title: "Hypothesis Tests for Gaussian Means"
author: "CS 3130 / ECE 3530"
date: "November 30, 2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment="")
```

## Comparing Gaussian and Student's $t$ Distributions

First, let's review the standardized mean statistics. Say we have a random sample $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. If we knew the values of $\mu$ and $\sigma$, we could create the standardized $Z$ statistic:
\begin{equation*}
Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}.
\end{equation*}
As we learned before, this statistic has a standard normal distribution, $Z_n \sim N(0, 1)$.

Now, if we didn't know the standard deviation, $\sigma$, we could replace it with the estimated standard deviation, $S_n$. This gives us the $T$ statistic:
\begin{equation*}
T_n = \frac{\bar{X}_n - \mu}{S_n / \sqrt{n}}.
\end{equation*}
This statistic has a Student's $t$ distribution with $n - 1$ degrees of freedom, $T_n \sim t(n - 1)$.

Now, let's take a look at some R code to see how this works. First, we simulate Gaussian random variables $X_1, X_2, \ldots, X_n$ and compute the $Z_n$ and $T_n$ statistics. We'll repeat this process several times to see what the distributions of $Z_n$ and $T_n$ look like (remember, they are random variables!).

```{r}
## Number of times to repeat the experiment
numRepeats = 40000

## Sample size of each experiment (number of X_i Gaussians to simulate)
n = 10

## Mean and standard deviation of the X_i Gaussians
mu = 10
sigma = 2

## Set up a vector (aka, a 1D array) to hold our Z and T statistics
zStats = numeric(numRepeats)
tStats = numeric(numRepeats)

## Simulate!
for(i in 1:numRepeats) {
  x = rnorm(n, mean = mu, sd = sigma)

  ## Compare the code here to the Z and T formulas above
  zStats[i] = (mean(x) - mu) / (sigma / sqrt(n))
  tStats[i] = (mean(x) - mu) / (sd(x) / sqrt(n))
}

## Clamp the data just so the histogram function doesn't complain
zStats = zStats[zStats > -6 & zStats < 6]
tStats = tStats[tStats > -6 & tStats < 6]
```

Let's plot the results and see how the $Z$ and $T$ distributions compare.

```{r}
## Plot a histogram of the Z statistics
hist(zStats, freq = FALSE, xlim = c(-6,6), ylim = c(0,0.4),
     breaks = seq(-6,6,0.2), main = "Normalized Mean: Gaussian Z",
     col = "lightblue")

## Compare this to a N(0, 1) density
t = seq(-6, 6, 0.02)
lines(t, dnorm(t), lwd = 3, col = 'red')

## Plot a histogram of the T statistics
hist(tStats, freq = FALSE, xlim = c(-6,6), ylim = c(0,0.4),
     breaks = seq(-6,6,0.2), main = "Normalized Mean: Student's T",
     col = "lightblue")

## Compare this to a t(n - 1) density
lines(t, dt(t, df = n - 1), lwd = 3, col = 'red')
```

Notice they look very similar! However, if we plot the $Z$ and $T$ densities on top of each other, we can see that the $T$ distribution has "heavier tails", i.e., more probability farther from zero.

```{r}
## Comparing Gaussian Z to Student's T density
plot(t, dnorm(t), type = 'l', lwd = 3, col = 'red',
     xlim = c(-6,6), ylim = c(0,0.4),
     main = "Gaussian Z vs. Student's T")
lines(t, dt(t, df = n - 1), lwd = 3, col = 'blue')
legend("topright", c("Gaussian Z", "Student T"), col = c("red", "blue"), lwd = 3)
```

The critical values for $Z$ and $T$ are different: these are the $z_{\alpha/2}$ and $t_{\alpha/2}$ we used in confidence intervals. In the next figure, the red area is the middle 95% of the Gaussian distribution and the blue area is the middle 95% of the Student's $t$ distribution.

```{r}
x = seq(-6,6,0.1)
y = dnorm(x)
plot(x, y, type = 'l', lwd = 3, col = 'red',
     main = "Critical Value of a Gaussian Z vs. Student's T")
criticalValue = qnorm(0.975)
x = seq(-criticalValue, criticalValue, 0.1)
y = dnorm(x)
polygon(c(-criticalValue, x, criticalValue), c(0, y, 0), col = "red")
text(-0.1, 0.15, format(0.95, digits = 3))

x = seq(-6,6,0.1)
y = dt(x, df = n - 1)
lines(x, y, type = 'l', lwd = 3, col = 'blue')
criticalValue = qt(0.975, df = n - 1)
x = seq(-criticalValue, criticalValue, 0.1)
y = dt(x, df = n - 1)
polygon(c(-criticalValue, x, criticalValue), c(0, y, 0), col = rgb(0,0,1,0.5))
legend("topright", c("Gaussian Z", "Student T"), col = c("red", "blue"), lwd = 3)
```

## Single Sample Gaussian Hypothesis Test

Say we have a random sample $X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)$ and we want to do a hypothesis test that the mean $\mu$ is greater than some value $\mu_0$. Then we would set up the null and alternative hypotheses:
$$
\begin{aligned}
H_0 &: \mu = \mu_0,\\
H_1 &: \mu > \mu_0.
\end{aligned}
$$

To test this, we again compute the $T$ statistic. We assume we don't know the true $\sigma$ and have to estimate it with $S_n$. But we are also assuming the null hypothesis, so we take the value of $\mu$ to be $\mu_0$:
$$T_n = \frac{\bar{X}_n - \mu_0}{S_n / \sqrt{n}}.$$
This $T_n$ statistic follows a $t(n - 1)$ distribution under the $H_0$ assumption.

Remember, the $p$-value is the probability of the data being _more extreme_ under the null hypothesis. In this case, "more extreme" means greater than $\mu_0$, according to our alternative hypothesis $H_1$.
So, we can get our $p$-value like so:
$$p = P(T_n > t_n),$$
where $t_n$ is the observed value of the $T_n$ statistic computed from our data.

__Example:__ Let's test the hypothesis that the Michelson-Morley data was on average higher than the true speed of light:

Null hypothesis, $H_0 : \mu =$ "true speed"

Alternative hypothesis, $H_1 : \mu >$ "true speed"

```{r}
## True speed of light (in km/s, minus 299,000 km/s)
trueSpeed = 792.458

## Let's test the sample mean from just the 4th run (this was the
## closest to correct)
x = morley$Speed[morley$Expt == 4]
(sampleMean = mean(x))
sampleSigma = sd(x)
n = length(x)

## Plot this data vs. the true speed of light
boxplot(x, main = "Michelson-Morley Speed of Light Experiment",
        ylab = "speed - 299,000 km/s")
abline(h = trueSpeed, col = 'red')
```

```{r}
## Use a one-sided t test to get a p-value
(tStat = (sampleMean - trueSpeed) / (sampleSigma / sqrt(n)))
(pValue = 1 - pt(tStat, df = n - 1))

## We reject the null hypothesis if this p-value is < 0.05 (alpha)
(pValue < 0.05)

## All of these steps can be done with a single R command. (This is
## what you would use in practice!)
t.test(x - trueSpeed, alternative = "greater")
```

## Two Sample Gaussian Hypothesis Test

__Example:__ Let's look at the iris data again and test for differences between the species. Say we are interested in whether the _virginica_ and _versicolor_ species have different sepal lengths. Our null hypothesis, $H_0$, is "the average sepal lengths of the two species are equal". Our alternative hypothesis, $H_1$, is "the average sepal lengths are different".

First, we will visualize the data using boxplots. Boxplots are great for comparing two different data sets quickly.
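Alongside the boxplots, it can also help to look at simple per-group summary statistics. This is a small sketch using `tapply` on the built-in `iris` data frame (the variable names here are just for illustration):

```{r}
## Mean and standard deviation of sepal length, computed per species
## (using the full built-in iris data frame)
with(iris, tapply(Sepal.Length, Species, mean))
with(iris, tapply(Sepal.Length, Species, sd))
```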
```{r}
## Let's reduce it down to just two of the three species
iris2 = iris[iris$Species == "virginica" | iris$Species == "versicolor",]
iris2$Species = factor(iris2$Species, levels = c("versicolor", "virginica"))

## First, let's do some boxplots of the data
boxplot(Sepal.Length ~ Species, data = iris2, ylab = "Sepal Length")
```

Looks like there may be a difference, but there is some overlap in the boxplots. So, let's do the two-sample $t$-test.

```{r}
## Set up the two lists of data
vers = iris2$Sepal.Length[iris2$Species == "versicolor"]
virg = iris2$Sepal.Length[iris2$Species == "virginica"]
n = length(vers)
m = length(virg)

## Assuming equal variances, we compute the squared standard error of the
## difference in means: the pooled variance times (1/n + 1/m)
(sp = ((n - 1) * var(vers) + (m - 1) * var(virg)) / (n + m - 2) * (1/n + 1/m))

## Now, we construct the t-statistic
(tStat = (mean(vers) - mean(virg)) / sqrt(sp))

## Finally, we get a p-value by looking it up in the t distribution.
## Note this is a two-sided test because of our alternative hypothesis.
## In the two-sided test we have to calculate the probability of being "more
## extreme" to the right of |t| and to the left of -|t|
(p = (1 - pt(abs(tStat), df = m + n - 2)) + pt(-abs(tStat), df = m + n - 2))

## This is how to do this hypothesis test step-by-step.
## In practice it is easiest to use the built-in "t.test" function in R
t.test(vers, virg, var.equal = TRUE)
```

Now let's say that we had a hunch, before we started the experiment, that versicolor would have shorter sepals than virginica. Then we might have a one-sided alternative hypothesis,

$H_1$ : (versicolor mean) $<$ (virginica mean)

Now, our $p$-value changes because we only count a single side of the distribution as "more extreme". Notice the $p$-value is really just cut in half.

```{r}
(p = pt(tStat, df = m + n - 2))

## Again, the simple t.test way looks like this:
t.test(vers, virg, var.equal = TRUE, alternative = "less")
```
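One caveat worth noting: the test above assumes the two groups have equal variances (`var.equal = TRUE`). If we don't want to make that assumption, `t.test`'s default (`var.equal = FALSE`) performs Welch's $t$-test, which estimates the two variances separately and adjusts the degrees of freedom. A quick sketch on the same sepal length data:

```{r}
## Welch's two-sample t-test (R's default when var.equal is not set);
## it does not assume the two groups share a common variance
vers = iris$Sepal.Length[iris$Species == "versicolor"]
virg = iris$Sepal.Length[iris$Species == "virginica"]
t.test(vers, virg)
```

For these data the two sample variances are similar, so the Welch result is close to the pooled-variance result above, but Welch's test is the safer default when the equal-variance assumption is in doubt.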