--- title: "Linear Regression" author: "CS 3130 / ECE 3530" date: "December 5, 2017" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, comment = NULL) ``` ## Linear Regression of the Janka Hardness Data We'll go through the regression example from the book, and I'll show how to do this type of analysis in R. This is the "Janka Hardness vs. Wood Density" data, see Table 15.5 and Ch 22 You can find the data here: http://www.coe.utah.edu/~cs3130/lectures/janka.txt (just save the text file in your current working directory for R) First, we load the data into a dataframe like so: ```{r} janka = read.table("janka.txt", header = TRUE) ``` The `header=TRUE` option tells R to use the first line as column names Try typing just `janka` into the command line to see the raw data: ```{r} janka ``` A scatterplot of the data is simple in R ```{r} plot(janka, main = "Hardness vs. Density of Timber") ``` This is the same as calling the following command (but notice the axes are labeled different) ```{r} plot(janka$hardness ~ janka$dens, main = "Hardness vs. Density of Timber") ``` The `y ~ x` means y is the dependent variable and x is the independent variable. The `lm` command stands for "linear model" and will fit a linear regression to paired data. It uses the same `y ~ x` syntax. ```{r} janka.model = lm(hardness ~ dens, data = janka) ``` The output of `lm` is an object with lots of information about the linear regression fit. Type the following command to see what variables it contains. ```{r} names(janka.model) ``` The first item to look at is the "coefficients" this is the intercept (alpha) and slope (beta) ```{r} janka.model$coefficients ``` The command "abline" will add a line to the plot with a certain intercept a (first parameter) and slope b (second parameter) ```{r} plot(janka$hardness ~ janka$dens, main = "Hardness vs. Density of Timber") abline(janka.model$coefficients[1], janka.model$coefficients[2], col = "red", lwd = 3) ``` The abline command is smart enough to know when you pass it a regression fit. The above command can be simplified like so: ```{r} plot(janka$hardness ~ janka$dens, main = "Hardness vs. Density of Timber") abline(janka.model, col = "red", lwd = 3) ``` Hypothesis testing of the intercept and slope parameters is easy in R. Here the null hypothesis is that the intercept (or respectively, slope) is equal to zero. The alternative hypothesis is that the intercept (respectively, slope) is not equal to zero. The p-values you see in the table are for these two hypotheses. ```{r} summary(janka.model) ``` The next thing we can look at is the residuals between the data and the model For the hypothesis testing assumptions to be valid, these should be roughly Gaussian: ```{r} hist(janka.model$residuals, main = "Residuals of Janka Regression") ``` Another way to visualize if the residuals seem Gaussian distributed is a Q-Q plot, which stands for "Quantile-Quantile" plot. It plots the empirical quantiles of the data (y axis) versus the theoretical quantiles of a Gaussian distribution (x axis). If the data comes from a Gaussian, the points should lie on a 45 degree line. See the Wikipedia page for more info: https://en.wikipedia.org/wiki/Q-Q_plot ```{r} qqnorm(janka.model$residuals / sd(janka.model$residuals)) abline(0, 1) ``` Here we plot the residual values vs. the independent variable. According to our regression assumptions, there shouldn't be a trend in these residuals. That is, the residuals should be i.i.d. regardless of the X value. ```{r} plot(janka.model$residuals ~ janka$hardness, main = "Residuals vs. Hardness") ``` ## Sea Slug Example Here's another example: the "Sea Slug" data from this URL: http://www.stat.ucla.edu/projects/datasets Try going through the same procedures with this data set. ```{r} seaslug = read.table("seaslug.txt", header = TRUE) ``` Before getting started, remove all entries with `Time=99` (this is a control group that should be removed). ```{r} seaslug = seaslug[seaslug$Time != 99,] ``` ```{r} ssModel = lm(seaslug$Percent ~ seaslug$Time) plot(seaslug, main = "Percent Sea Slug Metamorphosized vs. Time") abline(ssModel, col = "red") ``` ```{r} summary(ssModel) ``` I also recommend searching for "sea slug" in Google images :) ## HW 8 Hint! Here is an example for writing a function in R that takes two vectors (`x` and `y`) as inputs and returns two values in a list (like your function `my.regression` should do). ```{r} foo = function(x, y) { a = sum(x^2 + y^2) b = mean(cos(x) + sin(y)) return(list(a = a, b = b)) } ``` Example running the `foo()` function: ```{r} x = runif(10) y = rnorm(10) ans = foo(x, y) print(ans$a) print(ans$b) ```