##############################################################################
# This is an example of simple linear regression in R.
# This code will analyze data from a Simple Linear Regression (SLR) model.
# The data given here are the house size and house price
# from the example we studied in class.
# I am calling the data set "SizePrice".
# The independent variable is size.
# The response (dependent) variable is price.
##############################################################################

# Reading the data into R:

my.datafile <- tempfile()
cat(file=my.datafile, "
0.951 30
1.036 39.9
0.676 46.5
1.456 48.6
1.186 51.5
1.456 56.99
1.368 59.9
0.994 62.5
1.176 65.5
1.216 69
1.41 76.9
1.344 79
1.064 79.9
1.77 79.95
1.524 82.9
1.75 84.9
1.152 85
1.77 87.9
1.624 89.9
1.54 89.9
1.532 93.5
1.647 94.9
1.344 95.8
1.55 98.5
1.752 99.5
1.45 99.9
1.312 102
1.636 106
1.5 108.9
1.8 109.9
1.972 110
1.387 112.29
2.082 114.9
NA 119.5
2.463 119.9
2.572 119.9
2.113 122.9
2.016 123.938
1.852 124.9
2.67 126.9
2.336 129.9
1.98 132.9
2.483 134.9
2.809 135.9
2.036 139.5
2.298 139.99
2.038 144.9
2.37 147.6
2.921 149.99
2.262 152.55
2.456 156.9
2.436 164
1.92 167.5
2.949 169.9
3.31 175
2.805 179
2.553 179.9
2.51 189.5
3.627 199
", sep=" ")

options(scipen=999)   # suppressing scientific notation

SizePrice <- read.table(my.datafile, header=FALSE,
                        col.names=c("size", "price"))

# Note we could also save the data columns into a file and use a command such as:
# SizePrice <- read.table(file = "z:/stat_516/filename.txt", header=FALSE,
#                         col.names = c("size", "price"))

attach(SizePrice)

# The data frame called SizePrice is now created,
# with two variables, size and price.

##############################################################################

# Let's do a scatter plot to see if a linear relationship is appropriate.
# In R, we list the independent variable first.

plot(size, price)

# The linear relationship seems reasonable based on the plot.
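# One row of the data has a missing size value (NA); by default lm() will
# silently drop incomplete cases (na.action = na.omit).  A minimal sketch of
# checking for this before fitting -- the small three-row data frame below is
# a stand-in for the full SizePrice data, used so the sketch runs on its own:

```r
# Sketch: counting missing values and seeing how many cases lm() uses.
# A small hypothetical data frame stands in for SizePrice here.
d <- data.frame(size  = c(0.951, NA, 1.456),
                price = c(30, 119.5, 48.6))

colSums(is.na(d))                  # NAs per column: one missing size
fit <- lm(price ~ size, data = d)  # lm() drops the incomplete row
nobs(fit)                          # number of cases actually used
```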
# Let's use the lm() function to estimate the slope and intercept of the
# linear model.  Let's also plot the estimated regression line over our
# scatterplot.  The statement price ~ size tells R that the dependent
# variable is price and the independent variable is size.

pricesize.reg <- lm(price ~ size)
summary(pricesize.reg)

# R gives a nice plot of the points with the
# estimated regression line overlain on it.

plot(size, price); abline(pricesize.reg)

# So our least-squares regression line is mu-hat = 5.43 + 56.08 X.
# How should we interpret these parameter estimates?

##############################################################################

# The ANOVA table partitions the sums of squares:

anova(pricesize.reg)

# Note TSS = 93232, SSR = 71534, SSE = 21698.
# Note that TSS = SSR + SSE.
# Our estimate of sigma^2, MSE, can be found in the ANOVA table as 387.46904.
# Note that our estimate of sigma, the "Residual Standard Error" as R calls
# it, is 19.68 (given above in the output of the summary() command).

##############################################################################

# Testing whether the true slope is zero:
# We test H_0: beta_1 = 0 against the two-tailed alternative.
# The test statistic value t for this test is given as 13.59
# and the P-value is nearly 0.  We reject H_0 and conclude that
# the inclusion of X in the model is warranted.

# What about a 95% CI for beta_1?
# Note that the estimated slope is 56.083 and its standard error is given
# as 4.128.  From table A2, t_(.025) = 2.004 (for n - 2 = 56 d.f.).
# The 95% CI for beta_1 is (56.083 - 2.004*4.128, 56.083 + 2.004*4.128).
# Therefore the 95% CI for beta_1 is (47.81, 64.36).
# How do we interpret this CI?

##############################################################################

# INFERENCES ABOUT THE RESPONSE VARIABLE

# We want to (1) estimate the mean selling price for houses of size
# 1750 sq. feet and (2) predict the selling price of a new house of size
# 1750 sq. feet.
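# The hand calculation of the CI for the slope (estimate +/- t * SE) can also
# be done directly with confint(), which takes the critical value and standard
# error from the fitted model, so no t table is needed.  A minimal sketch,
# using R's built-in cars data (stopping distance vs. speed) rather than the
# SizePrice data so that it runs on its own:

```r
# Sketch: confint() gives the CI for the slope without a t table.
# Built-in cars data used for illustration, not the SizePrice data.
fit <- lm(dist ~ speed, data = cars)
confint(fit, "speed", level = 0.95)  # 95% CI for the slope
```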
x.value <- data.frame(size = 1.750)
predict(pricesize.reg, x.value)

# This shows the prediction for the house which was 1750 square feet.
# We see the predicted selling price for this house is about
# 103.577 thousand dollars.

# Getting the 95% confidence interval for the mean response at X = 1750:

x.value <- data.frame(size = 1.750)
predict(pricesize.reg, x.value, interval="confidence", level=0.95)

# Getting the 95% prediction interval for a new observation with X = 1750:

x.value <- data.frame(size = 1.750)
predict(pricesize.reg, x.value, interval="prediction", level=0.95)

# The 95% CI for the mean selling price of all houses of 1750 square feet
# is between 98.284 and 108.871 thousand dollars.
# The 95% prediction interval for the selling price of a new house that is
# 1750 square feet is between 63.791 and 143.363 thousand dollars.

##############################################################################

# One way to find the correlation coefficient r is to take the square root
# of r^2 (r^2 is given in the summary() output).  Be sure to give r the
# same sign as the estimated slope of the regression.
# For this example, r^2 is given as 0.7673 (how do we interpret this?) and
# so r = (0.7673)^0.5 = 0.876.  We know r is positive since the slope,
# 56.08, is positive.

# With the cor() function, R will give us the correlation coefficient r
# directly (use="complete.obs" drops the row with the missing size):

cor(price, size, use="complete.obs")

# R shows that r = 0.87594.

cor.test(price, size)$p.value  # cor.test() drops incomplete cases itself

# This also gives the P-value for the two-tailed test of whether the
# population correlation coefficient is 0.  This P-value is (near) 0, so
# we reject the null and conclude that the true correlation between size
# and price is not zero.
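# The square-root-of-r^2 recipe above can also be carried out
# programmatically, attaching the sign of the estimated slope.  A sketch
# using the built-in cars data so it is self-contained:

```r
# Sketch: recover r from summary()$r.squared, signed by the slope.
# Built-in cars data used for illustration, not the SizePrice data.
fit <- lm(dist ~ speed, data = cars)
r <- sign(coef(fit)[["speed"]]) * sqrt(summary(fit)$r.squared)
r  # agrees with cor(cars$speed, cars$dist)
```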
##############################################################################

# The following R code will produce a residual plot and a Q-Q plot of
# the residuals:

# Residual plot:

plot(fitted(pricesize.reg), resid(pricesize.reg)); abline(h=0)

# Q-Q plot of the residuals:

qqnorm(resid(pricesize.reg))

##############################################################################
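# A reference line from qqline() makes the Q-Q plot easier to judge, and
# calling plot() on the fitted model object gives R's standard diagnostic
# panels.  A sketch, again on the built-in cars data rather than SizePrice:

```r
# Sketch: Q-Q plot with a reference line, plus R's built-in lm diagnostics.
# Built-in cars data used for illustration, not the SizePrice data.
fit <- lm(dist ~ speed, data = cars)
qqnorm(resid(fit)); qqline(resid(fit))  # points near the line suggest normality
par(mfrow = c(2, 2)); plot(fit)         # residual, Q-Q, scale-location, leverage
par(mfrow = c(1, 1))
```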