STAT 530, Fall 2016
--------------------

Homework 1
-----------

1) Suppose our multivariate data have sample covariance matrix

   S = [  2  -3   2
         -3   6   4
          2   4   3 ]

   Note you can define this matrix in R with the code:

   my.S <- matrix(c(2,-3,2,-3,6,4,2,4,3), byrow=T, nrow=3, ncol=3)

   a) Based on this covariance matrix, how many columns (variables) does the original data matrix have? Can you tell how many rows the original data matrix has?

   b) Find the inverse of S.

   c) Find and write the correlation matrix for this data set.

   NOTE: The R functions 'matrix' and 'solve' can help with this problem. In R code: solve(M) will give the inverse of a matrix M.

2) Suppose a multivariate data set has sample covariance matrix

   S = [ 16  -2   4
         -2   9  -1
          4  -1  25 ]

   (See the hint in problem 1 for how to define a matrix in R.)

   a) Determine the matrix D^{-1/2}, where D^{-1/2} is defined in the Chapter 1 notes.

   b) Calculate the sample correlation matrix R for this data set.

   NOTE: In R code: M %*% N performs the matrix multiplication of M times N.

3) The air pollution data set (from chapter 2) is given on the course web page. For this problem, we will focus only on the first 16 observations (cities). You can read the data into R (as a data frame) with the code:

   airpol.full <- read.table("http://www.stat.sc.edu/~hitchcock/airpoll.txt", header=T)
   city.names <- as.character(airpol.full[1:16,1])
   airpol.data.sub <- airpol.full[1:16,2:8]
   # Perform your analysis on the 'airpol.data.sub' subset.

   a) Use R to calculate the sample covariance matrix and the sample correlation matrix for this data subset. Identify which pairs of variables seem to be strongly associated. Write a paragraph describing the nature (strength and direction) of the relationship between these variable pairs.

   NOTE: Use the information about the variables given in Section 2.2 of the book (e.g., top of page 19) to help you in your discussions about the variables.
   b) Use R to calculate the distance matrix for these observations (after scaling the variables by dividing each variable by its standard deviation). Write a paragraph describing some of the most similar pairs of cities and some of the most different pairs of cities, giving evidence from the distance matrix.

   c) Give a plot that will help assess whether this data set comes from a multivariate normal distribution. What is your conclusion, based on the plot?

   GRADUATE STUDENTS ONLY (extra credit for undergrads): Regardless of your answer to 3(c), attempt some Box-Cox transformation(s) on the 'airpol.data.sub' data. Can any transformation(s) improve the multivariate normality? Discuss this in a paragraph.
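
EXAMPLE (not a solution): The R functions named in the notes above can be tried out on a stand-in data set before working the problems. The sketch below uses the first 10 rows of R's built-in 'USArrests' data, which is not the homework data. It assumes D^{-1/2} is the diagonal matrix holding the reciprocals of the sample standard deviations, so check that against the Chapter 1 notes. The chi-square Q-Q plot of squared Mahalanobis distances is one common way to assess multivariate normality and may not be the plot the course intends. All object names are illustrative.

```r
# Stand-in data: first 10 rows of the built-in 'USArrests' data frame.
# This is NOT the homework data; it only illustrates the function calls.
x <- USArrests[1:10, ]
n <- nrow(x)
p <- ncol(x)

# Sample covariance matrix and its inverse.
S <- cov(x)
S.inv <- solve(S)

# Assumed definition: D^{-1/2} is diagonal with entries 1/sqrt(s_ii).
D.inv.half <- diag(1 / sqrt(diag(S)))

# Correlation matrix computed as D^{-1/2} %*% S %*% D^{-1/2}.
R.mat <- D.inv.half %*% S %*% D.inv.half
dimnames(R.mat) <- dimnames(S)

# Divide each variable by its standard deviation (no centering),
# then compute Euclidean distances between the rows.
x.scaled <- scale(x, center = FALSE, scale = apply(x, 2, sd))
d.mat <- dist(x.scaled)

# Chi-square Q-Q plot: ordered squared Mahalanobis distances against
# chi-square quantiles with p degrees of freedom.
d2 <- mahalanobis(x, colMeans(x), S)
chi.q <- qchisq((1:n - 0.5) / n, df = p)
plot(chi.q, sort(d2),
     xlab = "Chi-square quantiles",
     ylab = "Ordered squared Mahalanobis distances")
abline(0, 1)
```

If the assumed definition of D^{-1/2} matches the notes, R.mat should agree with cor(x), which gives a quick check on the matrix arithmetic. Points falling close to the reference line in the plot are consistent with multivariate normality, although with so few observations the plot is only a rough guide.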