STAT 530, Fall 2016
--------------------
Homework 3
-----------

#### Only graduate students are required to do problem 5 below.  It is extra credit for undergraduates.
#### Undergraduates must do problems 1, 2, 3 and 4 below.
#### Graduate students must do all five problems.


# 1) Suppose our multivariate data have covariance matrix

        [ 5  0  0 ]
    S = [ 0  9  0 ]
        [ 0  0  9 ]

a) Find the eigenvalues and eigenvectors of S.

HINT: You can use the 'eigen' function in R; e.g., eigen(M) gives the eigenvalues and eigenvectors of some matrix M.

b) Determine all three principal components for such a data set, using a PCA based on S.

c) What can you say about the principal components associated with eigenvalues that are the same value?

HELPFUL NOTE: In R, we can do a PCA where the covariance matrix is input rather than the data matrix using code such as:

my.pc <- princomp(covmat=my.S)

where my.S is the covariance matrix.  (A short R sketch putting these pieces together is given after problem 4 below.)


# 2) Suppose a multivariate data set has sample covariance matrix

        [ 36  5 ]
    S = [  5  4 ]

a) Determine both principal components for such a data set, using a PCA based on S.  (Refer to the HELPFUL NOTE from problem 1.)

b) Determine the correlation matrix R that corresponds to the covariance matrix S.

c) Determine both principal components for such a data set, using a PCA based on R.  Are the PCA results different from those in part (a)?  If so, try to explain why they are different.

HELPFUL NOTE 2: We can do a PCA where the correlation matrix is input rather than the data matrix using code such as:

my.pc <- princomp(covmat=my.R)

where my.R is the correlation matrix.

R TIP: The code

M <- matrix(c(w,x,y,z), nrow=2, ncol=2, byrow=T)

(where w,x,y,z are some numbers) will produce a 2-by-2 matrix M with entries:

    [ w  x ]
    [ y  z ]

(A second R sketch, for problem 2, is also given after problem 4 below.)


########################################################################################################
NOTE: For EACH of the following problems, also write (at least) a couple of paragraphs explaining the choices you made in doing the PCA and explaining in words what substantive conclusions about the data you can draw from the PCA.  You can use relevant results and graphics to support your conclusions.
########################################################################################################

#### Only graduate students are required to do problem 5.  It is extra credit for undergraduates.


# 3) Do problem 3.5 from the textbook.  See HELPFUL NOTE 2 and R TIP above, which will also help with this problem.


# 4) Do problem 3.6 from the textbook.  Don't worry about the scatterplot matrix; just perform an appropriate PCA, including the appropriate display of PC scores.

*The "Contents of Foodstuffs" data set (in Table 3.6) is given on the course web page.  Full descriptions of the observation names are on p. 63 of the book.  This R code will read in the data:

food.full <- read.table("http://www.stat.sc.edu/~hitchcock/foodstuffs.txt", header=T)
food.labels <- as.character(food.full[,1])
food.data <- food.full[,-1]
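
R SKETCH FOR PROBLEM 1 (optional): Below is one minimal way to put the HINT and HELPFUL NOTE from problem 1 together.  It is only a sketch, not required code; the object names my.S and my.pc simply follow the notation in the HELPFUL NOTE.

my.S <- matrix(c(5, 0, 0,
                 0, 9, 0,
                 0, 0, 9), nrow=3, ncol=3, byrow=T)   # the covariance matrix from problem 1

eigen(my.S)                      # eigenvalues and eigenvectors of S (part a)

my.pc <- princomp(covmat=my.S)   # PCA based on the covariance matrix S (part b)
summary(my.pc)                   # standard deviations of the PCs and proportions of variance
my.pc$loadings                   # the principal component coefficient vectors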
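
R SKETCH FOR PROBLEM 2 (optional): A rough illustration, assuming you enter the matrix as in the R TIP.  Here cov2cor() is one convenient way, built into R though not mentioned above, to get the correlation matrix corresponding to a covariance matrix; pc.S and pc.R are just placeholder names.

my.S <- matrix(c(36, 5,
                  5, 4), nrow=2, ncol=2, byrow=T)   # the covariance matrix from problem 2

my.R <- cov2cor(my.S)            # the corresponding correlation matrix (part b)

pc.S <- princomp(covmat=my.S)    # PCA based on the covariance matrix (part a)
pc.R <- princomp(covmat=my.R)    # PCA based on the correlation matrix (part c)

pc.S$loadings
pc.R$loadings                    # compare the two sets of loadings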
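
R SKETCH FOR PROBLEMS 3 AND 4 (optional): One possible workflow for problem 4 is sketched below (a similar approach works for problem 3).  It assumes the food.data and food.labels objects created by the data-reading code above, and it happens to use cor=T; whether a covariance-based or correlation-based PCA is more appropriate is one of the choices you should discuss in your write-up.  The object name food.pc is just a placeholder.

food.pc <- princomp(food.data, cor=T)   # cor=T bases the PCA on the correlation matrix
summary(food.pc, loadings=T)            # variance explained and loadings for each PC

# One appropriate display of PC scores: plot the scores on the first two PCs,
# labeling each point with its observation name.
plot(food.pc$scores[,1], food.pc$scores[,2], xlab="PC 1", ylab="PC 2", type="n")
text(food.pc$scores[,1], food.pc$scores[,2], labels=food.labels, cex=0.7)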
# 5) Do problem 3.7 from the textbook.

*** For 3.7: When the problem talks about "those [results] given in the text derived from using a robust estimate of the correlation matrix," it is talking about the PCA discussed starting at the bottom of p. 56, whose results are presented in Table 3.5 on p. 61.  You should read this section of the book carefully.

*The U.S. air pollution data set (from Chapter 3, different from the Chapter 2 air pollution data) is given on the course web page.  This R code will read in the data:

USairpol.full <- read.table("http://www.stat.sc.edu/~hitchcock/usair.txt", header=T)
city.names <- as.character(USairpol.full[,1])
USairpol.data <- USairpol.full[,-1]

# To remove one or two outlying observations, try code such as:

out.row1 <-   # Put the row number of the most severe outlier here (after the arrow)
USairpol.data.reduced <- USairpol.data[-out.row1,]

out.row2 <-   # Put the row number of another outlier here (after the arrow)
USairpol.data.reduced.more <- USairpol.data[-c(out.row1,out.row2),]
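
R SKETCH FOR PROBLEM 5 (optional): One way to start the comparison asked for in problem 3.7 is sketched below.  It assumes the USairpol.data and USairpol.data.reduced objects created by the code above, and it happens to use cor=T; which variables to include in the PCA, which outliers to remove, and whether to use the correlation matrix are all choices you should justify in your write-up.  The object names usair.pc.full and usair.pc.reduced are just placeholders.

usair.pc.full <- princomp(USairpol.data, cor=T)              # PCA of the full data
usair.pc.reduced <- princomp(USairpol.data.reduced, cor=T)   # PCA after removing the worst outlier

summary(usair.pc.full, loadings=T)
summary(usair.pc.reduced, loadings=T)   # compare these with the robust-correlation results in Table 3.5 of the text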