STAT 530, Fall 2018
--------------------

Homework 3
-----------

#### Only graduate students are required to do problem 5 below.  It is extra credit for undergraduates.
#### Undergraduates must do problems 1, 2, 3 and 4 below.
#### Graduate students must do all five problems.

R TIP:

The code:
M <- matrix(c(w,x,y,z), nrow=2, ncol=2, byrow=TRUE)

(where w,x,y,z are some numbers) will produce a 2-by-2 matrix M with entries:

[w  x
 y  z]

A 3-by-3 matrix can be obtained similarly by putting nine numbers separated by commas in c()
and using nrow=3, ncol=3.
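
For instance, a minimal sketch of the 3-by-3 case (the entries 1 through 9 are arbitrary values chosen only to show where each number lands):

```r
# Arbitrary illustrative entries; with byrow=TRUE the numbers fill row by row.
M3 <- matrix(c(1, 2, 3,
               4, 5, 6,
               7, 8, 9), nrow=3, ncol=3, byrow=TRUE)
M3          # prints the matrix
M3[2, 3]    # entry in row 2, column 3
```

Without byrow=TRUE, R fills the matrix column by column, which gives the transpose of the layout above.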


# 1)  Suppose our multivariate data have covariance matrix S =

[5  0  0
 0  9  0
 0  0  9]

a) Find the eigenvalues and eigenvectors of S.

HINT:  You can use the 'eigen' function in R, e.g.:
eigen(M)
gives the eigenvalues and eigenvectors of some matrix M.
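
As an illustration of what eigen() returns, here is a small sketch on a made-up symmetric matrix A (not one of the homework matrices):

```r
# A is a hypothetical symmetric 2-by-2 matrix used only for illustration.
A <- matrix(c(2, 1,
              1, 2), nrow=2, ncol=2, byrow=TRUE)
e <- eigen(A)
e$values     # eigenvalues, sorted from largest to smallest
e$vectors    # each COLUMN is a unit-length eigenvector

# Check the defining relation A v = lambda v for the first pair;
# the result should be zero up to rounding error.
v1 <- e$vectors[, 1]
A %*% v1 - e$values[1] * v1
```

Note that an eigenvector is only determined up to sign, so R may report v or -v.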

b) Determine all three principal components for such a data set, using a PCA based on S.
   Remember that a principal component is a linear combination of the original variables,
   so your answer should be in the form of linear combinations of X1, X2, and X3.

c) What can you say about the principal components associated with
   eigenvalues that are equal to each other?

HELPFUL NOTE:  In R, we can do a PCA where the covariance matrix is input rather than the 
data matrix using code such as:

my.pc <- princomp(covmat=my.S); summary(my.pc, loadings=TRUE)

where my.S is the covariance matrix.
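
To show how the output maps onto "linear combinations of the original variables", here is a sketch on a made-up covariance matrix S.demo (the names S.demo, pc.demo, L, and pc1.text are hypothetical and not part of the assignment):

```r
# S.demo is a hypothetical covariance matrix, not one from the problems.
S.demo <- matrix(c(4, 1,
                   1, 3), nrow=2, ncol=2, byrow=TRUE)
pc.demo <- princomp(covmat=S.demo)
summary(pc.demo, loadings=TRUE)

# The PC variances are the eigenvalues of S.demo.
pc.var <- unname(pc.demo$sdev^2)
pc.var
eigen(S.demo)$values

# Column j of the loadings matrix holds the weights on X1, X2
# for the j-th principal component.
L <- unclass(pc.demo$loadings)
pc1.text <- paste0("Y1 = ", round(L[1, 1], 3), "*X1 + ", round(L[2, 1], 3), "*X2")
pc1.text
```

By default the printed loadings leave blank any entry smaller than 0.1 in absolute value; unclass() as above shows every coefficient.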


# 2)  Suppose a multivariate data set has sample covariance matrix S =

[36  5
  5  4]

a) Determine both principal components for such a data set, using a PCA based on S.
   (Refer to the HELPFUL NOTE from problem 1 and the R TIP above.)
   Remember that a principal component is a linear combination of the original variables,
   so your answer should be in the form of linear combinations of X1 and X2.

b) Determine the correlation matrix R that corresponds to the covariance matrix S.

c) Determine both principal components for such a data set, using a PCA based on R.
Are the PCA results different from those in part (a)?  If so, try to explain why they are different.

HELPFUL NOTE 2:  We can do a PCA where the correlation matrix is input rather than the 
data matrix using code such as:

my.pc <- princomp(covmat=my.R); summary(my.pc, loadings=TRUE)

where my.R is the correlation matrix.
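
The conversion from a covariance matrix to a correlation matrix divides each entry s_ij by sqrt(s_ii * s_jj). A minimal sketch, on a made-up matrix S.demo rather than the S from this problem (all object names here are hypothetical):

```r
# S.demo is a hypothetical covariance matrix used only for illustration.
S.demo <- matrix(c(4, 1,
                   1, 3), nrow=2, ncol=2, byrow=TRUE)

# By hand: scale rows and columns by 1/sd, where sd = sqrt of the diagonal.
D.inv <- diag(1 / sqrt(diag(S.demo)))
R.byhand <- D.inv %*% S.demo %*% D.inv

# Built-in shortcut from the stats package.
R.demo <- cov2cor(S.demo)
R.demo

# PCA based on the correlation matrix.
pc.R <- princomp(covmat=R.demo)
summary(pc.R, loadings=TRUE)
```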


########################################################################################################
NOTE: For EACH of the following problems, also write (at least) a couple of paragraphs explaining the
choices you made in doing the PCA and explaining in words what substantive conclusions
about the data you can draw from the PCA.  You can use relevant results and
graphics to support your conclusions.
########################################################################################################

# 3) Problem 3.5 from the textbook asks you to perform and interpret a principal components analysis on
the given correlation matrix, which can be entered into R with the following code:

my.cor.mat <- matrix(c(1,.402,.396,.301,.305,.339,.340,
.402,1,.618,.150,.135,.206,.183,
.396,.618,1,.321,.289,.363,.345,
.301,.150,.321,1,.846,.759,.661,
.305,.135,.289,.846,1,.797,.800,
.339,.206,.363,.759,.797,1,.736,
.340,.183,.345,.661,.800,.736,1),
ncol=7, nrow=7, byrow=TRUE)

As mentioned in the book, the 7 variables are 'head length', 'head breadth', 'face breadth',
'left finger length', 'left forearm length', 'left foot length', 'height'.

Obtain the principal components (including choosing an appropriate number of PCs).
Also make an attempt to interpret your PCs.

See HELPFUL NOTE 2 and R TIP above, which will also help with this problem.
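
One way to look at how many PCs to keep is to examine the eigenvalues and the cumulative proportion of variance. The sketch below uses a made-up 3-by-3 correlation matrix R.demo (a hypothetical stand-in, not the matrix from Problem 3.5); looking for an "elbow" in the scree plot is a common rule of thumb, not a requirement of this assignment.

```r
# R.demo is a hypothetical correlation matrix used only for illustration.
R.demo <- matrix(c(1,  .6, .5,
                   .6, 1,  .4,
                   .5, .4, 1), nrow=3, ncol=3, byrow=TRUE)
pc.demo <- princomp(covmat=R.demo)

ev <- unname(pc.demo$sdev^2)     # eigenvalues = variances of the PCs
prop <- ev / sum(ev)             # proportion of total variance per PC
cum.prop <- cumsum(prop)         # cumulative proportion
round(rbind(ev, prop, cum.prop), 3)

# Scree plot: look for an "elbow" where the eigenvalues level off.
screeplot(pc.demo, type="lines", main="Scree plot (illustrative)")
```

For a correlation matrix the eigenvalues sum to the number of variables, so an eigenvalue of 1 corresponds to the variance of a single standardized variable.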

# 4) Perform an appropriate PCA on the "Contents of Foodstuffs" data set (in Table 3.6), 
including the appropriate display of PC scores.  Also make an attempt to interpret your PCs.

*The "Contents of Foodstuffs" data set (in Table 3.6) is given on the course web page.
Full descriptions of the observation names are on p. 63 of the book.

This R code will read in the data:

food.full <- read.table("http://www.stat.sc.edu/~hitchcock/foodstuffs.txt", header=TRUE)
food.labels <- as.character(food.full[,1])   # first column: observation labels
food.data <- food.full[,-1]                  # remaining columns: the variables for the PCA
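
As a sketch of the general workflow (PCA on a data frame, then a labeled plot of the first two PC scores), the code below uses R's built-in USArrests data as a stand-in so that it runs without the course file. The object names demo.data, demo.labels, demo.pc, and sc are hypothetical, and whether to set cor=TRUE is a choice you should justify for your own data.

```r
demo.data <- USArrests                 # stand-in data set, NOT the foodstuffs data
demo.labels <- rownames(demo.data)

# cor=TRUE bases the PCA on the correlation matrix (standardized variables).
demo.pc <- princomp(demo.data, cor=TRUE)
summary(demo.pc, loadings=TRUE)

# Plot the scores on the first two PCs, labeling each observation.
sc <- demo.pc$scores
plot(sc[, 1], sc[, 2], type="n", xlab="PC 1", ylab="PC 2")
text(sc[, 1], sc[, 2], labels=demo.labels, cex=0.7)
```

For the foodstuffs data, food.data and food.labels would play the roles of demo.data and demo.labels.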


# 5) (Only required for graduate students)

*The Chapter 3 U.S. air pollution data set (different from the Chapter 2 air pollution data)
is given on the course web page.  This R code will read in the data:

USairpol.full <- read.table("http://www.stat.sc.edu/~hitchcock/usair.txt", header=TRUE)
city.names <- as.character(USairpol.full[,1])   # first column: city names
USairpol.data <- USairpol.full[,-1]             # remaining columns: the variables for the PCA

Perform an appropriate PCA on the Chapter 3 U.S. air pollution data set (in Table 3.1),
including the appropriate display of PC scores.  Identify any notable outliers.

Perform another PCA after removing one or two of the most severe outliers.  Comment on any differences
between these results and the PCA results for the full data set.

# To remove one or two outlying observations, try code such as:

out.row1 <-        # Put the row number of the most severe outlier here (after the arrow)
USairpol.data.reduced <- USairpol.data[-out.row1,]
out.row2 <-        # Put the row number of another outlier here (after the arrow)
USairpol.data.reduced.more <- USairpol.data[-c(out.row1,out.row2),]
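
To see the mechanics of the comparison, here is a sketch on the built-in USArrests data as a stand-in. Picking the point farthest from the origin in the PC 1 / PC 2 plane is only one hypothetical way to flag a candidate outlier; for the homework, judge outliers from your own plots. All object names below are illustrative.

```r
demo.data <- USArrests                      # stand-in, NOT the air pollution data
pc.full <- princomp(demo.data, cor=TRUE)

# Distance of each observation from the origin in the first two PC scores.
d <- sqrt(rowSums(pc.full$scores[, 1:2]^2))
out.row <- unname(which.max(d))             # row number of the farthest point
rownames(demo.data)[out.row]

# Refit without that row and compare loadings and variances.
demo.reduced <- demo.data[-out.row, ]
pc.reduced <- princomp(demo.reduced, cor=TRUE)
round(unclass(pc.full$loadings)[, 1:2], 3)
round(unclass(pc.reduced$loadings)[, 1:2], 3)
rbind(full=pc.full$sdev^2, reduced=pc.reduced$sdev^2)
```

A sign flip of an entire column of loadings between the two fits is not a real difference, since each PC is only determined up to sign.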