STAT 530, Fall 2018
--------------------

Homework 5
-----------

Do the following problems. The first three are required for everyone; the fourth is required for graduate students and optional (extra credit) for undergraduates.

IMPORTANT NOTE: For these problems, in particular the clustering problems, also write several sentences explaining in words what substantive conclusions about the data you can draw from the plots and/or analyses.

PROBLEM 1:
---------------

Do both a hierarchical clustering and a partitioning clustering of the tennis racquet data on the course web page. For each clustering, you may pick your favorite specific approach. Give the partitions of racquets into clusters, give some plot(s) to visualize the cluster structure, and make an attempt to characterize the clusters.

The racquet data can be read in with the following code:

racq.data <- read.table("http://people.stat.sc.edu/hitchcock/racquetsdata530.txt", header=TRUE)
racquet.names <- as.character(racq.data[,1])
racquet.numeric.data <- racq.data[,-1]

The variables in the tennis racquets data set are:

X1 = length of racquet (in inches)
X2 = static weight (in ounces) = how much the racquet actually weighs on a scale
X3 = balance (in inches) = a measure of whether the racquet is heavier on the head end or on the handle end; more negative values indicate a more head-heavy racquet; positive values indicate a more head-light racquet; zero indicates an even balance
X4 = swingweight = a complicated measure of how heavy the racquet FEELS when it is swung
X5 = headsize (in square inches) = the size of the racquet face (the strung area)
X6 = beamwidth (in mm) = the width of the cross-section (edge) of the racquet

PROBLEM 2:
---------------

Use linear discriminant analysis (LDA) to build a rule for classifying the Bumpus bird data into two groups ("survived" and "died") based on the 5 numerical measurements. Assume equal prior probabilities of surviving and dying.
The Bumpus bird data (along with a survival/death indicator vector) can be read in using the following R code:

bumpbird <- read.table("http://people.stat.sc.edu/hitchcock/bumpusbird.txt", header=TRUE)
names(bumpbird) <- c("ID", "tot.length", "alar.length", "beak.head.length", "humerus.length", "keel.stern.length")
attach(bumpbird)
bumpbird.numeric <- bumpbird[,-1]
bumpbird.IDs <- bumpbird[,1]
survival.indicator <- as.factor(c(rep("survived",times=21),rep("died",times=28)))

(a) Use the LDA rule to predict the survival status for a hypothetical bird with:

tot.length=156, alar.length=242, beak.head.length=31.4, humerus.length=18.1, keel.stern.length=19.4

Give the probability of surviving for such a bird.

(b) Find the plug-in misclassification rate and the cross-validation misclassification rate for the LDA classification rule.

PROBLEM 3:
---------------

Do Problem 7.4 in the Everitt textbook to perform supervised classification on the Skulls data we looked at in class, but use the CLASSIFICATION TREE approach: obtain the classification tree for the Skulls data (show the plot of the tree) and classify into an Epoch the new skull with the measurements given in Problem 7.4 (MB=133.0, BH=130.0, BL=95.0, NH=50.0). You may assume equal prior probabilities of being in each category.

Recall: The Skulls data set was studied in the Chapter 7 examples we did in class with R.

GRADUATE STUDENT PROBLEM:
(This is optional (extra credit) for undergraduate students.)

Do Problem 6.3 in the book: Do a model-based clustering of the pottery data set.

NOTE: Display the clustering result for the number of clusters that BIC suggests, and also give the result for the best 3-cluster solution. Which do you prefer?

The pottery data can be read in with the following code:

pottfull <- read.table("http://people.stat.sc.edu/hitchcock/potteryTable63.txt", header=TRUE)
attach(pottfull)
pott <- pottfull[,-c(1,2)]

NOTE: The racquet data and the pottery data in Table 6.3 are given on the course web page.
*** Read Section 6.3 for some insight into the variables in the pottery data set. Also note that "No" (number) and "Kiln" are simply labeling variables and should NOT be included in the cluster analysis algorithm.
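
HINT for Problem 2(b): The difference between the plug-in and the cross-validation misclassification rates can be illustrated with a minimal sketch. It is not a solution: it uses two species from R's built-in iris data rather than the Bumpus data, it assumes the MASS package (for lda()) is installed, and all object names (two.species, grp, plugin.rate, cv.rate, new.obs) and the new observation's values are made up for illustration.

```r
library(MASS)  # provides lda(); assumed to be installed

# Illustration data (NOT the homework data): two species from the built-in iris set.
two.species <- droplevels(iris[iris$Species != "setosa", ])
x <- two.species[, 1:4]
grp <- two.species$Species

# Fit LDA with equal prior probabilities for the two groups.
fit <- lda(x, grouping = grp, prior = c(0.5, 0.5))

# Plug-in (resubstitution) rate: classify the same observations used to build the rule.
plugin.class <- predict(fit, x)$class
plugin.rate <- mean(plugin.class != grp)

# Leave-one-out cross-validation rate: with CV = TRUE, each observation is
# classified by a rule fitted without that observation.
cv.fit <- lda(x, grouping = grp, prior = c(0.5, 0.5), CV = TRUE)
cv.rate <- mean(cv.fit$class != grp)

print(c(plug.in = plugin.rate, cross.validation = cv.rate))

# Predicted group and posterior probabilities for one new observation
# (hypothetical measurements chosen only for this illustration).
new.obs <- data.frame(Sepal.Length = 6.0, Sepal.Width = 2.9,
                      Petal.Length = 4.8, Petal.Width = 1.6)
new.pred <- predict(fit, newdata = new.obs)
print(new.pred$class)
print(round(new.pred$posterior, 3))
```

The plug-in rate reuses the training observations and so tends to be optimistic; the cross-validation rate is usually the more honest estimate of how the rule would perform on new cases.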