STAT 530, Fall 2024
--------------------

Homework 6
-----------

PROBLEMS 1, 2, and 3 are required for everyone. Problem 4 is required for graduate students and extra credit for undergraduates.

PROBLEM 1:
---------------

Use linear discriminant analysis (LDA) to build a classification rule for classifying the Bumpus bird data into two groups ("survived" and "died") based on the 5 numerical measurements.

The Bumpus bird data (along with a survival/death indicator vector) can be read in using the following R code:

bumpbird <- read.table("http://people.stat.sc.edu/hitchcock/bumpusbird.txt", header=TRUE)
names(bumpbird) <- c("ID", "tot.length", "alar.length", "beak.head.length", "humerus.length", "keel.stern.length")
attach(bumpbird)
bumpbird.numeric <- bumpbird[,-1]
bumpbird.IDs <- bumpbird[,1]
survival.indicator <- as.factor(c(rep("survived",times=21),rep("died",times=28)))

(a) Use the LDA rule to predict the survival status for a hypothetical bird with:

tot.length=156, alar.length=242, beak.head.length=31.4, humerus.length=18.1, keel.stern.length=19.4

For part (a), assume equal prior probabilities of surviving and dying. Give the probability of surviving for such a bird.

(b) Find the plug-in misclassification rate and the cross-validation misclassification rate for the LDA classification rule from part (a).

(c) Use the LDA rule to predict the survival status for a hypothetical bird with:

tot.length=156, alar.length=242, beak.head.length=31.4, humerus.length=18.1, keel.stern.length=19.4

For part (c), use the default prior probabilities, which equal the sample proportions of birds surviving and dying. Give the probability of surviving for such a bird.

(d) Find the plug-in misclassification rate and the cross-validation misclassification rate for the LDA classification rule from part (c). How do these compare to the rates that you found in part (b)?
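
Hint for Problem 1: the following is only a minimal sketch, run on made-up toy data (NOT the Bumpus data), of how priors, posterior probabilities, and the two misclassification rates might be obtained with the lda function in the MASS package. All object names (toy.x, toy.group, toy.lda, etc.) are hypothetical and chosen for illustration; the in-class examples may do this differently.

```r
library(MASS)  # provides lda()

set.seed(1)

# Toy data (hypothetical): two groups, two numerical measurements
toy.x <- rbind(matrix(rnorm(40, mean=0), ncol=2),
               matrix(rnorm(60, mean=1.5), ncol=2))
colnames(toy.x) <- c("x1", "x2")
toy.group <- as.factor(c(rep("A",times=20), rep("B",times=30)))

# Equal priors; the prior vector follows the order of levels(toy.group).
# Omitting 'prior' makes lda() use the sample proportions instead.
toy.lda <- lda(toy.x, grouping=toy.group, prior=c(0.5, 0.5))

# Predicted class and posterior probabilities for one new observation
toy.new <- rbind(c(x1=0.8, x2=0.7))
toy.pred <- predict(toy.lda, newdata=toy.new)
toy.pred$class
toy.pred$posterior

# Plug-in (resubstitution) misclassification rate
plugin.rate <- mean(predict(toy.lda, newdata=toy.x)$class != toy.group)

# Leave-one-out cross-validation misclassification rate (CV=TRUE)
toy.cv <- lda(toy.x, grouping=toy.group, prior=c(0.5, 0.5), CV=TRUE)
cv.rate <- mean(toy.cv$class != toy.group)

plugin.rate
cv.rate
```

Note that the columns of the posterior matrix are labeled by group level, so check levels() of your grouping factor before reading off a probability.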

PROBLEM 2:
---------------

(a) Use the CLASSIFICATION TREE approach on the Egyptian Skulls data in the Chapter 7 in-class R examples to obtain the classification tree (show the plot of the tree) and classify into an Epoch the new skull with the following measurements:

MB = 133.0, BH = 130.0, BL = 95.0, NH = 50.0

You may assume equal prior probabilities of being in each category.

(b) Use the random forest approach to do the same classification as in part (a). Comment on any similarities and/or differences between your conclusions in part (a) and in part (b). What does the random forest approach tell you about the relative importance of the four predictors in the classification?

(c) Use this code (PUTTING IN A VALUE for k) to do the same classification as in part (a), using the K-nearest neighbors approach.

newobs <- rbind( c(133,130,95,50) )
dimnames(newobs) <- list(NULL,c('MB','BH', 'BL', 'NH'))
newobs <- data.frame(newobs)
library(class)
knn.pred = knn(train = skulls[,-1], test = newobs, cl = skulls[,1], prob=TRUE, k= )

Use k=3, k=5, and k=9. Explain how (if at all) your classification results change for the different values of K.

PROBLEM 3:
---------------

*The CHAPTER 3 U.S. air pollution data set (from Chapter 3, DIFFERENT from the Chapters 1-2 air pollution data) is given on the course web page. This R code will read in the data:

USairpol.full <- read.table("http://people.stat.sc.edu/hitchcock/usair.txt", header=TRUE)
city.names <- as.character(USairpol.full[,1])
USairpol.data <- USairpol.full[,-1]
USairpol.data$Temp <- (-USairpol.data$Temp)
attach(USairpol.data)

*These are the descriptions of the variables in the data set. Each is measured on 41 U.S. cities.

SO2=sulphur dioxide content of air (a measure of air pollution)
Temp=average annual temperature in degrees F
Manuf=number of manufacturing enterprises employing 20 or more workers
Pop=Population size (1970 census) in thousands
Wind=Average annual wind speed in miles per hour
Precip=Average annual precipitation in inches
Days=Average number of days with precipitation per year

(a) Use a regression tree approach with SO2 as the dependent (response) variable and the other variables as independent (explanatory) variables. (You can use the default settings of the 'rpart' function.) Show the plot of the tree. Based on the tree, which seem to be the most important explanatory variables for predicting sulphur dioxide content? Use the regression tree to predict the SO2 for a city with

Temp=60, Manuf=390, Pop=500, Wind=8.5, Precip=45, Days=110

(b) Use the random forest approach to do the same prediction as in part (a). Comment on any similarities and/or differences between your conclusions in part (a) and in part (b). What does the random forest approach tell you about the relative importance of the six predictors in the prediction?

(c) Use K-nearest neighbors regression to do the same prediction as in part (a). Try a variety of values of K, such as 3, 5, and 10, and report how the predicted sulphur dioxide value changes. For each choice of K, provide a plot of Y-hat values vs. Y values and use these plots to comment on what your favored choice of K might be. Comment on any similarities and/or differences between your conclusions in parts (a), (b), and (c).

PROBLEM 4: REQUIRED for GRADUATE STUDENTS, EXTRA CREDIT for UNDERGRADS
--------------------------------------------------------------------------

Use the Sudden Infant Death Syndrome (SIDS) data given on the course web page. The "Group" variable designates 49 healthy, surviving infants (Group = 1) and 16 infants who were SIDS victims (Group = 2).
The predictor variables were Heart Rate; Birthweight; Factor68 (a measurement based on recorded electrocardiograms and respiratory movements); and Gestational Age.

The data can be read into R by:

sidsdata <- read.table("http://people.stat.sc.edu/hitchcock/SIDSdata.txt", header=TRUE)
attach(sidsdata)

Use the Support Vector Machine approach to classify a new baby with

HR = 100, BW = 3000, Factor68 = 0.3, Gesage = 40

into either the healthy group or the SIDS group. Comment on any choices of tuning parameters, settings, etc. that you used.
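
Hint for Problem 4: the following is only a minimal sketch, run on made-up toy data (NOT the SIDS data), of where the tuning choices enter a support vector machine fit. It assumes the e1071 package is installed and uses its svm and tune functions; the object names (toy, toy.svm, toy.tune, etc.) and the particular kernel, cost, and gamma values are hypothetical choices for illustration, not recommended settings.

```r
library(e1071)  # assumed installed; provides svm() and tune()

set.seed(1)

# Toy data (hypothetical): two predictors and a group label coded 1/2
toy <- data.frame(x1 = c(rnorm(30, mean=0), rnorm(20, mean=2)),
                  x2 = c(rnorm(30, mean=0), rnorm(20, mean=2)),
                  grp = c(rep(1,times=30), rep(2,times=20)))

# The response must be a factor for svm() to do classification;
# with a numeric response it fits a regression instead.
toy$grp <- as.factor(toy$grp)

# Tuning choices: kernel type, cost, and (for the radial kernel) gamma
toy.svm <- svm(grp ~ x1 + x2, data=toy, kernel="radial", cost=1, gamma=0.5)

# Classify one new observation
toy.new <- data.frame(x1=1.0, x2=1.2)
toy.class <- predict(toy.svm, newdata=toy.new)
toy.class

# Optional: compare a small grid of cost/gamma values by cross-validation
toy.tune <- tune(svm, grp ~ x1 + x2, data=toy,
                 ranges=list(cost=c(0.1, 1, 10), gamma=c(0.1, 0.5, 1)))
toy.tune$best.parameters
```

Whatever settings you end up using, state them and briefly say why you chose them.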