STAT 530, Fall 2024
--------------------

Homework 7
-----------

ALL students should do all of the following problems:

IMPORTANT NOTE: For EACH of these problems, write a couple of sentences explaining in words what substantive conclusions about the data you can draw from the plots and/or analyses.

PROBLEM 1:
---------------

Use Hotelling's T^2 test and the data in the test score data set (scores on math and reading tests given to a sample of girls and a sample of boys) to test for a difference between the mean score vector of the boys and the mean score vector of the girls. The following R code will read in the data:

testdata <- read.table("http://people.stat.sc.edu/hitchcock/testscoredata.txt", header=TRUE)
attach(testdata)
testdata.noIDs <- testdata[,-1]  # to remove the ID numbers

PROBLEM 2:
---------------

Consider the 'hsb' data set that we have studied in class. Suppose our goal is to compare the mean vectors (where the variables are the scores on: read, write, math, science, socst) among the different levels of 'ses' (high, middle, and low socioeconomic classes).

hsb <- read.table("http://people.stat.sc.edu/hitchcock/hsbdata.txt", header=TRUE)
attach(hsb)
hsb.prob4 <- hsb[,c(5,8,9,10,11,12)]
hsb.numeric <- hsb[,c(8,9,10,11,12)]

###############################################

(a) Conduct the MANOVA F-test using Wilks' Lambda to test for a difference in (read, write, math, science, socst) mean vectors across the three ses classes. Use a 0.05 significance level, and give the P-value of the test.

(b) Check whether the assumptions of your test are met. Do you believe your inference is valid?

(c) Examine the sample mean vectors for each group. Informally comment on the differences among the groups in terms of the specific variables.
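For Problem 2(a), the general shape of a Wilks' Lambda MANOVA in R can be sketched on simulated data as shown here. This is only an illustration under stated assumptions: the data frame `sim`, its columns `y1` and `y2`, the group means, and the sample sizes are all made up and are NOT the hsb data. With only two groups, the same test is equivalent to the two-sample Hotelling's T^2 test of Problem 1.

```r
# Illustration only: simulated scores, NOT the hsb data.
set.seed(530)
n.per.group <- 40
group <- factor(rep(c("low", "middle", "high"), each = n.per.group))

# Two made-up response variables whose means shift with the group
shift <- rep(c(0, 3, 6), each = n.per.group)
sim <- data.frame(
  group = group,
  y1 = 50 + shift + rnorm(3 * n.per.group, sd = 8),
  y2 = 52 + shift + rnorm(3 * n.per.group, sd = 8)
)

# One-way MANOVA: bind the responses into a matrix on the left-hand side
fit <- manova(cbind(y1, y2) ~ group, data = sim)

# Wilks' Lambda test of equal mean vectors across the groups
wilks.table <- summary(fit, test = "Wilks")
print(wilks.table)

# P-value for the group effect, to compare with the significance level
p.value <- wilks.table$stats["group", "Pr(>F)"]

# Sample mean vector for each group (one row per group)
group.means <- aggregate(cbind(y1, y2) ~ group, data = sim, FUN = mean)
print(group.means)
```

The same pattern should carry over to real data by replacing `cbind(y1, y2)` with the matrix of response columns and `group` with the grouping factor.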
PROBLEM 3:
----------------------------

Read in the complete 2015 baseball data set (with all variables, not just the ones we studied in class) with this code:

bat2015 <- read.csv("http://www.stat.sc.edu/~hitchcock/baseball2015batting.txt", header=TRUE)
pitch2015 <- read.csv("http://www.stat.sc.edu/~hitchcock/baseball2015pitching.txt", header=TRUE)
baseball2015 <- merge(bat2015, pitch2015, by="Tm")

Suppose a team's manager and general manager are interested in controllable variables that have an effect on certain key outcome variables such as: "Runs Per Game" (they want this to be large); "Runs Allowed Per Game" (they want this to be small); and "Team Winning Percentage" (they want this to be large). Response variables associated with these outcomes for the 2015 baseball data can be read in with the code below.

They identified five strategies/tactics that they could consider using, and they want to see whether these have an effect on the outcome variables:

- They could build a roster of older (or younger) batters.
- They could build a roster of older (or younger) pitchers.
- They could use aggressive/risky baserunning tactics, which would lead to lots of stolen bases.
- They could tell their pitchers to pitch carefully, to avoid giving up bases on balls (walks).
- They could tell their pitchers to pitch aggressively, to earn lots of strikeouts.

Predictor variables associated with these tactics for the 2015 baseball data can be read in with the code below.

# Response Variables:
y1 <- baseball2015$RG     # Runs Per Game
y2 <- baseball2015$RAG    # Runs Allowed Per Game
y3 <- baseball2015$WLP    # Team Winning Percentage

# Predictor Variables:
x1 <- baseball2015$BatAge # Average age of team's batters
x2 <- baseball2015$PAge   # Average age of team's pitchers
x3 <- baseball2015$SB     # Total stolen bases by team
x4 <- baseball2015$BB.y   # Total bases on balls (walks) given up by team's pitchers
x5 <- baseball2015$SO.y   # Total strikeouts earned by team's pitchers

(a) Build a multivariate regression model that relates the three response variables to the five predictor variables. Give a summary of the estimated Beta matrix of regression coefficients. Based on the estimated coefficients (and their corresponding test statistics / P-values), which predictor variables seem to have an important effect on which response variables? Write a few sentences characterizing the nature of the effect of these predictor variable(s) on the response variable(s).

(b) Suppose the managers decide to build a relatively older roster of batters and a young roster of pitchers, and they choose relatively aggressive baserunning and pitching tactics. Use the multivariate regression model to predict the "Runs Per Game", "Runs Allowed Per Game", and "Team Winning Percentage" if the average age of batters = 29, average age of pitchers = 26, total stolen bases = 120, total bases on balls = 550, and total strikeouts = 1400. You can just report point predictions for the three outcome variables (a single number for each outcome variable). Based on your prediction, will this strategy pay off in success for the team?

EXTRA CREDIT for UNDERGRADS, REQUIRED for GRADUATE STUDENTS:

- Suppose the team wanted to see whether a simpler model, using only the two "pitching tactics" variables (x4 and x5), would be sufficient to predict the three outcome variables well. Perform a formal test to determine whether the simpler model is sufficient, or whether the more complex model is needed.
- Use plots of the residual vectors to verify that the model assumptions are satisfied for this data set.
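
The steps in Problem 3 (fitting a multivariate regression, predicting at a new point, testing a reduced model against a fuller one, and examining residual vectors) can be sketched on simulated data as shown here. This is a minimal illustration, not a solution: the data frame `sim`, the variables in it, the coefficient values, and the new observation `new.obs` are all made up and are NOT the baseball data. The chi-square plot of squared Mahalanobis distances is one common residual check, chosen here as an assumption about what "plots of the residual vectors" might include.

```r
# Illustration only: simulated data, NOT the baseball data.
set.seed(2015)
n <- 30
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)

# Two made-up responses; x3 has no true effect on either one
y1 <- 1 + 2 * x1 - x2 + rnorm(n)
y2 <- 3 - x1 + 0.5 * x2 + rnorm(n)
sim <- data.frame(y1, y2, x1, x2, x3)

# Multivariate regression: a matrix of responses on the left-hand side
full <- lm(cbind(y1, y2) ~ x1 + x2 + x3, data = sim)
B.hat <- coef(full)    # rows: intercept and predictors; columns: responses
print(summary(full))   # per-response t statistics and P-values

# Point predictions for one hypothetical new observation
new.obs <- data.frame(x1 = 0.5, x2 = -1, x3 = 0)
pred <- predict(full, newdata = new.obs)
print(pred)

# Reduced model that drops x3, then a Wilks' Lambda test of reduced vs. full
reduced <- lm(cbind(y1, y2) ~ x1 + x2, data = sim)
nested.test <- anova(reduced, full, test = "Wilks")
print(nested.test)

# Residual vectors (one row per observation, one column per response)
res <- resid(full)

# Squared Mahalanobis distances of the residual vectors, to be compared
# with chi-square quantiles (df = number of responses)
d2 <- mahalanobis(res, center = colMeans(res), cov = cov(res))
chisq.q <- qchisq(ppoints(n), df = ncol(res))

# Draw the plots only in an interactive session
if (interactive()) {
  plot(chisq.q, sort(d2), xlab = "Chi-square quantiles",
       ylab = "Ordered squared distances")
  abline(0, 1)
  plot(fitted(full)[, 1], res[, 1], xlab = "Fitted y1", ylab = "Residual y1")
}
```

A small P-value in the second row of `nested.test` would suggest the dropped predictor is needed; a large one would suggest the reduced model is adequate.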