STAT 541 Homework 6

NOTE:  You MUST intersperse comments (lines that start with * and end with ; or lines that 
start with /* and end with */) in your code to explain 
what your SAS statements are supposed to be doing.  Please be generous with your 
comments, since you will be graded not only on the correctness of the code, but partially 
on the clarity of comments.

NOTE:  Submit your solution code via Blackboard (see the course web page for instructions).  Please save your work as a 
plain text file (e.g., a .txt file) and then submit that file in Blackboard.

NOTE: PLEASE put WITHIN COMMENTS any text in your file that is not actual SAS code (e.g., problem numbers, 
problem descriptions, your personal comments, or output/results).  
This will make your file easier and faster to grade.  The grader should be able to copy and paste your entire file into SAS and 
have it run correctly.

1. The following problem uses the sashelp.baseball data set.

(a) Calculate and report the mean number of Home Runs (nHome) for all the observations in the 
entire data set.  Call this number the "population mean" Home Runs.

(b) Write a program that will automatically select the first 50 odd-numbered observations in the data set
(i.e., the 1st, 3rd, 5th, ..., 99th observations).  Calculate and report the sample mean Home Runs (nHome)
for this selected systematic sample.

(c) Write a program that will automatically select the first 50 even-numbered observations in the data set
(i.e., the 2nd, 4th, 6th, ..., 100th observations).  Calculate and report the sample mean Home Runs (nHome)
for this selected systematic sample.
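
A minimal sketch of one way to pick out alternating observations, shown on sashelp.class (not the assignment data set).
It assumes that the automatic variable _N_ and the OBS= data set option are an acceptable approach here; the data set
name odd_rows is made up for illustration.

/* Read only the first 10 observations, then keep the odd-numbered ones (1st, 3rd, ..., 9th) */
data odd_rows;
  set sashelp.class(obs=10);
  if mod(_n_, 2) = 1;   /* subsetting IF: _N_ counts the observations read so far */
run;

/* Report the mean of Height for the 5 selected observations */
proc means data=odd_rows mean;
  var Height;
run;

Changing the remainder tested in the MOD function from 1 to 0 keeps the even-numbered observations instead.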

(d) Write a program that will randomly select 50 observations (WITH replacement) from the data set.
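
One possible starting point is PROC SURVEYSELECT, sketched here on sashelp.class with a smaller sample size.  The output
data set name samp_wr and the seed value are arbitrary choices for illustration.

/* Draw 5 observations WITH replacement (METHOD=URS is unrestricted random sampling) */
/* OUTHITS writes one row per selection, so an observation chosen twice appears twice */
proc surveyselect data=sashelp.class out=samp_wr
     method=urs n=5 outhits seed=541;
run;

Omitting the SEED= option lets SAS take the seed from the system clock, so each run gives a different sample.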

(e)  Do this sampling with replacement 10 times, each time calculating the sample mean Home Runs (nHome) 
for the random sample.  Report the 10 sample mean values you get.
(NOTE:  If you want to, you can automate the repeated sample-taking process (try it if you can!), but you don't have to.  You could simply 
run the sample-taking section of your code 10 times and keep track of the 10 sample mean values you get.)
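
If you do try to automate the repetition, one option (an assumption about approach, not a requirement) is the REPS= option
of PROC SURVEYSELECT, which adds a variable named Replicate to the output data set.  The sketch below uses sashelp.class
and a made-up data set name, reps_wr.

/* Draw 10 independent samples of size 5, with replacement, in a single step */
proc surveyselect data=sashelp.class out=reps_wr
     method=urs n=5 reps=10 outhits noprint;
run;

/* One sample mean of Height per replicate */
proc means data=reps_wr mean;
  by Replicate;
  var Height;
run;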

(f) Write a program that will randomly select 50 observations (WITHOUT replacement) from the data set.
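
For sampling without replacement, one possible sketch uses simple random sampling (METHOD=SRS) in PROC SURVEYSELECT.
It is shown on sashelp.class with a smaller sample size; the output data set name samp_wor is made up for illustration.

/* Draw 5 distinct observations (METHOD=SRS is simple random sampling, without replacement) */
proc surveyselect data=sashelp.class out=samp_wor
     method=srs n=5;
run;

With METHOD=SRS, the requested sample size cannot exceed the number of observations in the input data set.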

(g)  Do this sampling without replacement 10 times, each time calculating the sample mean Home Runs (nHome) 
for the random sample.  Report the 10 sample mean values you get.
(NOTE:  If you want to, you can automate the repeated sample-taking process (try it if you can!), but you don't have to.  You could simply 
run the sample-taking section of your code 10 times and keep track of the 10 sample mean values you get.)

(h) Which set of sample means (the set from (e) or the set from (g)) appears to be closer to 
the "population mean" from (a)?  
[There is not necessarily a "right" answer to this ... just report what you observe.]

2.  Use the following code that we used in class to create a SAS data set called 'mycities' that keeps the unique
U.S. city/state combinations in the sashelp.zipcode data set:

proc sort data=sashelp.zipcode nodupkey out=mycities;  /* Keeping only unique city-state observations */
by statecode city;
run;

(a) Write SAS code that will find and print out the 50 cities with the largest PROPORTION of vowels (a,e,i,o,u) 
in the city's name.  The proportion of vowels is the number of vowels in the city's name divided by the total 
number of letters in the city's name.  (In all cases, treat "y" as a consonant, not a vowel.)

(b) Write SAS code that will find and print out the 50 cities with the smallest PROPORTION of vowels (a,e,i,o,u) 
in the city's name.  (In all cases, treat "y" as a consonant, not a vowel.)
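
As a hedged illustration of the character-counting idea, the sketch below computes a vowel proportion for the Name variable
in sashelp.class (not the assignment data set).  The variable and data set names (n_vowels, n_letters, prop_vowels,
name_vowels) are made up, and COUNTC and COMPRESS are only one of several ways to do the counting.

data name_vowels;
  set sashelp.class(keep=Name);
  n_vowels  = countc(Name, 'aeiou', 'i');        /* 'i' modifier: ignore case */
  n_letters = lengthn(compress(Name, , 'ka'));   /* 'ka' modifiers: keep only alphabetic characters */
  if n_letters > 0 then prop_vowels = n_vowels / n_letters;   /* avoid dividing by zero */
run;

/* Largest proportions first */
proc sort data=name_vowels;
  by descending prop_vowels;
run;

/* Print only the first 5 rows of the sorted data set */
proc print data=name_vowels(obs=5);
run;

Sorting in ascending order instead puts the smallest proportions first; note that SAS sorts missing values ahead of all
nonmissing values in ascending order.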

(c) Write SAS code that will find and print all the cities (along with their statecodes) whose 
statecode (postal abbreviation) appears in the city's name.  The search should be case-insensitive (ignoring 
uppercase vs. lowercase).  For example, the city of Scranton, South Carolina, should appear in the list, because 
"SC" appears in the city name "Scranton" (ignoring case).

(d) In how many cities does the statecode appear in the city's name more than once, as separate occurrences?  [Hint:  You should not 
have to manually hunt through the list from part (c) to answer this.]  Of all the cities in the U.S.,
which one has its state's postal abbreviation appear in its name the most times?
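
A small sketch of case-insensitive substring searching, using a few typed-in rows rather than the mycities data set.
"Calcal" is a made-up city name used only to show a repeated match, and the variable names has_code and n_hits are
illustrative.  The FIND and COUNT functions are one possible choice, not the only one.

data demo;
  length city $ 20 statecode $ 2;
  input city $ statecode $;
  has_code = (find(city, statecode, 'i') > 0);   /* 1 if the statecode occurs anywhere in the name, ignoring case */
  n_hits   = count(city, statecode, 'i');        /* number of non-overlapping occurrences, ignoring case */
  datalines;
Scranton SC
Columbia SC
Calcal CA
;
run;

/* Rows with the most occurrences first */
proc sort data=demo;
  by descending n_hits;
run;

proc print data=demo;
run;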