Stat 518 - Fall 1999 - SAS Templates

Class Notes from 9/2/99
Homework Three Notes
Homework Four Notes
Homework Five Notes
Homework Six Notes
Homework Seven Notes
Homework Eight Notes


The Basics of SAS:

Hitting the [F3] key will run the program currently in the Program Editor window.

This will, however, erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window, and hit the [F4] key.

If you keep running more programs, SAS will keep adding their results to the end of the Output window. To clear the Output window, make sure you are in that window, and choose "Clear text" under the Edit menu.

The following is the SAS code for analyzing the SAT data presented in the August 31st, 1999 New York Times:


OPTIONS pagesize=60 linesize=80;

DATA sat1;
INPUT state $ verbal math pct;
LABEL state = "State"
   verbal = "Average Verbal Subtest Score"
   math = "Average Math Subtest Score"
   pct = "Percent of Students Taking Test";
CARDS;
Ala. 561 555 9
Alaska 516 514 50
Ariz. 524 525 34
Ark. 563 556 6
Calif. 497 514 49
Colo. 536 540 32
Conn. 510 509 80
Dela. 503 497 67
D.C. 494 478 77
Fla. 499 498 53
Ga. 487 482 63
Hawaii 482 513 52
Idaho 542 540 16
Ill. 569 585 12
Ind. 496 498 60
Iowa 594 598 5
Kan. 578 576 9
Ky. 547 547 12
La. 561 558 8
Maine 507 507 68
Md. 507 511 65
Mass. 511 511 78
Mich. 557 565 11
Minn. 586 598 9
Miss. 563 548 4
Mo. 572 572 8
Mont. 545 546 21
Neb. 568 571 8
Nev. 512 517 34
N.H. 520 518 72
N.J. 498 510 80
N.M. 549 542 12
N.Y. 495 502 76
N.C. 493 493 61
N.D. 594 605 5
Ohio 534 538 25
Okla. 576 560 8
Ore. 525 525 53
Pa. 498 495 70
R.I. 504 499 70
S.C. 479 475 61
S.D. 585 588 4
Tenn. 559 553 13
Texas 494 499 50
Utah 570 565 5
Vt. 514 506 70
Va. 508 499 65
Wash. 525 526 52
W.Va. 527 512 18
Wis. 584 595 7
Wyo. 546 551 10
;
Note that _most_ lines end with a semi-colon, but not all (the lines of data after "CARDS;" do not). SAS will not run the program correctly if you miss one, but usually the Log window will tell you where the problem is.

The "OPTIONS" line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the "Options" menu along the top of the screen. When you cut and paste from SAS to a word processor, the font Courier New works well.

The "DATA" line defines what the name of the data set is. The name must be eight characters or less, with no spaces, and only letters, numbers, and underscores. It must start with a letter. The "INPUT" line gives the names of the variables, and they must be in the order that the data will be entered. The $ after "state" on the "INPUT" line means that the variable "state" is qualitative instead of quantitative.

If we hit F3 at this point to enter what we put above, nothing new will appear on the output screen. This is no big surprise however once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.




PROC PRINT data=sat1;
TITLE "September 1, 1999 - SAT Report";
RUN;
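
As a quick check that SAS read the variables the way we intended, something like the following sketch can be run once the data set sat1 exists. PROC CONTENTS lists each variable with its type (character or numeric) and its label, so "state" should show up as character and the other three as numeric.

```sas
PROC CONTENTS DATA=sat1;
RUN;
```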
The basic method for getting a summary of the data is to use PROC UNIVARIATE.


PROC UNIVARIATE DATA=sat1 PLOT FREQ ;

VAR pct ;
TITLE 'Summary of the Percent of Students Taking the SAT';
RUN;
The "VAR" line says which of the variables you want a summary of. Note that there are many different definitions of percentile, and the exact value may not be the same as we saw how to calculate in class.
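
As a sketch of how the percentile definition can be changed, PROC UNIVARIATE accepts a PCTLDEF= option that takes a value from 1 to 5. The choice of 4 below is only an illustration; which definition (if any) matches the method from class is an assumption you would need to check against your notes.

```sas
PROC UNIVARIATE DATA=sat1 PCTLDEF=4;
VAR pct;
TITLE 'Percentiles of pct Using Percentile Definition 4';
RUN;
```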

PROC INSIGHT allows many of these analyses, as well as many more advanced analyses and nicer graphs. While it is possible to change the definitions of the percentiles in PROC UNIVARIATE, you cannot do so in the current editions of PROC INSIGHT.


PROC INSIGHT; 

OPEN sat1;
DIST pct;
RUN;

You can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the right mouse button to select it. You can then cut and paste like normal. Clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet.

One very useful ability in PROC INSIGHT is the ability to make new variables from the old ones. This is done by going to the "Edit" menu, and selecting "Variables", and then "Other...". This can be used for example to make the z-scores.
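
If you would rather make the z-scores outside of PROC INSIGHT, one possible sketch uses PROC STANDARD. The output data set name satz is just a name chosen for this illustration. Note that in satz the variable pct itself is replaced by its standardized value (mean 0, standard deviation 1), so sat1 is left untouched.

```sas
PROC STANDARD DATA=sat1 MEAN=0 STD=1 OUT=satz;
VAR pct;
RUN;

PROC PRINT DATA=satz;
RUN;
```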

There are a variety of seemingly commonsense procedures that you would think SAS would be good at. Unfortunately it either hides them well or doesn't do them for some reason. Luckily we can program SAS to do some of these. The following example finds a confidence interval for the variance. The function CINV looks up the value on the chi-square table that goes with the percentage and the degrees of freedom you give it. The 0.05 in the problem is the alpha from (1-alpha)*100%.


PROC MEANS NOPRINT DATA=sat1;

VAR verbal;
OUTPUT OUT=temp STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP var n alpha cilow cihigh;
INPUT alpha;
var = sd*sd;
df = n - 1;
cilow = (n-1)*(var)/CINV(1-(alpha/2),df);
cihigh = (n-1)*(var)/CINV(alpha/2,df);
CARDS;
.05
;
PROC PRINT data=temp2;
RUN;
The mean could also have been kept on the "OUTPUT" line using "MEAN=xbar" for example. To make a confidence interval using a t or normal distribution, the functions to "look up the values in the table" would have been TINV or PROBIT respectively. To get the p-value to form a test of hypothesis, we could use the functions: PROBCHI for chi-square, PROBT for t, and PROBNORM for normal.
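
A minimal sketch of the t-based interval for the mean, following the same pattern as the variance example, might look like the following. The data set names tempm and tci and the variable names tcrit, tlow, and thigh are made up for this illustration, and alpha is set inside the DATA step instead of being read after CARDS.

```sas
PROC MEANS NOPRINT DATA=sat1;
VAR verbal;
OUTPUT OUT=tempm MEAN=xbar STD=sd N=n;
RUN;

DATA tci;
SET tempm;
KEEP xbar sd n alpha tlow thigh;
alpha = 0.05;
df = n - 1;
tcrit = TINV(1-(alpha/2),df);      /* t table value with alpha/2 in the upper tail */
tlow = xbar - tcrit*sd/SQRT(n);
thigh = xbar + tcrit*sd/SQRT(n);
RUN;

PROC PRINT DATA=tci;
RUN;
```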

To quit, simply choose Exit in the File menu for each program, and use CTRL+ALT+DEL to logoff the machine.


Notes on Homework Three:

Page 133, #2: SAS performs many of the more complicated nonparametric procedures automatically using PROC NPAR1WAY. As might be expected, however, it doesn't do many of the simpler ones automatically. The following code will perform the "<" binomial test for the data in Example 1 on page 127.
 
DATA temp;

INPUT y n p;
pval = PROBBNML(p,n,y);
CARDS;
3 19 0.5
;

PROC PRINT data=temp;
RUN;

The code to do the alternative hypothesis > would be to replace the current pval line with "pval=1-PROBBNML(p,n,y-1)". The code to do the two sided test is somewhat more problematic as discussed in class, but could be approximated by using "pval=2*min(PROBBNML(p,n,y),1-PROBBNML(p,n,y-1))".
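
The three versions can also be collected in one DATA step, as in the sketch below. The data set name bintest and the variable names pless, pgreater, and ptwoside are chosen only for this illustration. Two small additions that are not in the formulas above: the y=0 case is handled separately because PROBBNML cannot be asked for -1 successes, and the doubled p-value is capped at 1.

```sas
DATA bintest;
INPUT y n p;
pless = PROBBNML(p,n,y);                          /* P(Y <= y) */
IF y > 0 THEN pgreater = 1 - PROBBNML(p,n,y-1);   /* P(Y >= y) */
ELSE pgreater = 1;                                /* P(Y >= 0) is always 1 */
ptwoside = MIN(1, 2*MIN(pless,pgreater));         /* approximate two-sided value */
CARDS;
3 19 0.5
;

PROC PRINT DATA=bintest;
RUN;
```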

Page 133, #4: The code for calculating the confidence intervals is very similar to what is used on the Splus page. Unlike the S-Plus version, however, this is not a function, and we have to re-enter the whole thing every time we want to run it. The example below uses y=3, n=19, and makes a data set temp2 that contains the 95% confidence interval.


DATA temp2;

INPUT y n alpha;
l = y / (y+ (n - y + 1) * FINV(1 - alpha/2,2*(n - y + 1),2*y));
u = 1 - ((n-y) / (n - y + (y + 1) * FINV( 1 - alpha/2,2*(y+1),2*(n - y))));
CARDS;
3 19 0.05
;
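
One way to avoid retyping the confidence interval code each time is to wrap it in a SAS macro. The following is only a sketch: the macro name binci, its parameter names, and the data set name ci are made up for this illustration. It uses the same two formulas as above, so it will not work when y=0 or y=n, because FINV needs positive degrees of freedom.

```sas
%MACRO binci(y=, n=, alpha=0.05);
DATA ci;
y = &y;
n = &n;
alpha = &alpha;
l = y / (y + (n - y + 1) * FINV(1 - alpha/2,2*(n - y + 1),2*y));
u = 1 - ((n-y) / (n - y + (y + 1) * FINV(1 - alpha/2,2*(y+1),2*(n - y))));
RUN;

PROC PRINT DATA=ci;
RUN;
%MEND binci;

%binci(y=3, n=19)
%binci(y=3, n=19, alpha=0.10)
```

The first call gives the 95% interval from the example, and the second gives a 90% interval for the same data.
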
Page 164, #2: If we actually had all of the data, then we could use PROC INSIGHT to get the result. First use _Edit_ _Variables_ _Other_ to form the differences of the two variables, and then use the sign test option under _Tables_ _Location Tests_. If we are simply given the counts, then we would enter the correct choice of y and n into the code above for the binomial test with p=0.5.


Notes on Homework Four:

The easiest way to get a confidence interval for the mean, or a q-q plot, is to use PROC INSIGHT. Say we wanted to get a 90% CI for the PCT in the SAT data at the top of this page, and a q-q plot to check whether the data look normal. First we would have to enter the data (copying in the first batch of code at the top of this web page). Secondly we would have to start up PROC INSIGHT (the fourth set of code at the top of this web page). Once PROC INSIGHT is started, go to the "Tables" menu, go down to the "C.I for Mean" choice, and select the percentage you want. The interval should appear at the bottom of the INSIGHT window, and you can cut the box out and paste it into Microsoft Word. To get the q-q plot, go to the "Graphs" menu and select "QQ Plot...". The default setting is normal, so just hit ok. To add the line to the plot, go to the "Curves" menu and select "QQ Ref Line...". Again the default setting of "Least Squares" turns out to be what we want, so just hit ok again. You can also put this graph in Microsoft Word simply by cutting and pasting it over.

PROC INSIGHT will also do a two-sided t-test for the mean. This is the "Location Tests" option under the "Tables" menu. The parameter line is where you put the value given by the null hypothesis. For some reason SAS doesn't automatically do the one sided tests. What we are doing below are the steps needed to test that the mean pct is 52:

1) Calculate the MEAN, SD, and get the number of values for the variable pct in the data set sat1.
2) Output these values into a data set called temp, because we'll only use it temporarily
3) Using the data set temp, and the mean entered after the cards statement, calculate t = (xbar-mu)/(sd/sqrt(n))
4) Calculate the p-value for the three different alternative hypotheses. The function probt(t,df) calculates the area less than t in a t-distribution with df degrees of freedom
5) Put this information in another temporary data set called temp2, and print it out

 

PROC MEANS NOPRINT DATA=sat1;

VAR pct;
OUTPUT OUT=temp MEAN=xbar STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP xbar mu sd n t pgreater pless ptwoside;
INPUT mu;
t = (xbar-mu)/(sd/sqrt(n));
df = n - 1;
pgreater = 1 - probt(t,df);
pless = probt(t,df);
ptwoside = 2*MIN(1-probt(t,df),probt(t,df));
cards;
52
;
PROC PRINT;
RUN;
Similarly, SAS does not have a built-in way of performing the quantile test or of forming the corresponding confidence interval, and it is not nearly as easy to program as S-Plus. PROC SORT simply puts the data in sat1 in order from smallest to biggest pct value, and saves the result as a new data set called satrnk. The second batch of code takes satrnk and makes a new data set with two indicator variables: "loreq" is 1 for each state with pct less than or equal to 52 (and 0 otherwise), and "l" is 1 if pct is less than 52. No loop over the observations is needed, because the DATA step already works through the data set one observation at a time. The third set of code then just adds these up and makes the data set ts that contains T1, T2, and n. The final portion calculates the p-values for the given "pstar" and prints them out.

PROC SORT DATA=sat1 OUT=satrnk;

BY pct;
RUN;
 
DATA sattmp;
SET satrnk;
KEEP loreq l pct;
loreq = (pct<=52);
l = (pct<52);
RUN;
 
PROC UNIVARIATE DATA=sattmp NOPRINT;
VAR loreq l;
OUTPUT OUT=ts SUM= T1 T2 N=n;
RUN;
 
DATA pvals;
SET ts;
KEEP T1 T2 n pstar greater less twoside;
INPUT pstar;
greater = PROBBNML(pstar,n,T1);
less = 1 - PROBBNML(pstar,n,T2-1);
twoside = 2*MIN(PROBBNML(pstar,n,T1),1 - PROBBNML(pstar,n,T2-1));
CARDS;
0.5
;
PROC PRINT;
RUN;

As SAS does not have a function to look up the quantiles of the binomial, we would need to write one to get SAS to give us the confidence interval in the manner that the book does for small sample sizes. It is possible however to get SAS to give the r and s values using formulas (21) or (22) on page 44, and the actual alpha level that the confidence interval will have. CEIL is the function that rounds a number up, and PROBIT gives the quantile from the normal table. However, it then requires a bunch of manipulation to get SAS to tell you what the rth and sth value of the data set are... so in this case it's probably easier to do the last bit by hand. The code below will print out the data in order, and tell you which values to take for the large sample approximation, and then you could do the last step by hand.


PROC SORT DATA=sat1 OUT=satrnk;

BY pct;
RUN;
PROC PRINT;
RUN;

DATA findrs;
INPUT pstar n conflvl;
alpha=1-conflvl;
rstar=CEIL(n*pstar+PROBIT(alpha/2)*SQRT(n*pstar*(1-pstar)));
sstar=CEIL(n*pstar+PROBIT(1-(alpha/2))*SQRT(n*pstar*(1-pstar)));
actual=PROBBNML(pstar,n,sstar-1)-PROBBNML(pstar,n,rstar-1);
CARDS;
.50 51 .9
;
PROC PRINT;
RUN;
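
If you would rather have SAS pick out the rth and sth ordered values itself, one possible sketch is given below. It assumes that satrnk (the sorted data) and findrs have both just been created by the code above, and that the n typed in after CARDS really is the number of observations in satrnk. The data set name rsvals is made up for this illustration.

```sas
DATA rsvals;
IF _N_ = 1 THEN SET findrs;      /* read rstar and sstar once; SAS retains them */
SET satrnk;                      /* then read the sorted data one row at a time */
KEEP state pct rstar sstar;
IF _N_ = rstar OR _N_ = sstar;   /* keep only the rth and sth ordered rows */
RUN;

PROC PRINT DATA=rsvals;
RUN;
```

The two pct values printed are then the lower and upper ends of the approximate interval.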


Notes on Homework Five:

1) Luckily SAS does have built-in functions to conduct the t-test, sign test, and signed-rank test. They can all be run in PROC INSIGHT. Say we wanted to test whether the average state verbal scores were greater than the average state math scores. We first need to make the variable with the differences. Click on the spreadsheet, then go to the "Edit" menu, choose "Variables", and in that menu take "Other...". We want to take the difference, so first click on Y-X in the Transformation list. Then pick which variable you want to be X, click on that, and then click on the X box. Do the same for Y, and then hit ok. A new variable should have appeared in the spreadsheet.

Once we have the difference in the spreadsheet, go up to the "Analyze" menu, select "Distribution (Y)", and choose the difference variable for the Y and hit "OK". We could now get the Q-Q plot for the differences, or make the confidence intervals using the selections in the various menus. To do the hypothesis tests, choose "Location Tests..." under the "Tables" menu. Here simply check all the appropriate boxes, pick which value you are testing the median/mean equals, and click ok. This will add the three tests to the INSIGHT graphics window. Note that the p-values given here are for the two-sided test. To get the one-sided p-value simply go through the procedure we discussed in class.
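
The same three tests can also be obtained without PROC INSIGHT, as in the following sketch. The data set name satdiff and the variable name diff are chosen just for this illustration. PROC UNIVARIATE reports a t-test, a sign test, and a signed-rank test of whether the center of diff is 0 (the exact labels in the output vary between SAS releases), and these p-values are again two-sided. To test a value other than 0, subtract that value when forming diff.

```sas
DATA satdiff;
SET sat1;
diff = verbal - math;
RUN;

PROC UNIVARIATE DATA=satdiff NORMAL PLOT;
VAR diff;
RUN;
```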

4) SAS also has the Mann-Whitney-Wilcoxon Rank-sum test built in. To use this you have to use PROC NPAR1WAY. The code below will perform the test for the data in problem 5 on page 287. Note that we have to enter the data in a slightly different manner, saying which group each value is in.


DATA gundata;

INPUT group $ scores;
CARDS;
A 96
A 93
A 88
A 85
B 89
B 83
B 80
B 77
;

PROC NPAR1WAY WILCOXON DATA=gundata;
CLASS group;
VAR scores;
RUN;
Two things to note here are that this is the two-sided p-value, and that this value is from the normal approximation.
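
With samples this small the normal approximation may be rough. In SAS releases whose PROC NPAR1WAY supports the EXACT statement (this is an assumption about the version you are running, so check your documentation), a sketch like the following requests the exact p-value as well.

```sas
PROC NPAR1WAY WILCOXON DATA=gundata;
CLASS group;
VAR scores;
EXACT WILCOXON;
RUN;
```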


Notes on Homework Six:

2) SAS has a built-in procedure for determining various types of correlations: PROC CORR. Let's say we wanted to find the various types of correlations between the SAT verbal and math scores given in the data set at the top of this page. Note that it may give slightly different values than the methods in the book due to differing ways of dealing with ties.


PROC CORR DATA=sat1 KENDALL SPEARMAN;

VAR verbal math;
RUN;
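
Once KENDALL or SPEARMAN is asked for, PROC CORR no longer prints the usual Pearson correlation unless it is requested too. If you want all three side by side, a sketch would be:

```sas
PROC CORR DATA=sat1 PEARSON KENDALL SPEARMAN;
VAR verbal math;
RUN;
```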

4) PROC NPAR1WAY will conduct the Kruskal-Wallis test in the same way that it does the Mann-Whitney-Wilcoxon Rank-Sum test. Enter the data just as in problem 4 of homework 5, except that you will have more groups. The command for running PROC NPAR1WAY does not change at all (you still use the option WILCOXON).

PROC ANOVA can conduct the ANOVA for this type of data, and the following code could be used. The dollar sign is used to indicate that the group is a name and not necessarily a number. The following code does the work for example 1 on page 291.


DATA examp;

INPUT group $ value @@;
CARDS;
1 83 2 91 3 101 4 78
1 91 2 90 3 100 4 82
1 94 2 81 3 91 4 81
1 89 2 83 3 93 4 77
1 89 2 84 3 96 4 79
1 96 2 83 3 95 4 81
1 91 2 88 3 94 4 80
1 92 2 91 4 81 1 90
2 89 2 84
;
PROC ANOVA DATA=examp;
CLASS group;
MODEL value=group;
RUN;
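
For comparison, the Kruskal-Wallis test on this same four-group data set can be sketched as follows. It assumes the data set examp from the DATA step just above is still the current one (it has to be run before examp is re-used for another example). With more than two groups, the WILCOXON output includes the Kruskal-Wallis chi-square approximation.

```sas
PROC NPAR1WAY WILCOXON DATA=examp;
CLASS group;
VAR value;
RUN;
```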

5) To get SAS to do Friedman's test you need to use PROC FREQ. The following code works through example 1 on page 372. For PROC FREQ to work correctly, the first variable must be the blocking variable, the second must be the treatment, and the third must be the observed value. Note that the values entered here are already the ranks within each block. In the output, we are looking at the alternate hypothesis that "Row Mean Scores Differ".

DATA examp;

INPUT home $ grass $ value;
CARDS;
1 1 4
1 2 3
1 3 2
1 4 1
2 1 4
2 2 2
2 3 3
2 4 1
3 1 3
3 2 1.5
3 3 1.5
3 4 4
4 1 3
4 2 1
4 3 2
4 4 4
5 1 4
5 2 2
5 3 1
5 4 3
6 1 2
6 2 2
6 3 2
6 4 4
7 1 1
7 2 3
7 3 2
7 4 4
8 1 2
8 2 4
8 3 1
8 4 3
9 1 3.5
9 2 1
9 3 2
9 4 3.5
10 1 4
10 2 1
10 3 3
10 4 2
11 1 4
11 2 2
11 3 3
11 4 1
12 1 3.5
12 2 1
12 3 2
12 4 3.5
;

PROC FREQ DATA=examp;
TABLES home*grass*value / noprint cmh;
RUN;
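
If the raw measurements were entered instead of the ranks, PROC FREQ can be asked to do the ranking within each block by adding SCORES=RANK. The sketch below applies it to the same examp data; since those values are already within-block ranks, it should give the same "Row Mean Scores Differ" statistic as the code above. CMH2 simply limits the output to the first two Cochran-Mantel-Haenszel statistics.

```sas
PROC FREQ DATA=examp;
TABLES home*grass*value / NOPRINT CMH2 SCORES=RANK;
RUN;
```
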
For comparison, the same data can be run through PROC ANOVA. In the PROC ANOVA output, you want to look at the row that has the name of the treatment variable (grass here).

PROC ANOVA DATA=examp;

CLASS home grass;
MODEL value = home grass;
RUN;


Notes on Homework Seven:

Chi-squared tests: The following code will work Example 1 on page 202.


DATA privpub;

INPUT type $ score $ count;
CARDS;
priv a_low 6
priv b_medlow 14
priv c_medhi 17
priv d_hi 9
pub a_low 30
pub b_medlow 32
pub c_medhi 17
pub d_hi 3
;

PROC FREQ DATA=privpub;
WEIGHT count;
TABLES type*score /chisq expected nopercent;
RUN;
Note that section 4.1 of the text on page 180 calls it the chi-squared test, but actually calculates a standard normal statistic instead. If you use the above code to replicate Example 1 on Page 183 you would get a statistic of 1.6116 (the square of the text's -1.2695) and a p-value of 0.2043.

McNemar's test: McNemar's test is a special case of another test (the Cochran Test of section 4.6), and the SAS manual describes a rather roundabout method for getting it performed. If you have already determined what the values in the table should be, it is easier just to put in the two corner cells (b and c) and do a chi-square test for equal proportions. The following code will perform McNemar's test on the data in example 1 on page 168.


DATA party;

INPUT cell $ count;
CARDS;
b 21
c 4
;

PROC FREQ DATA=party;
WEIGHT count;
TABLES cell / chisq expected nopercent;
RUN;
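
As a check on the PROC FREQ output, the same statistic can be computed directly in a DATA step. This sketch assumes the usual form of the McNemar statistic, (b-c)^2/(b+c), compared to a chi-square with 1 degree of freedom; the data set name mcnemar and the variable names t and pval are made up for this illustration. For b=21 and c=4 the statistic works out to 289/25 = 11.56.

```sas
DATA mcnemar;
INPUT b c;
t = (b - c)**2 / (b + c);      /* McNemar statistic */
pval = 1 - PROBCHI(t,1);       /* upper tail of chi-square with 1 df */
CARDS;
21 4
;

PROC PRINT DATA=mcnemar;
RUN;
```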


Notes on Homework Eight:

Page 238, #1: It may be just as easy to do this one by hand... (Note that for the definitions of the phi-coefficient and Yule's statistics that we have, we only want to do this on a 2x2 table.) By adding the option MEASURES on the TABLES line along with CHISQ you will get several measures of association. The following code is for the data in Example 7 on page 236.

    
DATA ex7pg236;
INPUT wear $ dead $ count;
CARDS;
yes yes 7
yes no 89
no yes 24
no no 122
;

PROC FREQ DATA=ex7pg236;
 WEIGHT count;
 TABLES wear*dead / chisq expected nopercent measures;
RUN;

The Chi-square value (T in the text's notation), Cramer's Coefficient, and the Phi Coefficient are all labeled in that way. The odds ratio is on the line labeled Case-Control in the section on Estimates of Relative Risk. Yule's Q is on the line labeled Gamma. It is probably easiest to calculate the coefficient of colligation by hand.
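
If you would rather let SAS do the "by hand" arithmetic, a DATA step sketch is given below. It assumes the usual 2x2 definitions with cells a, b in the first row and c, d in the second: phi = (ad-bc)/sqrt of the product of the four margins, Yule's Q = (ad-bc)/(ad+bc), and the coefficient of colligation Y = (sqrt(ad)-sqrt(bc))/(sqrt(ad)+sqrt(bc)). Check these against the definitions in the text before relying on them. The data set and variable names are made up for this illustration.

```sas
DATA assoc;
INPUT a b c d;
phi = (a*d - b*c) / SQRT((a+b)*(c+d)*(a+c)*(b+d));
q = (a*d - b*c) / (a*d + b*c);                               /* Yule's Q */
y = (SQRT(a*d) - SQRT(b*c)) / (SQRT(a*d) + SQRT(b*c));       /* coefficient of colligation */
CARDS;
7 89 24 122
;

PROC PRINT DATA=assoc;
RUN;
```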