Stat 518 - Fall 2007 - SAS Templates

Class Notes from 8/28/07
One Sample Location Tests
Section 3.5: McNemar Test


The Basics of SAS:

SAS is one of the most widely used statistical and database management programs. A one-year license can be obtained from University Technology Services (1244 Blossom Street) for $60. Bring your own CD-Rs, because they charge $10 per CD if you don't!

SAS uses three main windows: the Log window, the Program Editor window, and the Output window. The Log and Program Editor windows are the two on the screen when you start up the program. The Output window isn't visible yet because you haven't created anything to output. If you happen to lose one of these windows, look for its bar at the bottom of the SAS window. You can also find them under the View menu.

The Program Editor is where you tell SAS what you want done. The Output window is where it puts the results, and the Log window is where it tells you what it did and whether there were any errors. It is important to note that the Output window often gets very long! You usually want to copy the parts you want to print into MS-Word and print from there. It is also important to note that you should check the Log window every time you run anything. The errors will appear in maroon. Successful runs appear in blue.

Hitting the [F3] key at the top of your keyboard will run the program currently in the Program Editor window. You can also run programs by clicking on the little image of the runner in the list of symbols near the top of the SAS program screen.

In older editions of SAS, running the program will erase whatever was written in the Program Editor window. To recall whatever was there, make sure you are in that window, and hit the [F4] key.

If you keep running more programs, SAS will keep adding all of the results to the Output window. To clear the Output window, make sure you are in that window, and choose Clear text under the Edit menu.

If you happen to lose a window, try looking under the View menu.

One way of entering data into SAS is in a data step. The following is the SAS code for analyzing the SAT data presented in the August 31st, 1999 New York Times:

OPTIONS pagesize=60 linesize=80;

DATA sat1;
INPUT state $  verbal math pct;
LABEL state = "State"
   verbal = "Average Verbal Subtest Score"
   math = "Average Math Subtest Score"
   pct = "Percent of Students Taking Test"; 
CARDS;
        Ala.    561  555    9
        Alaska  516  514   50
        Ariz.   524  525   34
        Ark.    563  556    6
        Calif.  497  514   49
        Colo.   536  540   32
        Conn.   510  509   80
        Dela.   503  497   67
        D.C.    494  478   77
        Fla.    499  498   53
        Ga.     487  482   63
        Hawaii  482  513   52
        Idaho   542  540   16
        Ill.    569  585   12
        Ind.    496  498   60
        Iowa    594  598    5
        Kan.    578  576    9
        Ky.     547  547   12
        La.     561  558    8
        Maine   507  507   68
        Md.     507  511   65
        Mass.   511  511   78
        Mich.   557  565   11
        Minn.   586  598    9
        Miss.   563  548    4
        Mo.     572  572    8
        Mont.   545  546   21
        Neb.    568  571    8
        Nev.    512  517   34
        N.H.    520  518   72
        N.J.    498  510   80
        N.M.    549  542   12
        N.Y.    495  502   76
        N.C.    493  493   61
        N.D.    594  605    5
        Ohio    534  538   25
        Okla.   576  560    8
        Ore.    525  525   53
        Pa.     498  495   70
        R.I.    504  499   70
        S.C.    479  475   61
        S.D.    585  588    4
        Tenn.   559  553   13
        Texas   494  499   50
        Utah    570  565    5
        Vt.     514  506   70
        Va.     508  499   65
        Wash.   525  526   52
        W.Va.   527  512   18
        Wis.    584  595    7
        Wyo.    546  551   10
;

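Note that _most_ lines end with a semi-colon, but not all: each SAS statement does (a statement can run over several lines, like LABEL above), while the lines of data after CARDS do not. SAS will report errors instead of running properly if you miss one, but usually the Log window will tell you where the problem is.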

The OPTIONS line only needs to be used once during a session. It sets the length of the page and the length of the lines for viewing on the screen and printing. The font can be set by using the Options sub-menu of the Tools menu along the top of the screen. When you cut and paste from SAS to a word processor, the font Courier New works well.

The DATA line gives the name of the data set. In older versions of SAS the name had to be eight characters or less; since version 7 it can be up to 32 characters. It may contain only letters, numbers, and underscores, with no spaces, and it must start with a letter or an underscore. The INPUT line gives the names of the variables, and they must be in the order that the data will be entered. The $ after state on the INPUT line means that the variable state is a character (qualitative) variable instead of a numeric (quantitative) one.

If we hit F3 at this point to submit what we put above, nothing new will appear in the Output window. This is no big surprise, however, once we realize that we haven't told SAS to return any output! The code below simply tells SAS to print back out the data we entered.

PROC PRINT data=sat1;
TITLE "September 1, 1999 - SAT Report";
RUN;

The basic method for getting a summary of the data is to use PROC UNIVARIATE.

PROC UNIVARIATE DATA=sat1 PLOT FREQ ;
VAR pct ;
TITLE 'Summary of the Percent of Students Taking the SAT';
RUN;

The VAR line says which of the variables you want a summary of. Note that there are many different definitions of percentile, and the exact value may not be the same as we will calculate in class.
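
As a small sketch of how the percentile definition can be changed, PROC UNIVARIATE accepts a PCTLDEF= option that takes a value from 1 to 5 (5 is the default). The choice of 4 below is only for illustration; it is not necessarily the definition we will use in class.

```sas
/* Same summary as above, but with a different percentile definition. */
/* PCTLDEF=4 is an arbitrary choice for illustration; the default is 5. */
PROC UNIVARIATE DATA=sat1 PCTLDEF=4;
VAR pct;
TITLE 'Percent Taking the SAT - Alternate Percentile Definition';
RUN;
```

Comparing the quantiles from this run with the earlier output shows how much the definition matters for a data set of this size.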

PROC INSIGHT allows many of these analyses, as well as many more advanced analyses and nicer graphs.

PROC INSIGHT; 
OPEN sat1;
DIST pct;
RUN;

You can cut and paste the graphs from PROC INSIGHT right into Microsoft Word. Simply click on the border of the box you want to copy with the right mouse button to select it. You can then cut and paste as usual. Clicking on the arrow in the bottom corner of each of the boxes gives you options for adjusting the graph's format. In the box at the bottom left of the histogram, choose Ticks... and set Tick Increment to 5.

One very useful feature of PROC INSIGHT is the ability to make new variables from the old ones. This is done by going to the Edit menu, and selecting Variables, and then Other.... This can be used, for example, to make the z-scores. Click on pct and then Y. This will select the pct variable. Then scroll down in the Transformation: window to select ( Y - mean( Y )) / std( Y ). Now click OK and a new column containing the z-scores will be added to the spreadsheet.

Both the z-scores and the graphs will automatically change if you adjust one of the values in the spreadsheet. For example, replace the pct for Ala. with 900.

The graphics in PROC INSIGHT are interactive as well. Left-click on the dot at the far right of the dot plot and it will be identified as observation one. Double clicking on it will reveal additional information.

To quit PROC INSIGHT, click on the X in the upper right portion of the spreadsheet. Note that INSIGHT can also be started using the choice Solutions - Analysis - Interactive Data Analysis in SAS version 8.
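
The z-scores made in INSIGHT can also be produced in code. The following is a minimal sketch using PROC STANDARD; the data set name satz and the variable name zpct are made up for this example. PROC STANDARD overwrites the variable it standardizes, so pct is copied first.

```sas
/* Copy pct into a new variable so the original values are kept. */
/* satz and zpct are names invented for this example.            */
DATA satz;
SET sat1;
zpct = pct;
RUN;

/* Rescale zpct to have mean 0 and standard deviation 1. */
PROC STANDARD DATA=satz MEAN=0 STD=1 OUT=satz;
VAR zpct;
RUN;

PROC PRINT DATA=satz;
VAR state pct zpct;
RUN;
```

Unlike the INSIGHT spreadsheet, these z-scores do not update automatically; the code has to be rerun if the data change.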

The basic DATA step in SAS also gives a great deal of flexibility in manipulating the data set. For example:

DATA sat2;
SET sat1;
KEEP state pct pct2;
pct2=(pct/100)**2;
WHERE pct<50;
RUN;

PROC PRINT DATA=sat2;
RUN;

SET names the existing data set to read from (sat1 itself is not changed), KEEP says which variables to put in the new data set, the line with pct2= describes how to create that new variable, and WHERE specifies which observations to keep.

There are a variety of seemingly commonsense procedures that you would think SAS would be good at. Unfortunately it either hides them well or doesn't do them for some reason. Luckily we can program SAS to do some of these. The following example finds a confidence interval for the variance. The function CINV looks up the value on the chi-square table that goes with the percentage and the degrees of freedom you give it. The 0.05 in the problem is the alpha from a (1-alpha)*100% confidence interval.

PROC MEANS NOPRINT DATA=sat1;
VAR verbal;
OUTPUT OUT=temp STD=sd N=n;
RUN;
DATA temp2;
SET temp;
KEEP var n alpha cilow cihigh;
INPUT alpha;
var = sd*sd;
df = n - 1;
cilow = (n-1)*(var)/CINV(1-(alpha/2),df);
cihigh = (n-1)*(var)/CINV(alpha/2,df);
CARDS;
.05
;

PROC PRINT data=temp2;
RUN;

The mean could also have been kept on the OUTPUT line using MEAN=xbar, for example. To make a confidence interval using a t or normal distribution, the functions to "look up the values in the table" would have been TINV or PROBIT respectively. To get the p-value for a test of hypothesis, we could use the functions PROBCHI for chi-square, PROBT for t, and PROBNORM for normal. Each of these returns the probability at or below the value you give it, so an upper-tail p-value is 1 minus the result. (It should be noted that the newest version of INSIGHT does do the confidence interval for the variance under the Tables menu.)
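
As a sketch of how TINV and PROBT could be used in the same way, the code below builds a t confidence interval for the mean verbal score and a two-sided p-value. The data set names tstats and tci, and the null value mu0 = 500, are made up for illustration only.

```sas
/* Keep the mean along with the standard deviation and sample size. */
PROC MEANS NOPRINT DATA=sat1;
VAR verbal;
OUTPUT OUT=tstats MEAN=xbar STD=sd N=n;
RUN;

DATA tci;
SET tstats;
KEEP xbar sd n alpha cilow cihigh t pvalue;
alpha = 0.05;
df = n - 1;
tcrit = TINV(1-(alpha/2), df);         /* t table value            */
cilow  = xbar - tcrit*sd/SQRT(n);
cihigh = xbar + tcrit*sd/SQRT(n);
mu0 = 500;                             /* hypothetical null value  */
t = (xbar - mu0)/(sd/SQRT(n));
pvalue = 2*(1 - PROBT(ABS(t), df));    /* two-sided p-value        */
RUN;

PROC PRINT DATA=tci;
RUN;
```

Here alpha is set with an assignment statement instead of INPUT and CARDS; either approach gives a data set with a single row.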

To quit, simply choose Exit in the File menu for each program, and use CTRL+ALT+DEL to log off the machine.


One Sample Location Tests

The t-test, quantile test for p=0.5, and Wilcoxon Signed Rank test can be carried out in either PROC INSIGHT or PROC UNIVARIATE. The following code does the analysis for the data in example 1 on page 355:

DATA examp1;
INPUT first second;
d=second-first;
CARDS;
86 88
71 77
77 76
68 64
91 96
72 72
77 65
91 90
70 65
71 80
88 81
87 72
;

PROC UNIVARIATE DATA=examp1 LOCATION=0;
VAR d;
RUN;

The output is under the section called "Tests for Location". Notice that it calculates the sign (i.e. quantile with p=0.5) and signed rank tests slightly differently than either R or the textbook, and therefore the p-values will not match exactly. The confidence intervals for the t-test and the median (quantile=0.5) can be produced by adding these options to the PROC UNIVARIATE line: CIBASIC(ALPHA=0.05) CIPCTLDF(ALPHA=0.05). Unfortunately SAS has no easy way to make the confidence interval that goes with the Signed Rank test, and it is likely easier to just use R.
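
With the two confidence interval options added, the call would look something like the following sketch (the same data set and variable as above, with alpha set to 0.05 for both):

```sas
/* CIBASIC gives the normal-theory interval for the mean;          */
/* CIPCTLDF gives distribution-free intervals for the quantiles,   */
/* including the median.                                           */
PROC UNIVARIATE DATA=examp1 LOCATION=0 CIBASIC(ALPHA=0.05) CIPCTLDF(ALPHA=0.05);
VAR d;
RUN;
```

The interval for the median appears in the quantiles table, on the row for the 50% quantile.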


McNemar Test:

The McNemar test in section 3.5 can be carried out using PROC FREQ. The following code enters the data and performs the analysis for example 1 on page 168.

DATA election;
INPUT before $ after $ count;
CARDS;
Dem Dem 63
Dem Rep 21
Rep Dem 4 
Rep Rep 12
;
PROC FREQ DATA=election;
TABLE before*after;
EXACT MCNEM;
WEIGHT count;
RUN;

Note that it gives both the exact p-value (using T2 and the binomial test) and the large sample approximation (using T1 and the standard normal distribution, without any continuity correction).
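
To see where the two p-values come from, they can be worked out by hand in a DATA step. This is only a sketch: it assumes T1 is the usual McNemar statistic (b-c)**2/(b+c) built from the two off-diagonal counts, and that the exact test doubles the smaller binomial tail. The data set name mcnemar and the variable names are made up for this example.

```sas
DATA mcnemar;
b = 21;                              /* Dem before, Rep after        */
c = 4;                               /* Rep before, Dem after        */
t1 = (b - c)**2 / (b + c);           /* (21-4)**2/25 = 11.56         */
p_approx = 1 - PROBCHI(t1, 1);       /* chi-square with 1 df         */
/* Under the null, each switcher is equally likely to go either way, */
/* so the smaller count is Binomial(b+c, 0.5).                       */
p_exact = 2 * PROBBNML(0.5, b + c, MIN(b, c));
IF p_exact > 1 THEN p_exact = 1;
RUN;

PROC PRINT DATA=mcnemar;
RUN;
```

Referring T1 to the chi-square distribution with 1 degree of freedom gives the same p-value as a two-sided test using the standard normal distribution on the square root of T1.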