/* This is an example of the REG procedure in SAS */ /* This code will analyze data from a */ /* Simple Linear Regression (SLR) model */ /* The data given here are the lot size and work hours*/ /* from the example we studied in class */ /* I am calling the data set "toluca". */ /* The predictor variable is LotSize. */ /* The response variable is WorkHrs. */ data toluca; input LotSize WorkHrs; cards; 80 399 30 121 50 221 90 376 70 361 60 224 120 546 80 352 100 353 50 157 40 160 70 252 90 389 20 113 110 435 100 420 30 212 50 268 90 377 110 421 30 273 90 468 40 244 80 342 70 323 ; run; /* Let's do a scatter plot to see if a linear relationship is appropriate. */ /* Using PROC GPLOT gives a nice-looking plot of the data. */ /* This plot comes up in a separate window from the OUTPUT window. */ symbol1 v=circle l=32 c = black; PROC GPLOT DATA=toluca; PLOT WorkHrs*LotSize; RUN; /* The trend seems fairly linear, so we proceed in */ /* estimating the regression model. */ /* Let's use PROC REG to do estimate the slope and intercept of the linear model */ /* Let's also plot the estimated regression line over our scatterplot: */ /* The statement: MODEL WorkHrs = LotSize; tells SAS that */ /* the response variable is WorkHrs and the predictor variable is LotSize. */ PROC REG DATA = toluca; MODEL WorkHrs = LotSize/ clb; PLOT WorkHrs*LotSize='+' p.*LotSize='*' / overlay; OUTPUT OUT=NEW P=PRED; run; /* NOTE: */ /* 95% is the default confidence level */ /* to specify another confidence level, use (e.g., for 90%) */ * MODEL WorkHrs = LotSize/ clb alpha = 0.10; /*********************************************************************************/ /* So our least-squares regression line is Y-hat = 62.37 + 3.57 X */ /* How should we interpret these parameter estimates? */ /*********************************************************************************/ /* Our estimate of sigma^2, MSE, can be found in the ANOVA table as 2383.716. */ /* Note that our estimate of sigma, "Root MSE" as SAS calls it, is 48.823. */ /**********************************************************************************/ /* Find the 95% CI for beta_1 on the SAS output. */ /* Also the the p-value for the two-sided test for */ /* whether the true slope is zero. */ /* Compare the results in the SAS output to the results in the */ /* book (p. 45-47) for the Toluca Company example. */ /**********************************************************************************/ /* PROC GPLOT gives a nice plot of the points with the */ /* connected estimated regression line overlain on it. */ symbol1 v=circle l=32 c = black; symbol2 i = join v=star l=32 c = black; PROC GPLOT DATA=NEW; PLOT WorkHrs*LotSize PRED*LotSize/ OVERLAY; RUN; /**********************************************************************************/ /* Interval Estimation of E(Y_h) */ /* and Prediction Interval for a new Y_h */ /* Example 1: */ /* We want to estimate the mean number of work hours */ /* for lots of size 65 units, with a 90% CI. */ /* Example 2: */ /* We want to estimate the mean number of work hours */ /* for lots of size 100 units, with a 90% CI. */ /* Let's add the value X = 65 to our data set: */ /* We'll leave a missing value (.) for WorkHrs. */ /* And let's add the value X = 100 to our data set: */ /* We'll again leave a missing value (.) for WorkHrs. */ DATA Xvalues; INPUT LotSize WorkHrs; CARDS; 65 . 100 . ; DATA toluca; SET toluca Xvalues; ; /* The option clm will give us CIs for the mean of Y, */ /* for the values of X (the sizes) in the data set. */ /* the option alpha=0.10 tells SAS we want a 90% CI. */ PROC REG DATA = toluca; MODEL WorkHrs = LotSize / clm alpha = 0.10; RUN; /* Look at the last 2 lines of the "Output Statistics". */ /* These show the predictions and confidence intervals */ /* for the lots which were 65 units and 100 units. */ /* We see the fitted values for these observations were */ /* 294.4 hours and 419.4 hours, respectively. */ /* The 90% CI for mean number of work hours for all lots */ /* of size 65 units is between 277.4 and 311.4 hours. */ /* The 90% CI for mean number of work hours for all lots */ /* of size 100 units is between 394.9 and 443.9 hours. */ /* Finding Coefficient of Determination and */ /* Correlation Coefficient in SAS. */ /* R-square is given as part of the regression output. */ /* It is 0.8215. Note that r is simply the square root */ /* of R^2, with the same sign as the estimated slope (positive, */ /* here). So what is r here? */ /* We can also find the correlation coefficient in SAS using PROC CORR: */ PROC CORR DATA = toluca; VAR WorkHrs LotSize; run; /* So we see from the output that r is 0.90638.*/ /* ********************************************************************* */ /* MISCELLANEOUS */ /* How can we test whether the true slope is equal to some particular */ /* NONZERO value? We can use a TEST statement in PROC REG. */ /* For example, we can test H0: beta_1 = 3 in the Toluca data example */ /* against the TWO-sided alternative. */ /* In the TEST statement, we specify the name of the predictor variable */ /* and the hypothesized value of the slope: */ PROC REG DATA = toluca; MODEL WorkHrs = LotSize / clb; TEST LotSize = 3; RUN; /* Since the P-value is 0.1139, we would fail to reject the above H0 */ /* at, say, a 0.05 significance level. */ /* To find the t* value of this test (the test statistic), you must */ /* take the square root of the given "F-value", 2.70. */ /* So t* = 1.64. */