STAT 440/540: Data analysis example using techniques from Chapter 3
The data set we consider is the time-to-failure in minutes of 50 electrical components. Each component was manufactured using a ratio of two types of materials; this ratio was fixed at 0.1, 0.2, 0.3, 0.4, and 0.5. Ten components were observed to fail at each of these manufacturing ratios in a designed experiment. It is of interest to stochastically model the failure-time as a function of the ratio, to determine if a significant relationship exists, and if so to describe the relationship simply.
We start from the point of having the data loaded into Minitab worksheet columns labeled x and y. A scatterplot is obtained by choosing “Graph” and “Plot…” and by picking the appropriate x and y variables and clicking “OK.” The standard assumptions of linear mean function and constant variance in the model Yi = β0 + β1 Xi + εi with the εi ~ N(0, σ²) are immediately suspect:
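(If you want to reproduce the scatterplot outside Minitab, the following Python sketch does the same thing; the file name failures.csv is a placeholder for wherever the data are stored.)

import pandas as pd
import matplotlib.pyplot as plt

# a sketch; assumes the data sit in a CSV with columns named x and y
df = pd.read_csv("failures.csv")
plt.scatter(df["x"], df["y"])
plt.xlabel("ratio of materials (x)")
plt.ylabel("time to failure, minutes (y)")
plt.show()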
The error variance decreases as the predictor values increase, and we can use the Breusch-Pagan test to formally test the constant variance assumption. Recall that this test assumes the full model Yi = β0 + β1 Xi + εi, where the εi are independent normal random variables with mean zero and variance exp(γ0 + γ1 Xi). We test H0: γ1 = 0 in this model as follows. We click on the “Session” window and then check “Enable Commands” under “Editor.” We can now type Minitab commands directly. First we regress y on x, note the SSE, and store the raw residuals. We need the sum of squares regression, SSR*, from regressing the squared raw residuals ei^2 (from our fit of the standard regression model) on the Xi. The test statistic is given by (SSR*/2) / (SSE/n)^2 and has an approximate χ² distribution with 1 df when the null hypothesis holds. See page 115 for more detail. Here we regress y on x and store the raw residuals in a column called raw:
MTB > name c6 'raw' c7 'rawSQ'
MTB > regress y 1 x;
SUBC> residuals raw.
Analysis of Variance

Source          DF       SS       MS      F      P
Regression       1  1279453  1279453  29.88  0.000
Residual Error  48  2055067    42814
Total           49  3334520
Now we regress the squared raw residuals on x and note the SSR*:
MTB > let rawSQ = raw**2
MTB > regress rawSQ 1 x.
Analysis of Variance

Source          DF           SS           MS     F      P
Regression       1  89123428186  89123428186  8.50  0.005
Residual Error  48  5.03351E+11  10486480769
Total           49  5.92475E+11
Here we assign the constant k1 to be the Breusch-Pagan test statistic, find k2 = P(χ² ≤ k1), and take k3 = P(χ² > k1), the p-value for the test:
MTB > let k1 = (89123428186/2)/((2055067/50)**2)
MTB > CDF k1 k2;
SUBC> ChiSquare 1.
MTB > let k3 = 1 - k2
MTB > print k3
K3 0.000000281
The p-value for this test is zero to six decimal places, and we have extremely strong evidence of non-constant variance, assuming the other assumptions are met (they're not; the linear mean structure is obviously questionable here too, but we have to start someplace).
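The whole Breusch-Pagan calculation is easy to replicate outside Minitab. Here is a minimal Python sketch, assuming the data are already loaded into NumPy arrays x and y:

import numpy as np
from scipy import stats

n = len(y)
X = np.column_stack([np.ones(n), x])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit y on x
e = y - X @ beta                               # raw residuals
sse = np.sum(e**2)

g, *_ = np.linalg.lstsq(X, e**2, rcond=None)   # regress e^2 on x
ssr_star = np.sum((X @ g - np.mean(e**2))**2)  # SSR* from that fit

bp = (ssr_star / 2) / (sse / n)**2             # approx. chi-square(1) under H0
p_value = stats.chi2.sf(bp, df=1)              # about 2.8e-7 for these data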
A transformation of x alone will not help the non-constant variance, so we concentrate on a power transformation of y first. Of the prototype plots on page 130, this data somewhat resembles Figure 3.15(b). We might try any one of the three suggested transformations of y, but will first try Atkinson's constructed-variable test for some guidance. The following constructs Atkinson's variable and performs the test; the relevant output follows:
MTB > name c3 'w'
MTB > let w = y*(loge(y)-mean(loge(y))-1)
MTB > regress y 2 x w
Predictor       Coef   SE Coef      T      P
Constant      125.57     14.41   8.71  0.000
x            -249.14     41.26  -6.04  0.000
w           0.372809  0.009234  40.38  0.000
The p-value for testing H0: β2 = 0 in the model Yi = β0 + β1 Xi + β2 Wi + εi, and equivalently H0: r = 1 in the model (Yi)^r = β0 + β1 Xi + εi, is 0.000, providing strong evidence that a power transformation will be useful. The estimated power is 1 − 0.37 = 0.63, in the neighborhood of 0.5. Based on the scatterplot above, we might try using log(Yi) or (Yi)^0.5 as response variables. Let's look at these responses versus the Xi's.
MTB > name c4 'sqtY' c5 'logY'
MTB > let sqtY = sqrt(y)
MTB > let logY = loge(y)
Now plot them using the “Graph” pull-down menu. You will see the following commands appear in the session window, in case you want to do all of this by hand:
MTB > Plot 'sqtY'*'x' 'logY'*'x';
SUBC> Symbol;
SUBC> ScFrame;
SUBC> ScAnnotation.
The plots show that the log-transformation has mitigated the non-constant variance but the square-root transformation did not (a plot of (Yi)^-1 versus Xi shows dramatically increasing variance):
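The constructed-variable test above can likewise be replicated by hand. A minimal Python sketch, again assuming the data are in NumPy arrays x and y:

import numpy as np
from scipy import stats

logy = np.log(y)
w = y * (logy - logy.mean() - 1)        # Atkinson's constructed variable

n = len(y)
X = np.column_stack([np.ones(n), x, w])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
mse = np.sum(resid**2) / (n - 3)

# t test of H0: the coefficient on w is zero
se_w = np.sqrt(mse * np.linalg.inv(X.T @ X)[2, 2])
t_stat = beta[2] / se_w
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 3)

power_hat = 1 - beta[2]                 # suggested power for y, about 0.63 here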
Let’s regress the log-transformed Yi’s onto the Xi’s. There are repeated observations at the predictor values and there appears to be some curvature to this plot, so we will also perform the pure-error lack of fit test.
MTB > regress logY 1 x;
SUBC> pure.
Analysis of Variance

Source          DF       SS      MS      F      P
Regression       1   82.721  82.721  81.53  0.000
Residual Error  48   48.699   1.015
  Lack of Fit    3   21.714   7.238  12.07  0.000
  Pure Error    45   26.985   0.600
Total           49  131.419
We obtain the test statistic F* = [(SSE(R) − SSE(F)) / (dfE(R) − dfE(F))] / MSE(F) = [21.714 / 3] / 0.600 = 12.07, with a corresponding p-value of 0.000 for the null hypothesis H0: E{log(Yi)} = β0 + β1 Xi, i.e. that a line is appropriate. From the prototypical plots on page 127, we will try the inverse transformation of the manufacturing ratio: (Xi)^-1. First we define the reciprocal of x:
MTB > name c8 'Xinv'
MTB > let Xinv = 1/x
A scatterplot of log(Yi) versus (Xi)^-1 shows drastic improvement:
We regress log(Yi) onto (Xi)^-1, save the standardized residuals, and present several diagnostic plots. Under “Stat” choose “Regression” and then “Regression…” again. As the “Response:” choose logY and as the “Predictors:” choose Xinv. Click the “Graphs…” button, choose “Standardized” residuals, choose to display a “Histogram of residuals” and the “Residuals versus fits,” and click “OK.” Click “Options…”, check the “Pure error” Lack of Fit Test, and click “OK.” Click “Storage…”, choose to store the “Standardized residuals” and the “Fits” in columns, and click “OK.” Click “OK” again and we have the following output:
Regression Analysis: logY versus Xinv
The regression equation is
logY = 1.15 + 0.497 Xinv
Predictor Coef SE Coef T P
Constant 1.1537 0.1990 5.80 0.000
Xinv 0.49732 0.03677 13.52 0.000
S = 0.7544 R-Sq = 79.2% R-Sq(adj) = 78.8%
Analysis of Variance

Source          DF      SS      MS       F      P
Regression       1  104.10  104.10  182.89  0.000
Residual Error  48   27.32    0.57
  Lack of Fit    3    0.34    0.11    0.19  0.905
  Pure Error    45   26.98    0.60
Total           49  131.42
Unusual Observations

Obs  Xinv   logY    Fit  SE Fit  Residual  St Resid
 19   2.5  0.807  2.397   0.131    -1.590    -2.14R

R denotes an observation with a large standardized residual
Note that Minitab flags observations whose standardized (studentized) residual is unusually large in magnitude. These observations are potential outliers: points not fit well by the model. You can see observation 19 on the previous scatterplot at the bottom of the cluster of points at 1/X19 = 2.5.
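For reference, the standardized residuals Minitab stores (and flags with R when larger than 2 in magnitude) can be computed directly. A sketch, assuming NumPy arrays x (the ratios) and logy (the log failure times):

import numpy as np

n = len(logy)
X = np.column_stack([np.ones(n), 1.0 / x])   # design matrix for logY ~ Xinv
H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix
e = logy - H @ logy                          # raw residuals
mse = np.sum(e**2) / (n - 2)
sres = e / np.sqrt(mse * (1 - np.diag(H)))   # standardized residuals
flagged = np.flatnonzero(np.abs(sres) > 2)   # Minitab's "R" rule of thumb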
The pure-error lack of fit test yields the following: SSLOF = SSE(R) − SSE(F) = 27.32 − 26.98 = 0.34, with dfLOF = dfE(R) − dfE(F) = 48 − 45 = 3 (equivalently c − 2 = 5 − 2 = 3, where c = 5 is the number of distinct predictor levels). Then MSLOF = SSLOF / dfLOF = 0.34 / 3 = 0.11, and F* = MSLOF / MSPE = 0.11 / 0.60 = 0.19 has an F(dfLOF, dfE(F)) distribution when the mean is a line.
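A sketch of the same lack-of-fit computation, assuming NumPy arrays xinv (= 1/ratio) and logy:

import numpy as np
from scipy import stats

n = len(logy)
X = np.column_stack([np.ones(n), xinv])
beta, *_ = np.linalg.lstsq(X, logy, rcond=None)
sse_r = np.sum((logy - X @ beta)**2)                  # SSE(R) from the line

# SSE(F): squared deviations from the mean at each distinct predictor level
levels = np.unique(xinv)
sse_f = sum(((logy[xinv == v] - logy[xinv == v].mean())**2).sum()
            for v in levels)
df_f = n - len(levels)                                # pure-error df: 45
df_lof = len(levels) - 2                              # lack-of-fit df: 3

f_star = ((sse_r - sse_f) / df_lof) / (sse_f / df_f)  # F* = MSLOF / MSPE
p_value = stats.f.sf(f_star, df_lof, df_f)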
We have highly significant regression parameters, and 79.2% of the variability in logY is explained by Xinv. The p-value of 0.905 indicates no lack of fit, and the histogram of standardized residuals and the plot of standardized residuals versus fitted values show no evidence of non-normality, non-constant variance, or a pattern of over- or under-fitting:
The standardized residuals were saved in a column labeled SRES1; we perform the Ryan-Joiner normality test on them (under “Stat”, “Basic Statistics”, and “Normality Test…”) and obtain the following output:
We see the resulting p-value is larger than 0.1. By comparison, the Anderson-Darling test yields a p-value of 0.733 and the Kolmogorov-Smirnov test yields a p-value larger than 0.15. None of the three tests rejects the null hypothesis that the standardized residuals came from a normal distribution. Simply to demonstrate how one would perform the modified Levene test in this circumstance, look at the standardized residuals versus the fitted values above. The plot looks fine, but we may detect a slight decrease in variance for residuals corresponding to fitted values larger than 3. We create an index separating the residuals into groups corresponding to fits larger than 3 and fits smaller than 3 in the following manner, assuming that the fitted values were saved in a column labeled FITS1:
MTB > Name C11 = 'Group'
MTB > Let 'Group' = Sign(FITS1-3)
This places a “1” under Group when a fitted value is larger than 3 and a “-1” otherwise. Under “Stat” we pick “ANOVA” and then “Test for Equal Variances…” Choose SRES1 as the “Response:” and Group as the “Factors:” and click “OK.” We see the following:
Test for Equal Variances

Response   SRES1
Factors    Group
ConfLvl    95.0000

Levene's Test (any continuous distribution)

Test Statistic: 0.520
P-Value       : 0.474
The p-value for the hypothesis that the population variances in the two groups are equal is 0.474, so we fail to reject equality of the variances.
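Both the normality checks and the modified Levene test can be sketched outside Minitab as well. SciPy's probability-plot correlation plays the role of the Ryan-Joiner statistic, and levene with center="median" is the modified (Brown-Forsythe) version. This assumes sres and fits are NumPy arrays holding the stored standardized residuals and fitted values:

import numpy as np
from scipy import stats

# normal-probability-plot correlation: the statistic behind Ryan-Joiner
(osm, osr), (slope, intercept, r) = stats.probplot(sres, dist="norm")

# Anderson-Darling statistic with its critical values
ad = stats.anderson(sres, dist="norm")

# modified (Brown-Forsythe) Levene test, splitting at fitted value 3
stat, p = stats.levene(sres[fits > 3], sres[fits <= 3], center="median")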
Finally, we wish to convert our statistical model back into the original units.
The median survival time as a function of the covariate is estimated to be Med{Y} ≈ exp(1.15 + 0.497/X).
Why? Remember the model is log(Y) = β0 + β1/X + ε where ε ~ N(0, σ²). We have P(ε < 0) = 0.5, and thus P(Y < exp(β0 + β1/X)) = P(exp(β0 + β1/X + ε) < exp(β0 + β1/X)) = P(exp(β0 + β1/X)·exp(ε) < exp(β0 + β1/X)) = P(exp(ε) < 1) = P(ε < 0) = 0.5. We can also obtain prediction intervals, for example when X = 0.3:
MTB > regress logY 1 Xinv;
SUBC> pred 3.33.
Predicted Values for New Observations

New Obs    Fit  SE Fit      95.0% CI          95.0% PI
      1  2.810   0.116  (2.577, 3.043)  (1.275, 4.345)
The prediction interval is then ( exp(1.275), exp(4.345) ) = ( 3.58, 77.09 ) minutes. The estimated median survival time at this ratio is exp( 2.81 ) = 16.6 minutes.
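A closing Python sketch that reproduces the prediction interval and the back-transformation, assuming NumPy arrays xinv and logy as before:

import numpy as np
from scipy import stats

n = len(logy)
X = np.column_stack([np.ones(n), xinv])
beta, *_ = np.linalg.lstsq(X, logy, rcond=None)
mse = np.sum((logy - X @ beta)**2) / (n - 2)

x_new = np.array([1.0, 1 / 0.3])              # intercept, 1/X at X = 0.3
fit = x_new @ beta                            # about 2.81 on the log scale

se_pred = np.sqrt(mse * (1 + x_new @ np.linalg.inv(X.T @ X) @ x_new))
t_crit = stats.t.ppf(0.975, df=n - 2)
lo, hi = fit - t_crit * se_pred, fit + t_crit * se_pred

median_minutes = np.exp(fit)                  # estimated median: about 16.6
pi_minutes = (np.exp(lo), np.exp(hi))         # PI back in minutes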