STAT 516 hw 7
Students in several classes were asked to measure on their left hands the distances A and B. It is of interest to see whether our (humans’) fingers grow according to the golden ratio, such that the mean of the ratio B/A is equal to 1.618.
A comma-separated-values file containing the data can be downloaded from here. It can be read into R with the following code:
gr <- read.csv(pathtofile, # replace pathtofile
               colClasses = c("numeric","numeric","factor"))
gr$r <- gr$B / gr$A # compute ratio B / A
There were different numbers of students in each class, so the design is unbalanced. In the first part of this homework we will make a balanced version of the data set; in the second part we will work with the full data set.
Part I
Construct a balanced data set by running the code below, which draws (without replacement) a subsample of size
# fix the random seed so the same subsets
# are drawn every time the code runs
set.seed(1)
classes <- levels(gr$class)
a <- length(classes)
n <- 15
gr_bal <- data.frame()
for(i in 1:a){
ind <- which(gr$class == classes[i])
gr_bal <- rbind(gr_bal,gr[sample(ind,n),])
}
head(gr_bal)
     B   A       class        r
71 4.1 2.5 515_fa_2023 1.640000
50 5.4 3.2 515_fa_2023 1.687500
53 5.3 3.2 515_fa_2023 1.656250
47 5.3 3.2 515_fa_2023 1.656250
48 5.3 3.3 515_fa_2023 1.606061
57 4.7 3.1 515_fa_2023 1.516129
table(gr_bal$class)
515_fa_2023 515_sp_2024 515_sp_2026 516_sp_2026
15 15 15 15
1.
Using the balanced data, report the mean of the ratio B/A in each class.
aggregate(r ~ class, data = gr_bal, mean)
        class        r
1 515_fa_2023 1.555191
2 515_sp_2024 1.663325
3 515_sp_2026 1.646862
4 516_sp_2026 1.492417
2.
Produce side-by-side boxplots of the observed responses in the different classes.
boxplot(r ~ class, data = gr_bal)
3.
Obtain method of moments estimators of the random effect variance and the error term variance.
lm_out <- lm(r ~ class, data = gr_bal)
y <- gr_bal$r
y.. <- mean(gr_bal$r)
SStot <- sum((y - y..)**2)
SSerror <- sum(lm_out$residuals**2)
SStrt <- SStot - SSerror
MSerror <- SSerror / (a*(n-1))
MStrt <- SStrt / (a-1)
sigma_e_mom <- sqrt(MSerror)
sigma_A_mom <- sqrt((MStrt - MSerror)/n)
We obtain 0.004569735 for the random effect variance and 0.02816821 for the error term variance.
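The method of moments estimators come from equating the observed mean squares to their expectations, E[MSerror] = sigma_e^2 and E[MStrt] = sigma_e^2 + n*sigma_A^2. The following Monte Carlo sketch checks these expectations on simulated stand-in data (the variance values 0.07 and 0.17 are illustrative choices near our estimates, not the class data):

```r
# Monte Carlo check of the mean-square expectations behind the MoM
# estimators, using simulated one-way random effects data
set.seed(42)
a <- 4; n <- 15
sigma_A <- 0.07; sigma_e <- 0.17            # illustrative true values
R <- 2000
ms <- replicate(R, {
  A_i <- rnorm(a, 0, sigma_A)               # class random effects
  y <- rep(A_i, each = n) + rnorm(a*n, 0, sigma_e)
  grp <- factor(rep(1:a, each = n))
  fit <- lm(y ~ grp)
  SSerror <- sum(fit$residuals^2)
  SStrt <- sum((y - mean(y))^2) - SSerror
  c(MStrt = SStrt/(a-1), MSerror = SSerror/(a*(n-1)))
})
# averages should be near sigma_e^2 + n*sigma_A^2 = 0.1024 and sigma_e^2 = 0.0289
rowMeans(ms)
```

Solving the two moment equations for the variance components gives exactly the sigma_A_mom and sigma_e_mom formulas used above.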
4.
Check whether these are the same as the restricted maximum likelihood estimates obtained with lmer().
library(lmerTest)
lmer_out <- lmer(r ~ 1 + (1 | class),data = gr_bal)
summary(lmer_out)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: r ~ 1 + (1 | class)
Data: gr_bal
REML criterion at convergence: -35.4
Scaled residuals:
Min 1Q Median 3Q Max
-3.2347 -0.4149 0.2143 0.5852 1.6928
Random effects:
Groups Name Variance Std.Dev.
class (Intercept) 0.00457 0.0676
Residual 0.02817 0.1678
Number of obs: 60, groups: class, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.58945 0.04015 3.00000 39.59 3.55e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
5.
Report the test statistic and p value for testing whether the variance of the class random effect is greater than zero.
anova(lm_out)
Analysis of Variance Table
Response: r
Df Sum Sq Mean Sq F value Pr(>F)
class 3 0.29014 0.096714 3.4335 0.02293 *
Residuals 56 1.57742 0.028168
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Ftest <- MStrt / MSerror
pval <- 1 - pf(Ftest,a-1,a*(n-1))
The F statistic for testing whether the variance of the random effect for class is greater than zero is 3.433454, with a p value of 0.02293202. The fairly small p value indicates evidence of significant variability in the responses from classroom to classroom.
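Under the null hypothesis sigma_A^2 = 0, the statistic MStrt/MSerror follows an F distribution with a-1 and a(n-1) degrees of freedom, which is what the pf() call above uses. A quick simulation sketch on data generated with no class effect (illustrative, not the class data) confirms the test has roughly its nominal 5% size:

```r
# size check of the F test for sigma_A^2 = 0 when H0 is true
set.seed(7)
a <- 4; n <- 15; R <- 5000
grp <- factor(rep(1:a, each = n))
Fstats <- replicate(R, {
  y <- rnorm(a*n)                    # no class effect: H0 holds
  anova(lm(y ~ grp))$`F value`[1]
})
mean(Fstats > qf(0.95, a-1, a*(n-1)))  # should be close to 0.05
```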
6.
Obtain for each class a prediction (a guess) of the realized value of the class random effect.
yi. <- aggregate(r~class, mean, data = gr_bal)$r
Ahat <- (n * sigma_A_mom**2)/(n * sigma_A_mom**2 + sigma_e_mom**2) * (yi. - y..)
names(Ahat) <- classes
Ahat
515_fa_2023 515_sp_2024 515_sp_2026 516_sp_2026
-0.02428045  0.05235980  0.04069176 -0.06877111
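The predictor above is a shrinkage estimator: each class-mean deviation from the grand mean is pulled toward zero by the factor n*sigma_A^2 / (n*sigma_A^2 + sigma_e^2), which lies between 0 and 1. A sketch on simulated stand-in data, with the true variance components treated as known (values illustrative, not the class estimates):

```r
# shrinkage of class-mean deviations toward zero (simulated data)
set.seed(3)
a <- 4; n <- 15
sigma_A <- 0.07; sigma_e <- 0.17           # illustrative true values
y <- rep(rnorm(a, 0, sigma_A), each = n) + rnorm(a*n, 0, sigma_e)
grp <- factor(rep(paste0("class", 1:a), each = n))
yi. <- tapply(y, grp, mean)
shrink <- n*sigma_A^2 / (n*sigma_A^2 + sigma_e^2)  # about 0.718 here
Ahat <- shrink * (yi. - mean(y))
cbind(deviation = yi. - mean(y), Ahat)     # predictions pulled toward 0
```

The larger the error variance relative to n*sigma_A^2, the more the class means are shrunk, reflecting that noisy class means carry less information about the realized random effects.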
7.
Construct a 95% confidence interval for the overall mean of the ratio.
alpha <- 0.05
tval <- qt(1-alpha/2,a-1)
me <- tval * sqrt(MStrt/(a*n))
lo <- y.. - me
up <- y.. + me
c(lo,up)
[1] 1.461678 1.717219
# Automatic lmer confidence interval
confint(lmer_out)
Computing profile confidence intervals ...
2.5 % 97.5 %
.sig01 0.0000000 0.1649464
.sigma 0.1409682 0.2044551
(Intercept) 1.5011420 1.6777557
Since the 95% confidence interval (computed by either method) contains the null value 1.618, it does seem plausible that the true mean is equal to this value.
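The manual interval uses MStrt with a-1 degrees of freedom because, in the balanced model, Var(ybar..) = (n*sigma_A^2 + sigma_e^2)/(a*n) = E[MStrt]/(a*n) and ybar.. is independent of MStrt, making the pivot an exact t with a-1 df. A simulation sketch checks the coverage (simulated data with the true mean set to 1.618; variance values illustrative):

```r
# coverage check of the interval ybar.. +/- t_{a-1} * sqrt(MStrt/(a*n))
set.seed(11)
a <- 4; n <- 15; mu <- 1.618
sigma_A <- 0.07; sigma_e <- 0.17          # illustrative true values
grp <- factor(rep(1:a, each = n))
R <- 4000
cover <- replicate(R, {
  y <- mu + rep(rnorm(a, 0, sigma_A), each = n) + rnorm(a*n, 0, sigma_e)
  MStrt <- anova(lm(y ~ grp))$`Mean Sq`[1]
  abs(mean(y) - mu) < qt(0.975, a-1) * sqrt(MStrt/(a*n))
})
mean(cover)   # should be close to 0.95
```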
8.
Remove the set.seed(1) command (so that new subsamples of data are drawn in the construction of the balanced data set) and run your analysis again: Do you get the same results? How stable is the analysis? Do you have any concerns?
This is an open-ended question. Some subsamples will carry insufficient evidence to reject the null hypothesis that the random effect variance is equal to zero, suggesting that the full data do not carry strong evidence against this null. Most of the time, however, one obtains an interval for the overall mean which includes the golden ratio, so our main conclusion appears to be quite reliable.
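One way to make this check systematic is to redraw the balanced subsample many times and record the F-test p value each time. The sketch below does this on simulated stand-in data with the Part II class sizes (for the real analysis, `full` would be replaced by `gr`):

```r
# stability of the F test across repeated balanced subsamples
set.seed(99)
sizes <- c(27, 21, 25, 19)                 # class sizes from Part II
cls <- factor(rep(paste0("class", 1:4), times = sizes))
r <- 1.6 + rep(rnorm(4, 0, 0.07), times = sizes) +
     rnorm(sum(sizes), 0, 0.17)            # simulated stand-in responses
full <- data.frame(r = r, class = cls)
n <- 15
pvals <- replicate(500, {
  ind <- unlist(lapply(levels(full$class),
                       function(cl) sample(which(full$class == cl), n)))
  anova(lm(r ~ class, data = full[ind, ]))$`Pr(>F)`[1]
})
mean(pvals < 0.05)   # fraction of subsamples that reject H0
```

If this fraction is far from 0 and far from 1, the reject/fail-to-reject decision is sensitive to the particular subsample drawn, which is a legitimate concern about the subsampling approach.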
Part II
Now use the entire data set.
1.
Produce side-by-side boxplots of the responses in the treatment groups.
boxplot(r ~ class, data = gr)
2.
Report the number of observations in each class.
table(gr$class)
515_fa_2023 515_sp_2024 515_sp_2026 516_sp_2026
27 21 25 19
3.
Check whether the assumptions of the one-way random effects model hold for these data.
lmer_out <- lmer(r ~ 1 + (1|class), data = gr)
plot(lmer_out)
yhat <- predict(lmer_out)
y <- gr$r
ehat <- y - yhat
qqnorm(scale(ehat))
abline(0,1)
There seems to be some deviation from normality in the normal QQ plot, and one of the classes appears to have a variance somewhat smaller than that of the others. Since the sample sizes are moderate, the sample means will be approximately normal, though we should still be a little skeptical of our confidence intervals and p values.
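Two simple numerical companions to these plots are the per-class standard deviations (for the constant-variance assumption) and a Shapiro-Wilk test of the residuals (for normality). A sketch on simulated stand-in data with the Part II class sizes (for the real analysis, replace `dat` with `gr`):

```r
# numerical checks of the one-way random effects model assumptions
set.seed(5)
sizes <- c(27, 21, 25, 19)
dat <- data.frame(
  r = 1.6 + rep(rnorm(4, 0, 0.07), times = sizes) +
      rnorm(sum(sizes), 0, 0.17),          # simulated stand-in responses
  class = factor(rep(paste0("class", 1:4), times = sizes))
)
aggregate(r ~ class, data = dat, sd)       # look for one SD far from the rest
shapiro.test(residuals(lm(r ~ class, data = dat)))
```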
4.
Obtain a 95% confidence interval for the overall mean using the full data set.
confint(lmer_out)
Computing profile confidence intervals ...
2.5 % 97.5 %
.sig01 0.000000 0.1323279
.sigma 0.135656 0.1824887
(Intercept) 1.530808 1.6715922
5.
Consider the interval computed by the code below. Explain the strategy behind the construction of this interval and carefully explain whether you think it is appropriate or inappropriate (trustworthy or untrustworthy).
y <- gr$r
y.. <- mean(y)
sn <- sd(y)
N <- length(y)
alpha <- 0.05
tval <- qt(1-alpha/2,N-1)
lo <- y.. - tval * sn / sqrt(N)
up <- y.. + tval * sn / sqrt(N)
c(lo,up)
[1] 1.568001 1.635472
This is the ordinary confidence interval for a mean
based on a single random sample. This is the interval one would
compute if one had collected all the data in a single class,
regarding this as a random sample. Note that one does not come to
a different conclusion vis-a-vis the golden ratio hypothesis
based on this interval. However, this interval ignores the
possibility of random class-to-class variability, which may not be
appropriate. If one has not found any significant
evidence of class-to-class variability, it may be okay to pool
all the data together and treat the values as though they arose
from a single sample. I believe, however, that this would be
somewhat naive, due to the fact that (at least in our exploration
of the balanced data) we obtained a fairly small p value for testing
the null hypothesis that the variance governing class-to-class
variation was equal to zero (this p value is somewhat harder
to obtain in the unbalanced case). A possible warning against
using this interval is that it turned out to be narrower than
the interval we obtained for the overall mean under the one-way
random effects model. If there were no need to include the random
effect, the two intervals would have more similar widths.
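The width comparison can be made concrete: when there is genuine class-to-class variability, observations within a class are correlated, so the pooled one-sample interval is typically narrower than it should be, both because sd(y)/sqrt(N) understates the uncertainty in the grand mean and because it uses N-1 rather than a-1 degrees of freedom. A sketch on simulated data (balanced for simplicity; variance values illustrative):

```r
# widths of the pooled interval versus the random-effects interval
set.seed(21)
a <- 4; n <- 15; N <- a*n
y <- rep(rnorm(a, 0, 0.07), each = n) + rnorm(N, 0, 0.17)
grp <- factor(rep(1:a, each = n))
# ordinary one-sample interval width (treats all N values as one sample)
w_pooled <- 2 * qt(0.975, N-1) * sd(y)/sqrt(N)
# random-effects interval width for the overall mean
MStrt <- anova(lm(y ~ grp))$`Mean Sq`[1]
w_re <- 2 * qt(0.975, a-1) * sqrt(MStrt/N)
c(pooled = w_pooled, random_effects = w_re)
```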