The following R code reads in a data set containing, for each of 7 days, the lengths of time in hours spent by women in the delivery suite while giving birth (without a ceasarian section) at John Radcliffe Hospital in Oxford, England. The data are taken from Davison (2003).
# Read in the data for each day:
day1 <- c(2.1,3.4,4.25,5.6,6.4,7.3,8.5,8.75,8.9,9.5,9.75,10,10.4,10.4,16,19)
day2 <- c(4,4.1,5,5.5,5.7,6.5,7.25,7.3,7.5,8.2,8.5,9.75,11,11.2,15,16.5)
day3 <- c(2.6,3.6,3.6,6.4,6.8,7.5,7.5,8.25,8.5,10.4,10.75,14.25,14.5)
day4 <- c(1.5,4.7,4.7,7.2,7.25,8.1,8.5,9.2,9.5,10.7,11.5)
day5 <- c(2.5,2.5,3.4,4.2,5.9,6.25,7.3,7.5,7.8,8.3,8.3,10.25,12.9,14.3)
day6 <- c(4,4,5.25,6.1,6.5,6.9,7,8.45,9.25,10.1,10.2,12.75,14.6)
day7 <- c(2,2.7,2.75,3.4,4.2,4.3,4.9,6.25,7,9,9.25,10.7)
# Combine the data together, get sample size:
hours <- c(day1,day2,day3,day4,day5,day6,day7)
n <- length(hours)
n
## [1] 95
# Produce a histogram and Normal QQ plot of the data
par(mfrow=c(1,2))
hist(hours)
qqnorm(scale(hours))
abline(0,1)
Suppose we regard the data as a random sample from the population of all times spent in the delivery suite by women while giving birth. If we regard the data as a random sample, then we are assuming that the \(n=95\) values in the data set are realizations of some indepedent random variables \(X_1,\dots,X_{95}\), where each of these random variables has the same probability density function \(f\), say, where \(f\) is not known to us. We study in STAT 512 what we can learn about \(f\) from the values \(X_1,\dots,X_{95}\). Here are just a couple of examples.
We may wish to know the value of \(\mu := \int_{\mathbb{R}}xf(x)dx\), which would be the expected value of the time spent in the delivery suite. We will learn that the sample mean \[ \bar X_n = n^{-1}\sum_{i=1}^n X_i \] is a good estimator of \(\mu\).
# Estimate the mean time in the delivery suite for the "population":
mean(hours)
## [1] 7.723158
We may wish to associate some measure of uncertainty with our estimate of \(\mu\). We will learn that the interval (which we will call a confidence interval) \[ \bar X_n \pm \Phi^{-1}(1-\alpha/2)\frac{S_n}{\sqrt{n}} \] will contain the true value \(\mu\) with probability approaching \(1-\alpha\) as \(n\to\infty\), where \(\Phi(z) = \int_{-\infty}^z \frac{1}{\sqrt{2\pi}}e^{-t^2/2}dt\) and \[ S_n = \sqrt{\frac{1}{n-1}\sum_{i=1}^n(X_i - \bar X_n)^2 }. \]
# Construct a 95% confidence interval for the population mean:
lower <- mean(hours) - qnorm(.975) * sd(hours) / sqrt(n)
upper <- mean(hours) + qnorm(.975) * sd(hours) / sqrt(n)
c(lower,upper)
## [1] 7.005656 8.440660
We might assume that \(f\) is one of the gamma pdfs for some values of \(\alpha\) and \(\beta\); then we might wish to estimate \(\alpha\) and \(\beta\).
# Assume distribution is gamma and estimate alpha and beta using maximum-likelihood:
library(MASS) # The R package MASS has the function fitdistr()
mle <- fitdistr(hours, "gamma", lower = 0)$estimate
shape.hat <- mle[1] # estimated value of alpha
rate.hat <- mle[2] # estimated value of 1/beta
par(mfrow=c(1,2))
# Plot a histogram of the data and overlay the best-fitting gamma density:
x.seq <- seq(0,20,length=100)
hist(hours,freq=FALSE)
lines(dgamma(x.seq,shape=shape.hat,scale=1/rate.hat)~x.seq,col="blue",lwd=2)
# Q-Q plot comparing the data quantiles to those of the best-fitting gamma density:
gamma.quantiles <- qgamma((1:n)/(n+1),shape=shape.hat,scale=1/rate.hat)
plot(sort(hours)~gamma.quantiles,
xlab=paste("Gamma(alpha=",round(shape.hat,2),
", beta=",round(1/rate.hat,2),") Q'tiles",sep=""),
ylab="Data Quantiles")
abline(0,1)
Looking at the quantile-quantile plot on the right, it seems quite reasonable to assume that the values \(X_1,\dots,X_{95}\) have come from a gamma distribution.
Davison, A. C. Statistical Models. Cambridge University Press, 2003.