If you get stuck and need to contact your instructor, please include a copy of the data set you are using and the code (or menu options) that you are trying.
"The Basics of R" section below is designed to be fairly easy to go through. A .pdf copy of that section and of the output can be downloaded and printed out to take notes on. This entire page is designed to be worked through while you are on a computer with both R and a copy of this page open.
The data sets for the textbook can be found at https://www.pearsonhighered.com/mathstatsresources/#author=M in the Statistics, 13/e section. They can be downloaded to your computer as a zip file of TXT files and then read in from there. When reading them in, the slashes in any file path must always be forward slashes (/) in your R code.
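For example, here is a minimal sketch of reading one of the unzipped files from your own computer. The folder and file name below are just placeholders (substitute wherever you actually saved the files), and header=TRUE assumes the first row of the file contains the column names.
# read a downloaded data file; note the forward slashes in the path
exampledata<-read.table("C:/StatData/EXAMPLE.txt",header=TRUE)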
The Basics of R:
R, like the commercial S-Plus, is based on the programming language S. It can be downloaded from www.r-project.org [at home, after reading the instructions, choose: CRAN, then a mirror site in the US, then "Download R for Windows", then "install R for the first time", then "Download R 3.4.0 for Windows" -- there are also links for the Mac and Linux]. The CRAN page also has links to FAQs and other information about R.
When you start R, you will see several standard looking menus along the top (File, Edit, etc...) and a white R Console window taking up the majority of the screen. This window is where you will both enter any data, commands, or "programs" that you wish to run, and also where the output will appear.
Because this window is used for so many things it often fills up quickly, so if you are doing anything involving a large number of steps it is often easiest to type them in a text editor first. There are several editors built into or designed to work with R. One option is to choose New Script under the File menu. You can then enter your commands in that window and run them by selecting the text and pressing Ctrl+R. You can also save those scripts for later use. We've also found it easy across platforms and time to use a really basic program like Notepad in Windows and then copy and paste the code over as needed.
One of the nice things about R (besides the fact that it's powerful and free) is that you can save everything you've done each time you finish. If you are working on a machine you have administrator access to, it will ask you if you want to save the workspace when you quit the program (choose yes, and all the entered data and functions will be there the next time you start it up). If you are on a networked machine where you don't have administrator access, or if you want to be able to access the saved data from several machines, then you can load and save the .RData file on a disk or pen-drive (use the Load Workspace and Save Workspace options under the File menu).
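If you prefer typing commands to using the menus, a minimal sketch of doing the same thing from the console is below (the file name here is just a placeholder):
# save everything in the current workspace to a file, then reload it later
save.image("mywork.RData")
load("mywork.RData")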
At the heart of R are the various objects that you enter. At any time (when you are in R) you can type the function objects() to see which ones you have entered previously. Presumably there aren't any there right now if you've never used R before. R also has many built-in objects, most of which are functions. Because these objects are always there, it will not list them when you type objects().
A list of the various functions for R can be found in the manuals that are installed with the program. The various manuals can be found under the Manuals submenu of the Help menu. In particular, the R Reference Manual lists not only all the functions in the base package of R, but also some of those in the supplementary packages as well. Because this manual is nearly 1000 pages long, you probably shouldn't print out a copy!!!! A large number of supplementary packages (functions that others wrote and decided to make available to the public) can be found under the Packages menu by choosing Load Package. One popular one is the MASS library that is designed to accompany Venables and Ripley's Modern Applied Statistics with S. It can also be loaded in by typing library(MASS).
Once you know the name of a function, you can get help about it simply by typing help(functionname). In R, parentheses indicate either mathematical grouping, like they usually do, or that you are applying a function. Square brackets [ ] indicate that you are selecting an element from a vector or data set.
The following demonstrates some of the basic ideas of how R syntax works. Read along, and whenever you see something in typewriter font, cut and paste it into the R window and hit return.
x<-c(5,4,3,2)
The <- is the command that assigns whatever is on the right to the name that is on the left. The c indicates that an array or vector of values is being entered. Simply typing x now will return those four values. Typing
x[3]
will return the third value,
mean(x)
will return the mean, and
objects()
will show you that you now have an object called x. Other functions to try include sd(x), var(x), median(x), and summary(x).
R also has a variety of graphical functions built in.
hist(x)
will construct a histogram. Unfortunately it uses the right end-point rule by default and fudges a few things sometimes! Checking help(hist) will reveal that "A numerical tolerance of 1e-7 times the median bin size is applied when counting entries on the edges of bins."
We can get the correct histogram by ensuring that the breaks cover the full range of the data and telling it to use the left endpoint rule.
hist(x,breaks=seq(1,7,by=1),right=F)
Try it with breaks=seq(2,5,by=1) and see that the fudging at the edges messes things up!
If this were discrete data, the way to get the axis labeled correctly would be
hist(x,breaks=seq(1.5,5.5,by=1),right=F)
It is also very easy to simulate data sets in R. For example, the following code will make 1,000 fake IQ scores from a normal distribution with mean 100 and standard deviation 15. (If you run this yourself you will get slightly different values since you will have your own random sample of 1,000 IQs!)
IQ<-rnorm(1000,mean=100,sd=15)
IQ
mean(IQ)
sd(IQ)
hist(IQ,right=F)
Besides just getting the result of a function, we can store the results somewhere. In the code above we put the values from the function rnorm into IQ. As another example,
medianofIQ<-median(IQ)
will store the median in medianofIQ.
We can use some built-in functions in R to check how the empirical rule and Chebyshev's inequality work for this data.
sum((IQ>=85)&(IQ<115))/1000
sum((IQ>=70)&(IQ<130))/1000
sum((IQ>=55)&(IQ<145))/1000
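Each of those lines works because a comparison like IQ>=85 returns a vector of TRUEs and FALSEs, and sum() counts the TRUEs as 1s; dividing by 1,000 turns the count into a proportion. As a small sketch of the same idea, mean() on a logical vector gives the proportion directly, so this should match the first line above:
mean((IQ>=85)&(IQ<115))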
If we wanted to try this on a skewed data set, we could use something we will see later called the chi-squared distribution.
Newdat<-rchisq(1000,df=2)
hist(Newdat)
mean(Newdat)
median(Newdat)
We can check the percents again using:
sum((Newdat>=mean(Newdat)-sd(Newdat))&(Newdat<mean(Newdat)+sd(Newdat)))/1000
sum((Newdat>=mean(Newdat)-2*sd(Newdat))&(Newdat<mean(Newdat)+2*sd(Newdat)))/1000
sum((Newdat>=mean(Newdat)-3*sd(Newdat))&(Newdat<mean(Newdat)+3*sd(Newdat)))/1000
Percentiles can be gotten using the quantile function. For example:
quantile(IQ,.5)
quantile(IQ,.25)
quantile(IQ,.75)
will give the median, 1st quartile (25th-percentile), and 3rd quartile (75th-percentile). help(quantile) explains the nine different ways it has built in to calculate the quantiles. Using the option type=6 gives the same result as Minitab and SAS (e.g. quantile(IQ,.75,type=6) ).
You can also write your own functions in R. For example, if you wanted a function that returned the inter-quartile range you could write:
IQRfun<-function(datavector,type=7){
# find the 75th and 25th percentiles and return the difference
Q3<-quantile(datavector,.75,type=type)
Q1<-quantile(datavector,.25,type=type)
as.numeric(Q3-Q1)}
After copying that function over, try IQRfun(IQ). You can compare it to the built-in function IQR(IQ), which didn't exist in older versions of the program. The function above also lets you change the quantile type, though.
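For example, just as a usage sketch of the function defined above (the first two lines should agree with each other, and the last uses the Minitab/SAS convention mentioned earlier):
IQRfun(IQ)
IQR(IQ)
IQRfun(IQ,type=6)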
The five number summary (plus the mean) can be gotten by using:
summary(IQ)
Unfortunately it isn't set up to easily allow you to change the type.
A boxplot is a graph based on the five-number summary:
boxplot(IQ)
boxplot(Newdat)
R is able to read data in both from files and from appropriately formatted web-pages. The following command will read in the data set contained at http://people.stat.sc.edu/habing/515/data/examp9p4.txt that is used in example 9.4 in the text.
e9p4 <-read.table("http://people.stat.sc.edu/habing/515/data/examp9p4.txt",header=TRUE)
Typing e9p4 will show you the entire data set, and attributes(e9p4) will give you a list of the various names associated with this data set.
Try the following to see if you can tell what they are returning:
e9p4[,2]
e9p4[2,]
e9p4$value
The first one gives the second column of the data set - the notation inside the square brackets is always of the form "[which rows, which columns]". e9p4[,2] says to give the second column and, since nothing is specified before the comma, all of the rows. The second one, e9p4[2,], therefore gives the second row of the data set. The third one, e9p4$value, uses one of the names you saw when you entered attributes(e9p4) to specify a column by name. When using names in this way, the name always comes after a dollar sign.
If you wanted to, you could put the two variables in their own vectors using either the column names:
group<-e9p4$group
value<-e9p4$value
or the column numbers:
group<-e9p4[,1]
value<-e9p4[,2]
NOTE: Many functions in R will not work directly on a data set you read in using read.table. You need to either specify which columns you want to use (like in the lines above) or, if all of the columns are numeric data, you can convert the whole thing using as.matrix(name.you.used.for.the.table).
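As a small self-contained sketch of that last point (using a made-up all-numeric data frame rather than e9p4, since e9p4 has a text column):
fake<-data.frame(a=c(1,2,3),b=c(4,5,6))
# the mean of all six numbers in the table
mean(as.matrix(fake))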
We could also make subsets of the data. For example, try:
new<-e9p4[e9p4[,1]=='new',2]
new
stand<-e9p4[e9p4[,1]=='stand',2]
stand
What do these seem to return? Why?
The first part in the square brackets says which rows to keep: all the ones where the first column has the value 'new' (or 'stand' in the second case) - notice the double equal sign is used there for "is equal to". The second part in the square brackets, the 2, says to keep only the second column (the one with the values). (Check the original data set to make sure it was done correctly!)
Boxplots of the two groups could be made for comparison using either:
plot(group,value)
or
boxplot(value~group)
The ~ will be used later on when designing statistical models.
It is a lot easier to compare the median and other percentiles of distributions using boxplots than it is with histograms:
par(mfrow=c(2,1))
hist(new,right=F,breaks=seq(60,90,by=5))
hist(stand,right=F,breaks=seq(60,90,by=5))
par(mfrow=c(1,1))
Finally, one more plotting example for our simulated normal data set:
hist(IQ,freq=F,xlim=c(45,150),ylim=c(0,.04),right=F)
par(new=T)
plot(density(IQ),xlim=c(45,150),ylim=c(0,.04),xlab="",ylab="",main="")
A Few Common Functions:
Mathematical functions:
sqrt() - square root
exp() - exponent (e to the power)
log() - the natural logarithm by default
abs() - absolute value
floor() - round down
ceiling() - round up
round() - round to the nearest (an exact .5 is rounded to the nearest even digit)
Statistical functions:
mean() - find the mean
median() - find the median
sd() - find the standard deviation
var() - find the variance
quantile() - find the quantiles (percentiles); requires the data and the percentile you want, e.g. quantile(normal.sample,.5) is the median
max() - find the maximum
min() - find the minimum
summary() - find the five-number summary (plus the mean)
hist() - construct a histogram
boxplot() - construct a boxplot
qqnorm() - construct a normal quantile-quantile plot
qqline() - add the line to a normal quantile-quantile plot
Probability Distributions:
Probability distributions in R are typically accessed using one of the letters d (the density or mass function), p (the cumulative distribution function), q (the opposite of the cumulative distribution function), or r (generating a random value) in front of the abbreviation for the distribution. Example abbreviations include binom, hyper, pois, exp, and norm for the basic distributions we have discussed. It is often important to use the help function to make sure you are using the parameters in the right way. The hypergeometric, in particular, is parameterized slightly differently.
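As a quick illustration of that naming pattern for the normal distribution (these four functions are built in, and any values would work here):
dnorm(0)       # height of the standard normal density at 0
pnorm(1.33)    # P(Z <= 1.33)
qnorm(0.95)    # the z value with 95% of the area below it
rnorm(5)       # five random standard normal values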
*Binomial: The first three values in example 4.11 on page 206 can be gotten using the probability mass function dbinom. The first value is x, the second value is n, and the third value is p:
dbinom(0,4,.1)
dbinom(1,4,.1)
dbinom(2,4,.1)
The answers to parts b and c of example 4.13 on page 208 can be gotten using the cumulative distribution function pbinom:
pbinom(10,20,.6)
1-pbinom(12,20,.6)
*Poisson: The answer to part b of example 4.14 on page 215 can be found using the probability mass function dpois. The answers to parts c and d can be found using the cumulative distribution function ppois. In both cases the first value is x and the second value is lambda.
dpois(5,2.6)
ppois(1,2.6)
1-ppois(5,2.6)
*Hypergeometric: The answer to part b of example 4.15 on page 220 can be found using the probability mass function dhyper. The first value is x, the second value is r, the third value is N-r, and the final value is n.
dhyper(0,4,10-4,3)
*Normal: Normal probabilities can be found using the cumulative distribution function pnorm. The answers to example 5.3 on page 241 can be found by finding the area to the left of 1.33 and subtracting off the area to the left of -1.33.
pnorm(1.33)-pnorm(-1.33)
To get the 0.4082 values in the book you would use:
pnorm(0)-pnorm(-1.33)
pnorm(1.33)-pnorm(0)
Remembering that the cumulative distribution function gives the probability of being less than or equal to, the answer to example 5.4 on page 242 is found using:
1-pnorm(1.64)
For normals that aren't standard normal, you can just enter in the mean and standard deviation as the second and third arguments. For example 5.7 on page 244 it would be:
pnorm(12,10,1.5)-pnorm(8,10,1.5)
Note that this differs from the book because the book rounds to 1.33 instead of using 4/3.
To get the z or x values that go with a certain probability, you would use the inverse cumulative distribution function qnorm. The code to solve examples 5.9-5.11 on page 246-248 is:
qnorm(.9)
qnorm(.025)
qnorm(.9,550,100)
*Exponential: The probabilities for an exponential distribution can be found using the cumulative distribution function pexp. By default it uses the rate lambda = 1/theta (not the mean theta) as the parameter. So to find the values for examples 5.14 and 5.15b on pages 268-269, you would use:
1-pexp(5,1/2)
pexp(5,1/6.25)
The answer to 5.15c uses the inverse cumulative distribution function:
qexp(.275,1/6.25)
*t, chi-squared, and F: The cumulative distribution functions (pt, pchisq, pf) and inverse cumulative distribution functions (qt, qchisq, qf) for these three distributions work just like those for the normal and exponential. The second parameter (and third parameter for the F) are the degrees of freedom.
#Highlighted value in Table 7.3 on page 328:
qt(.975,4)
#p-value for example 8.9 on page 396:
pt(-2.60,9)
#Highlighted value in Table 8.7 on page 418:
qchisq(0.050,9)
#p-value that goes with chi-square=1.524 in Figure 8.29 on page 419:
pchisq(1.524,9)
#F-value in figure 9.20 on page 481:
qf(.95,12,17)
#p-value that goes with example 9.11 on page 481:
2*(1-pf(4.24,12,17))
Assessing Normality:
Code for example 5.12 on pages 254-256.
The dataset EPAGAS can either be found on the CD with the book, or from the web.
#Read in the data using the scan function since it is
# only one column. You could also read it in using read.table
# as in the introductory section.
EPA<-scan("http://people.stat.sc.edu/habing/515/data/EPAGAS.txt")
#Figure 5.21a
#Draw the histogram, setting the breaks, making the correct end-point rule
# and put the density instead of frequency on the axis
hist(EPA,breaks=seq(29.5,45.5,by=1),right=F,freq=F)
#Set up a sequence of values so that we can draw the normal curve
MPG<-seq(min(EPA),max(EPA),by=0.1)
#Draw a normal curve at the values we set-up using the mean and standard
# deviation of the original data set
lines(MPG,dnorm(MPG,mean(EPA),sd(EPA)))
#Figure 5.21b
#Get the 5 number summary using quantile method 6, as well as the mean and sd
mean(EPA)
sd(EPA)
quantile(EPA,c(0,.25,.5,.75,1),type=6)
IQR(EPA,type=6)
IQR(EPA,type=6)/sd(EPA)
#Figure 5.21c
#Note that the default in R is like we did in class.
# It is turned sideways from the example in the book
# and R uses a slightly more robust line - it passes through
# the 25th and 75th percentiles so that outliers
# don't affect it like the one in the book.
qqnorm(EPA);qqline(EPA)
#To get the graph like the book (I wouldn't!) you could use
qqnorm(EPA,datax=T)
abline(lm(qnorm(ppoints(EPA))~sort(EPA)))
#Code for Table 5.3 can be found in the "Basics of R" section above.
Confidence Intervals:
*Interval for one mean: Example 7.5, pages 330-331
printhead<-c(1.13,1.55,1.43,.92,1.25,1.36,1.32,.85,1.07,1.48,1.20,1.33,1.18,1.22,1.29)
t.test(printhead,conf.level=0.99)
qqnorm(printhead)
qqline(printhead)
Note that the same function makes both confidence intervals and does the test. The options for the test sometimes change the interval though, so I wouldn't look at the CI if you specified an alternative (unless you meant to find a lower or upper confidence bound instead of an interval). The value of conf.level tells you what percent the confidence interval is for. The last two lines are for checking the assumptions. This data is either a bit skewed left or a bit heavy tailed, but doesn't look too bad (especially since this procedure is fairly robust).
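If you really did want a one-sided bound, here is a sketch of what that looks like for this data (the reported "interval" will then run from the lower bound up to infinity):
t.test(printhead,conf.level=0.99,alternative="greater")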
*Interval for One Proportion: Example 7.8, pages 340-341
There are many different versions of the confidence interval for proportions, and R doesn't do the one we want. To get the Agresti-Coull confidence interval (on page 340) I would first cut and paste in the following function:
propAC<-function(x1,n1,x2=0,n2=0,type="4",conf.level=0.95){
# critical value, and the total number of "extra" observations to add
za2<-qnorm(1-(1-conf.level)/2)
if (type=="4"){
adj<-4}
if (type=="z"){
adj<-za2^2}
# one-sample (Agresti-Coull) interval
if (n2==0){
ptilde<-(x1+0.5*adj)/(n1+adj)
za2<-qnorm(1-(1-conf.level)/2)
me<-za2*sqrt(ptilde*(1-ptilde)/(n1+adj))
CI<-c(ptilde-me,ptilde+me)}
# two-sample (Agresti-Caffo) interval
if (n2>0){
ptilde1<-(x1+0.25*adj)/(n1+0.5*adj)
ptilde2<-(x2+0.25*adj)/(n2+0.5*adj)
za2<-qnorm(1-(1-conf.level)/2)
me<-za2*sqrt(ptilde1*(1-ptilde1)/(n1+0.5*adj) + ptilde2*(1-ptilde2)/(n2+0.5*adj))
CI<-c(ptilde1-ptilde2-me,ptilde1-ptilde2+me)}
CI
}
You only ever need to put that function in once and then can keep re-using it with different data values. To get the output in Example 7.8 on pages 340-341 I would use:
propAC(3,200,conf.level=0.95)
Note that it is different from what the book has because the book rounded a lot. 5/204 is 0.0245 and they rounded it to 0.025. You should really only do any heavy rounding at the very end. For example, you could round the final interval to four decimal places using:
round(propAC(3,200,conf.level=0.95),4)
Using the type="z" option gives the more general version of this interval, where z-squared new observations are added instead of four. In general it is probably better to use this option, especially if you aren't making a 95% interval.
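For example, a usage sketch of the function above with that option and a 90% confidence level:
propAC(3,200,conf.level=0.90,type="z")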
*Interval for One Variance: Example 7.11, pages 352-354
R doesn't have this built in (which isn't that horrible since we should rarely do it because it isn't robust... but it does make it odd that they have the F for two variances then). In any case it's pretty easy to write a function to do it. To analyze the data, cut and paste the function in without changing it. You only need to put the function in once.
chisquare.var<-function(y,sigma2=1,conf.level=0.95){
n<-length(y)
alpha<-1-conf.level
# test statistic for H0: variance = sigma2
chisquare<-(n-1)*var(y)/sigma2
pval.low<-pchisq(chisquare,df=n-1)
pval.hi<-1-pchisq(chisquare,df=n-1)
pval.not<-2*min(pval.low,pval.hi)
# confidence interval for the variance (and the sd below)
cilow<-(n-1)*var(y)/qchisq(1-alpha/2,df=n-1)
cihi<-(n-1)*var(y)/qchisq(alpha/2,df=n-1)
list(chisquare=chisquare,pval.for.less.than=pval.low,
pval.for.greater.than=pval.hi,pval.for.not.equal=pval.not,
ci.for.variance=c(cilow,cihi),ci.for.sd=c(sqrt(cilow),sqrt(cihi)))}
The data set FISHDDT can either be found on the CD with the book, or from the web.
#Read in the data using read.table and then
# select the fifth column (the one with the weight).
# The head=T tells it that there are column headings
# in the data set.
FISH<-read.table("http://people.stat.sc.edu/habing/515/data/FISHDDT.TXT",head=T)
Weights<-FISH[,5]
chisquare.var(Weights,conf.level=0.95)
qqnorm(Weights);qqline(Weights)
The confidence interval for the variance and standard deviation are at the bottom of the output. They differ slightly from those of the text book since the text book went up to 150 degrees of freedom instead of using 143. Some other programs may give slightly different answers because they may decide not to split the 5% in the two tails equally since the chi-squared distribution is skewed.
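If you want to see how much that choice of degrees of freedom matters, here is a quick sketch computing the variance interval by hand both ways, reusing the Weights vector read in above (the first line should essentially match chisquare.var; the second uses the 150 degrees of freedom the book's table jumps to):
n<-length(Weights)
(n-1)*var(Weights)/qchisq(c(0.975,0.025),df=n-1)
(n-1)*var(Weights)/qchisq(c(0.975,0.025),df=150)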
Since the q-q plot is slightly skewed right and the variance-chi-square relationship is not robust, I would not trust this interval in this case.
*Interval for Two Means: Example 9.4, pages 442-444
New<-c(80,80,79,81,76,66,71,76,70,85)
Stand<-c(79,62,70,68,73,76,86,73,72,68,75,66)
t.test(New,Stand,conf.level=0.95)
par(mfrow=c(1,2))
qqnorm(New);qqline(New)
qqnorm(Stand);qqline(Stand)
par(mfrow=c(1,1))
This interval differs from the one in the book because it doesn't assume equal variances. The one with equal variances (with s-squared-p) would be run using:
t.test(New,Stand,conf.level=0.95,var.equal=T)
Note that the same function makes both confidence intervals and does the test. The options for the test sometimes change the interval though so I wouldn't look at the CI if you specified an alternative (unless you meant to find a lower or upper confidence bound instead of an interval). The value of conf.level tells you what percent the confidence interval is for. The last four lines are for checking the assumptions. One of the values for the Standard Method is out there a bit in the second q-q plot, but since this method is fairly robust I would still be ok with it.
*Interval for Paired Differences: Example 9.5 on pages 458-460
male<-c(29300,41500,40400,38500,43500,37800,69500,41200,38400,59200)
female<-c(28800,41600,39800,38500,42600,38000,69200,40100,38200,58500)
diff<-male-female
t.test(male,female,conf.level=0.95,paired=T)
t.test(diff,conf.level=0.95)
qqnorm(diff);qqline(diff)
Notice that two ways to do the same confidence interval are given, and that the last line is used to check the normality of the differences. As with the one- and two-sample t-intervals don't set the alternative if you are trying to make a confidence interval (unless you meant to find a lower or upper confidence bound).
*Interval for Two Proportions: Example on pages 465-466
There are several different versions of the confidence interval for two-proportions, and R doesn't do the one we want. To get the Agresti-Caffo confidence interval (from the formula sheet) I would first cut and paste in the following function:
propAC<-function(x1,n1,x2=0,n2=0,conf.level=0.95){
if (n2==0){
ptilde<-(x1+2)/(n1+4)
za2<-qnorm(1-(1-conf.level)/2)
me<-za2*sqrt(ptilde*(1-ptilde)/(n1+4))
CI<-c(ptilde-me,ptilde+me)}
if (n2>0){
ptilde1<-(x1+1)/(n1+2)
ptilde2<-(x2+1)/(n2+2)
za2<-qnorm(1-(1-conf.level)/2)
me<-za2*sqrt(ptilde1*(1-ptilde1)/(n1+2) + ptilde2*(1-ptilde2)/(n2+2))
CI<-c(ptilde1-ptilde2-me,ptilde1-ptilde2+me)}
CI
}
You only ever need to put that function in once and then can keep re-using it with different data values. To get output similar to that in the example on pages 465-466 I would use:
propAC(546,1000,475,1000)
Note that it differs from the one in the book because the one in the book doesn't use the correction.
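To see where the difference comes from, here is a sketch of the book-style (uncorrected) interval computed by hand from the same counts:
p1<-546/1000
p2<-475/1000
me<-qnorm(0.975)*sqrt(p1*(1-p1)/1000+p2*(1-p2)/1000)
c(p1-p2-me,p1-p2+me)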
*Interval for Two Variances: Example 9.12, page 482-483
New<-c(80,80,79,81,76,66,71,76,70,85)
Stand<-c(79,62,70,68,73,76,86,73,72,68,75,66)
var.test(New,Stand,conf.level=0.95)
par(mfrow=c(1,2))
qqnorm(New);qqline(New)
qqnorm(Stand);qqline(Stand)
par(mfrow=c(1,1))
Note that the same function makes both confidence intervals and does the test. The options for the test sometimes change the interval though so I wouldn't look at the CI if you specified an alternative. The value of conf.level tells you what percent the confidence interval is for. The last four lines are for checking the assumptions. It's hard to tell with this few points, but one of the observations for the Standard Method is out there a ways... and this method isn't very robust to violations of normality.
Hypothesis Tests:
*Test for One Mean: Examples 8.8-8.9 on pages 394-396
Emission<-c(15.6,16.2,22.5,20.5,16.4,19.4,19.6,17.9,12.7,14.9)
t.test(Emission,mu=20,alternative="less")
qqnorm(Emission);qqline(Emission)
The options for alternative are "greater", "less", and "two.sided" (the last one for not-equals, and its the default). The last line is for the assumption checking plot.
*Test for One Proportion: Example 8.10-8.11, page 402-403
binom.test(10,300,p=0.05,alternative="less")
Note this is the exact test (using the binomial distribution) and not the one the book gives. The options for alternative are "greater", "less", and "two.sided" (the last one for not-equals, and its the default).
To get the answer in the text you could use the function prop.test.
prop.test(10,300,p=0.05,alternative="less",correct=F)
Here the chi-square value of 1.7544 is the square of the z value -1.32. Since this is a badly done (no continuity correction) approximate (normal approximation to the binomial) test, it should not be used instead of the exact one from binom.test. Notice that it would actually make a difference if you were using alpha=0.10.
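As a quick check of that squared relationship, here is a sketch computing the book's z statistic by hand:
z<-(10/300-0.05)/sqrt(0.05*0.95/300)
z      # about -1.32
z^2    # about 1.75, matching the chi-square above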
*Test for One Variance: Example 8.13-8.14, pages 418-420
R doesn't have this built in (which isn't that horrible since we should probably never do it... but it does make it odd that they have the F for two variances then). In any case it's pretty easy to write a function to do it. To analyze the data, cut and paste the function in without changing it. You only need to put the function in once.
chisquare.var<-function(y,sigma2=1,conf.level=0.95){
n<-length(y)
alpha<-1-conf.level
chisquare<-(n-1)*var(y)/sigma2
pval.low<-pchisq(chisquare,df=n-1)
pval.hi<-1-pchisq(chisquare,df=n-1)
pval.not<-2*min(pval.low,pval.hi)
cilow<-(n-1)*var(y)/qchisq(1-alpha/2,df=n-1)
cihi<-(n-1)*var(y)/qchisq(alpha/2,df=n-1)
list(chisquare=chisquare,pval.for.less.than=pval.low,
pval.for.greater.than=pval.hi,pval.for.not.equal=pval.not,
ci.for.variance=c(cilow,cihi),ci.for.sd=c(sqrt(cilow),sqrt(cihi)))}
This example uses the data in Table 8.6 on page 417 and gives the output corresponding to page 419.
fill.weight<-c(16.00,15.95,16.10,16.02,15.99,16.06,16.04,16.05,16.03,16.02)
chisquare.var(fill.weight,sigma2=0.01)
qqnorm(fill.weight);qqline(fill.weight)
Notice that this output gives you all three p-values (the one in the example is the first one). The last line is the q-q plot for checking the assumptions.
*Test for Two Means: Example 9.4, pages 442-443
The following does the test with the data in Example 9.4 and the "not equals" alternate hypothesis.
New<-c(80,80,79,81,76,66,71,76,70,85)
Stand<-c(79,62,70,68,73,76,86,73,72,68,75,66)
t.test(New,Stand,alternative="two.sided")
par(mfrow=c(1,2))
qqnorm(New);qqline(New)
qqnorm(Stand);qqline(Stand)
par(mfrow=c(1,1))
This test gives the test statistic, p-value, and df on the line "Not Equal" in figure 9.6. The one with equal variances (labeled "Equal" and using s-squared-p) would be run using:
t.test(New,Stand,alternative="two.sided",var.equal=T)
The options for alternative are "greater", "less", and "two.sided" (the last one for not-equals, and its the default). The last four lines are for checking the assumptions.
*Test for Paired Differences: Example on pages 453-455
new<-c(77,74,82,73,87,69,66,80)
stand<-c(72,68,76,68,84,68,61,76)
diff<-new-stand
t.test(new,stand,alternative="greater",paired=T)
t.test(diff,alternative="greater")
qqnorm(diff);qqline(diff)
Notice that two ways to do the same test are given, and that the last line is used to check the normality of the differences. Options for alternative are "greater", "less", or "two.sided" (this last one is for not equals).
*Test for Two Proportions: Examples 9.6-9.7, pages 468-469:
R doesn't have this test built in for two proportions in a nice way. To run it, first cut and paste the function in exactly as is:
two.prop.test<-function(x1,n1,x2,n2){
p1hat<-x1/n1
p2hat<-x2/n2
# pooled proportion used in the standard error under H0
pbar<-(x1+x2)/(n1+n2)
z<-(p1hat-p2hat)/sqrt(pbar*(1-pbar)*(1/n1+1/n2))
pval.low<-pnorm(z)
pval.hi<-1-pnorm(z)
pval.not<-2*min(pval.low,pval.hi)
list(z=z,pval.for.less=pval.low,pval.for.greater=pval.hi,
pval.for.not.equal=pval.not)}
Then to get the output for the examples on pages 468-469 you could use:
two.prop.test(555,1500,578,1750)
*Test for Two Variances: Example 9.12, page 482-483
New<-c(80,80,79,81,76,66,71,76,70,85)
Stand<-c(79,62,70,68,73,76,86,73,72,68,75,66)
var.test(New,Stand,alternative="two.sided")
par(mfrow=c(1,2))
qqnorm(New);qqline(New)
qqnorm(Stand);qqline(Stand)
par(mfrow=c(1,1))
The options for alternative are "greater", "less", and "two.sided" (the last one for not-equals, and its the default).
Nonparametric Tests for Two Independent Samples:
*Mann-Whitney-Wilcoxon Test: The data in Section 14.3 can be analyzed using the function wilcox.test:
DrugA<-c(1.96,2.24,1.71,2.41,1.62,1.93)
DrugB<-c(2.11,2.43,2.07,2.71,2.50,2.84,2.88)
wilcox.test(DrugA,DrugB,alternative="less",exact=T)
Note that R reports the W statistic instead of the T one the book uses. The relationship between them is that W = T - n(n+1)/2, where n is the size of the first sample (which here is also the smaller one).
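As a quick check of that relationship with this data (a sketch; the first six entries of the combined ranking belong to DrugA):
ranks<-rank(c(DrugA,DrugB))
T.A<-sum(ranks[1:6])
T.A            # the rank sum T for Drug A
T.A-6*7/2      # should match the W that wilcox.test reports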
If there was a tie between the two data sets you will get a warning message saying they had to use the approximate test. This is similar to page 14-4 in the text, except that R uses a continuity correction by default.
*Randomization Test: To do the randomization test (using reallocation, like at lock5stat.com) we could use the following R function:
reallocate<-function(X,Y,nreallocations=1000,alpha=0.05,alternative="two.sided"){
# Get Test Statistic
teststat<-mean(X)-mean(Y)
teststat
# Get Critical Regions
alldata<-c(X,Y)
n1<-length(X)
n2<-length(Y)
diffvector<-NULL
for (i in 1:nreallocations){
resampled<-sample(alldata,n1+n2,replace=F)
diff<-mean(resampled[1:n1])-mean(resampled[(n1+1):(n1+n2)])
diffvector<-c(diffvector,diff)
}
# Graph, give critical region and test stat, and give p.value
eps<-(max(diffvector)-min(diffvector))/20
hist(diffvector,nclass=50,xlim=c(min(teststat,min(diffvector))-eps,max(teststat,max(diffvector))+eps),right=F)
lines(c(teststat,teststat),c(0,nreallocations),col="green",lwd=2)
if (alternative=="two.sided"){
critval.low<-quantile(diffvector,alpha/2)
critval.high<-quantile(diffvector,(1-alpha/2))
lines(c(critval.low,critval.low),c(0,nreallocations),col="red",lwd=2)
lines(c(critval.high,critval.high),c(0,nreallocations),col="red",lwd=2)
low<-sum(diffvector<=teststat)/(nreallocations+1)
high<-sum(diffvector>=teststat)/(nreallocations+1)
pvalue<-2*min(low,high)
}
if (alternative=="less"){
critval.low<-quantile(diffvector,alpha)
critval.high<-NULL
lines(c(critval.low,critval.low),c(0,nreallocations),col="red",lwd=2)
pvalue<-sum(diffvector<=teststat)/(nreallocations+1)
}
if (alternative=="greater"){
critval.low<-NULL
critval.high<-quantile(diffvector,(1-alpha))
lines(c(critval.high,critval.high),c(0,nreallocations),col="red",lwd=2)
pvalue<-sum(diffvector>=teststat)/(nreallocations+1)
}
list(test.statistic=teststat,critical.value.low=critval.low,
critical.value.high=critval.high,pvalue=pvalue)
}
To use it on the drug example you would first cut and paste the function in (you only need to do that once!). You could then run the following code.
DrugA<-c(1.96,2.24,1.71,2.41,1.62,1.93)
DrugB<-c(2.11,2.43,2.07,2.71,2.50,2.84,2.88)
reallocate(DrugA,DrugB,alternative="less")
The histogram is of the difference of means calculated from all of the different reassignments of the observations to the two different groups. The green line is the test statistic, and the red line is the critical value. As noted in the supplement to Section 14.3, there can be some difficulties in figuring out what alpha level you have if you use the rejection region approach, especially for small sample sizes -- and it seems better to just use the p-value.
Notice that every time you run the function you get a slightly different set of critical values and p-values because it is random. The argument nreallocations is the number of times you want to randomize it (the number of data points used to make the histogram) and the default is 1,000 so that you can make sure your computer is running it quickly enough. If it is, then it is probably good to increase that number - my computer can run this example with 100,000 reallocations in about 20 seconds. alpha is the alpha level of the test for the graph and critical values, and alternative is either "less", "greater", or "two.sided".
ANOVA:
For examples 10.4-10.5 on pages 516-519:
distance<-c(251.2,245.1,248.0,251.1,260.5,250.0,253.9,244.6,254.6,248.8,
263.2,262.9,265.0,254.5,264.3,257.0,262.8,264.4,260.6,255.9,
269.7,263.2,277.5,267.4,270.5,265.5,270.7,272.9,275.6,266.5,
251.6,248.6,249.4,242.0,246.5,251.3,261.8,249.0,247.1,245.9)
brand<-c(rep("A",10),rep("B",10),rep("C",10),rep("D",10))
brand<-as.factor(brand)
anova(lm(distance~brand))
tapply(distance,brand,mean)
par(mfrow=c(1,2))
plot(lm(distance~brand),which=c(1:2),add.smooth=F)
par(mfrow=c(1,1))
The first four lines enter the distances a brand at a time (they don't need to be on separate lines, but that makes it easier to check). The next two lines enter the brands (10 of each) and put them in the right format. The line beginning with anova gives the first two lines of the ANOVA table, and the tapply line gives the mean of each group. Finally, the last three lines give the assumption plots like we would use for regression too.
Simple Linear Regression:
Example 11.2 on pages 590-595, 11.3 on page 605-606, 11.4 on pages 610-611, and 11.6 on page 622:
x<-c(1,2,3,4,5)
y<-c(1,1,2,2,4)
plot(x,y)
abline(lm(y~x))
lm(y~x)
summary(lm(y~x))
anova(lm(y~x))
The first two lines enter the data. The next two give the plot with the regression line added. The last three give the estimated beta0 and beta1, the full table of parameter estimates (with their standard errors and tests), and the first two lines of the ANOVA table. The R-squared is called "Multiple R-squared" in this output.
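If you just want the R-squared value by itself, here is a small sketch of pulling it out of the summary:
summary(lm(y~x))$r.squared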
The two assumption plots can be gotten using:
par(mfrow=c(1,2))
plot(lm(y~x),which=c(1:2),add.smooth=F)
par(mfrow=c(1,1))
The confidence interval for the regression line and prediction interval for a new observation for examples 11.7 and 11.8 on pages 628-629 can be gotten using:
predict.lm(lm(y~x),data.frame(x=4),interval="confidence")
predict.lm(lm(y~x),data.frame(x=4),interval="predict")
Contingency Tables:
*Goodness of Fit: Example 13.2 on pages 776-777:
obs<-c(39,99,336,26)
p<-c(.07,.18,.65,.10)
chisq.test(obs,p=p,correct=F)
chisq.test(obs,p=p,correct=F)$expected
The first chisq.test line gives the actual chi-square test results, the second gives the expected values that it uses.
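To see where that statistic comes from, here is a sketch computing it by hand from the counts and hypothesized proportions entered above:
expected<-sum(obs)*p
expected
sum((obs-expected)^2/expected)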
*Homogeneity or Independence: Example 13.3 on page 787
obs<-matrix(c(39,19,12,28,18,172,61,44,70,37),byrow=T,nrow=2)
chisq.test(obs,correct=F)
chisq.test(obs,correct=F)$expected
Note that the data is given a row at a time, not counting the totals.