Throughout this page, portions presented in "typewriter font" are designed to be cut and pasted directly into the R window as you go through.
1: An Introduction
2: Statistical Analysis
3: Manipulating Data
4: Graphics
5: Psychometrics Packages
6: Function Writing
7: Sample Simulations
Part 1: An Introduction:
Getting Started:
R, like the commercial S-Plus, is based on the statistical programming language S. Unlike S-Plus, it can be downloaded for free from www.r-project.org [Currently, to do so, choose: CRAN, then a mirror site in the US, then Download R for Windows, then base, then "Download R 3.4.0 for Windows"]. This page also has links to FAQs and other information about R.
When you start R, you will see several standard looking menus along the top (File, Edit, etc...) and a white R Console window taking up the majority of the screen. This window is where you will both enter any data, commands, or "programs" that you wish to run, and also where the output will appear.
Because this window is used for so many things it often fills up quickly, so if you are doing anything involving a large number of steps, it is often easiest to type them in a text editor first. There are several editors built in to R or designed for use with it. One option is to choose New Script under the File menu. You can then enter your commands in that window and run them by selecting the text and using Ctrl+R. You can also save those scripts for later use. We've also found it easy across platforms and time to use a really basic program like Notepad in Windows and then copy and paste the code over as needed.
Also very useful are the facts that, in the main R Console window, the up and down arrows on the keyboard can be used to scroll through previously entered lines, and that history() will open a window of previously entered commands (which we'll see below after entering some). If the font in the R Console is too small, or if you dislike the color or font, you can change it by selecting GUI Preferences... under the Edit menu.
Objects:
At the heart of R are the various objects that you enter. An object could be data (in the form of a single value, a vector, a matrix, an array, a list, or a data frame) or a function that you created. Objects are created by assigning a value to the object's name using either <- or =. For example
x<-3
creates an object called x that contains the (real) number 3. If you simply cut and paste that command in at the > prompt, R will only give you another prompt. This is because you merely assigned the value... you didn't ask R to do anything with it. Typing
x
will now return the number 3, the value of x. R is case sensitive, so entering
X<-5
X
will create a separate object. If you reassign an object, say:
X<-7
the original value is over-written. If you attempt to use the name of a built in function or constant (such as c, t, t.test, or pi) for one of your variable names, you will likely find that future work you are trying to do gives unexpected results. Notice in the name of t.test that periods are allowed in names of objects. Other symbols (except numbers and letters) are not allowed.
Note that the up-arrow and history() will now show us the commands we entered previously. This set of all of your created objects (your Workspace) is typically saved when you exit R (unless you choose otherwise, or don't have appropriate permissions to save it automatically in the directory where R is installed) and reloaded the next time you start R. You can also load and save the workspace in other locations using the options in the File menu.
Arithmetic and Parentheses:
X-x
7-3
will both return the value 4, one by performing the arithmetic on the objects, and the other on the numbers. The other basic mathematical operators are:
+ | addition |
- | subtraction |
* | multiplication |
/ | division |
^ | exponentiation |
%*% | matrix multiplication |
R will often try to do the "common-sense" thing when using arithmetic arguments. For example, if Y is a vector or matrix of values, then Y+4 will add 4 to each of the values in Y. (So the vector 3,2,5 would become 7,6,9).
Parentheses work as usual in mathematical statements, but they do not imply multiplication.
X(x+5)
X*(x+5)
Notice that the former returns an error about looking for a function called X, while the latter does the arithmetic to return the value 40.
The other use of parentheses in R is to indicate that you are attempting to run a function; if the function has any options, they go inside the parentheses.
objects()
will return the list of all of the objects you currently have in R. Typing it without the parentheses will just show you the function instead of trying to run it! The command:
rnorm(10)
runs the function rnorm with the argument 10. In this case it is generating a random sample of 10 values from a normal distribution.
Help!:
To see this, we could run the help function on that command.
help(rnorm)
A shortcut, ?rnorm, would also have worked.
Every help file in R begins with a brief description of the function (or group of functions) in question, followed by all of the possible options (a.k.a. Arguments) that you can provide. In this case the value n (the number of observations) is required. Notice that all of the other options are shown as being = some value - this indicates the defaults those values take if you do not enter them. The sample we generated above thus has mean 0 and standard deviation 1.
Below the list of arguments are a brief summary of what the function does (called the Details), a list of the Value (or values) returned by running the function, the Source of the algorithm, and general References on the topic. See Also is often the most useful part of the help for a function as it provides a list of related functions. Finally, there are some Examples that you can cut and paste in to observe the function in action.
Functions:
It is always safest to enter the values of functions using the names of the arguments:
rnorm(10,sd=4)
rather than trusting that the argument you want happens to be first in the list:
rnorm(10,4)
Notice the former puts 4 in for the standard deviation, while the latter is putting it in for the second overall argument, the mean (as seen in the help file).
Typing objects() again we see that these values we generated have not been saved as an object, and exist solely on the screen. To save the values we could have assigned the output of our function to an object.
normal.sample<-rnorm(50)
normal.sample
A few common statistical functions include:
mean() | find the mean |
median() | find the median |
sd() | find the standard deviation |
var() | find the variance |
quantile() | find the quantiles (percentiles); requires the data and the percentile you want e.g. quantile(normal.sample,.5) is the median |
max() | find the maximum |
min() | find the minimum |
summary() | find the 5-number summary |
hist() | construct a histogram |
boxplot() | construct a boxplot |
qqnorm() | construct a normal quantile-quantile plot |
qqline() | add the line to a normal quantile-quantile plot |
Trying a few of these out (like mean(normal.sample)) will show us the descriptive statistics and basic graphs for a sample of size 50 from a normal population with mean 0 and standard deviation 1. (Using up arrow can make it quicker to try several in a row.)
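For example, assuming normal.sample was created as above, a few of these could be tried as follows (each line can be pasted in on its own):
mean(normal.sample)
sd(normal.sample)
quantile(normal.sample,.5)
summary(normal.sample)
hist(normal.sample)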
As we will see in more detail later, it is possible to create your own functions by using the function function. This one creates a simple measure of skewness.
Skew<-function(x){
(mean(x)-median(x))/sd(x)}
Note that braces { } in R are used to group several separate commands together, and also occur when using programming commands like loops or if-then statements. They work the same as parentheses in arithmetic expressions.
After entering our new function, it works like any built in function, except that it appears in our objects list.
Skew
Skew()
Skew(normal.sample)
objects()
There are a number of mathematical functions as well. Ones common in statistical applications include the following (a few are demonstrated after the table):
sqrt() | square root |
exp() | exponent (e to the power) |
log() | the natural logarithm by default |
abs() | absolute values |
floor() | round down |
ceiling() | round up |
round() | round to the nearest (ties such as .5 go to the even digit) |
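As a quick illustration of the rounding functions (the particular values here are just examples), note that round follows the ties-to-even rule:
floor(2.7)     #2
ceiling(2.3)   #3
round(2.5)     #2 (ties go to the even digit)
round(3.5)     #4
round(2.567,2) #2.57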
Vectors, Matrices, and Arrays:
The output from rnorm() is different from the X and x we created as it contains more than just a single value - they are vectors. While we can think of them as vectors in the mathematical sense, we can also think of a vector as simply listing the values of a variable.
Vectors in R are created using the c function (as in concatenate). Thus
Y<-c(3,2,5)
Y
Y+4
Y*2
creates a vector of length three (and we can verify that arithmetic works on it componentwise). Given two vectors arithmetic is also done componentwise:
Z<-c(1,2,3)
Y+Z
Y*Z
Other functions are also evaluated component-wise (if possible):
sqrt(Z)
Multiple vectors can be combined together by using the c function:
YandZ<-c(Y,Z)
YandZ
nums<-c(1,2,3)
nums
lttrs<-c("a","b","c")
lttrs
c(nums,lttrs)
c(nums,lttrs)+2
Once we have our desired vector, an element in the vector can be referred to by using square brackets:
YandZ[2]
YandZ[2:4]
YandZ[c(1,4:6)]
By using the c function, and the : to indicate a sequence of numbers, you can quickly refer to the particular portion of the data you are concerned with.
Matrices (two-dimensional) and arrays (more than two dimensions) work similarly - they use brackets to find particular values, and all the values in an array or matrix must be of the same type (e.g. numeric, character, or factor). In the case of matrices, the first values in the brackets indicate the desired rows, and the ones after the comma indicate the desired columns.
Xmat<-matrix(c(3,2,5,1,2,3),ncol=3,byrow=T)
Xmat
Ymat<-rbind(Y,Z)
Ymat
Xmat[1,2:3]
Zmat<-cbind(nums,lttrs)
Zmat
In the above code, matrix is the function to form a vector into a matrix, rbind stacks multiple vectors (or matrices) as the rows of a new matrix (if the dimensions match), and cbind does the same for columns.
Data Frames, Lists, and Attributes:
In many cases the data set we wish to analyze will not have all of the rows or columns in the same format. This type of data is stored in R as a data frame. scdata.txt is one such data set, and it can be found in the directory of course materials (its description is found in scdata.pdf). The data can be read in directly from where you saved it on your hard-drive using code similar to the below (with the correct drive and directory substituted in).
sctable<-read.table("c:/RforQM/scdata.txt",head=T)
sctable
If you don't want to view the entire data set, you can use head and tail to see the first and last rows. This lets you check the variable names as well as the number of observations successfully read in.
head(sctable)
tail(sctable)
There are a large number of ways to access the rows and columns of a data frame, and one of them is to treat it similarly to a matrix.
County1<-sctable[1,]
Birth.Death<-sctable[,3:4]
This simplicity sometimes causes trouble though. While Birth.Death may look on the screen like it is a matrix, it is still a data frame and many functions which use matrix operations (like matrix multiplication) will give an error. The attributes function will show us the true status of our object (it returns NULL for a numeric vector and the dimensions if a matrix):
Birth.Death
attributes(Birth.Death)
BD.matrix<-as.matrix(Birth.Death)
attributes(BD.matrix)
The $ is used to access whatever corresponds to an entry in the names attribute:
Birth.Death$Births
This is particularly useful when trying to access a portion of the output of a function for later use. For example, later we will see a method of doing statistical inference called the t-test. In R, this is performed by the function t.test which can create a great deal of output on the screen.
t.test(normal.sample)
If you only want the part called the "p-value" for later use we can pull that out of the output.
t.out<-t.test(normal.sample)
t.out
attributes(t.out)
t.out$p.value
We could then save the resulting value as part of a vector or matrix of other p-values, for example.
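As a small sketch of that idea (the second test here is simply run on a fresh simulated sample for illustration):
p.vec<-c(t.out$p.value, t.test(rnorm(50))$p.value)
p.vec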
The $ is also used to access named parts of lists, which we will see can be used to store a variety of kinds of information in a single object.
Packages:
The ability to "easily" write functions in R has lead to an explosion of procedures that researchers have written and made available to the public. Several of these are automatically included with the basic instillation of R. For example, the MASS package (or library) contains all of the functions corresponding to the Springer text Modern Applied Statistics with S by Venables and Ripley. The command library is used to activate a downloaded package and give access to all of its functions, such as fractions which finds a fraction to approximate any decimal you enter.
help(fractions)
library(MASS)
help(fractions)
fractions(0.5)
fractions(pi)
4272943/1360120
Additional packages can be found by using the Packages menu. Load package... contains a list of all of the automatically installed packages, and hundreds of others can be accessed using Install package(s)... if you are connected to the internet and have administrator rights on your computer.
This large number of available packages makes it very easy to conduct a huge variety of analyses, and impossible to know even a fraction of the possibilities off the top of your head.
For personal use, the source command can be used to read in a text file containing previously written functions that you have stored in a .txt file (perhaps on the internet for use by students).
Part 2: Statistical Analysis
There are a huge variety of statistical tests built in to R for analyzing data. In many cases the same function will perform several different tests and confidence intervals, and produce a large amount of output. Below we provide examples of many of these functions, including some that are part of other packages and others that are custom-written functions.
Course Data: Most of the examples of the statistical methods in this section are carried out using the data set CourseData. The data is a stratified sample of 70 students from a large section course. The strata were based on the college the students belonged to (AS = Arts & Sciences, PM = Professional Management, MC = Mass Communications, and NU=Nursing) and their year in school (ranging from 1st to 3rd based on credit hours, and limited based on expectation of having at least 10 students from that college at that grade level). The response variables are their Hmwk = Homework Average and E1 to E3 = their grades on the first three exams.
If the data is read in using read.table and called students, then table(students[,1:2]) should verify the count of the students in each grouping. attach(students) will allow you to reference the variables simply by College, Year, Hmwk, E1, E2, and E3.
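A minimal set of commands for that, assuming the file is saved as Coursedata.txt in the same directory used elsewhere on this page (substitute your own path as needed):
students<-read.table("c:/RforQM/Coursedata.txt",head=T)
table(students[,1:2])
attach(students)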
If you have attached the dataset, some useful subsets of the data can be created to give the different types of scores by group. For example:
E1.1NU<-E1[(College=="NU")&(Year==1)] #E1 for 1st year nursing students
E1.1AS<-E1[(College=="AS")&(Year==1)] #E1 for 1st year arts & sciences students
E3.1NU<-E3[(College=="NU")&(Year==1)] #E3 for 1st year nursing students
grp<-as.factor(paste0(Year,College)) #Put Year and College together
More detail on choosing subsets like this is covered in the upcoming section on Manipulating Data.
There are a variety of graphical tools we could use to get an overview of the data set, and we will see a number of them in the upcoming section on Graphics. Two of note include:
pairs(students[,3:6])
boxplot(E1~grp)
One-sample t-test and interval: Say we wish to test whether the mean score on E1 for 1st year nursing students is greater than the historical exam average of 75.
qqnorm(E1.1NU);qqline(E1.1NU)
t.test(E1.1NU,mu=75,alternative="greater")
Note that this produces a one-sided confidence bound because of the alternative selected. To get the standard 95% interval in the solutions, run it without the alternative="greater" part (you don't need to specify "two.sided" because that is the default setting). The other option is "less". Using help(t.test) we can also see that the default null hypothesis is mu=0, that the function can perform the two-sample tests, and that conf.level controls the confidence level for the confidence intervals.
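For example, the same test run two-sided with the default 95% interval, and then with a 99% interval (using the objects defined above):
t.test(E1.1NU,mu=75)
t.test(E1.1NU,mu=75,conf.level=0.99)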
In some cases it is useful to be able to retrieve only part of the output from a test. For example, a simulation study might only want to use the p-values. The various values that can be extracted from a test can be viewed using the attributes() function.
t.out<-t.test(E1.1NU,mu=75,alternative="greater")
attributes(t.out)
Anything on the attributes list can then be accessed using the $. For example:
t.out$p.value
Many of the other statistical methods built into R share much in common with the t.test function.
The coverage of the different methods below is divided into several sections. Each of them assumes you have created and attached the data.frame students and created the objects E1.1NU, E1.1AS, E3.1NU, and grp.
One-sample t-test
Other Basic Hypothesis Tests
One-way ANOVA and Multiple Comparisons
Regression, Factorial ANOVA, and ANCOVA
Other Methods...
Two-sample t-test and interval: Test whether the mean on E1 for 1st year students from arts and sciences is less than the mean for 1st year students from nursing.
par(mfrow=c(1,2))
qqnorm(E1.1AS);qqline(E1.1AS)
qqnorm(E1.1NU);qqline(E1.1NU)
par(mfrow=c(1,1))
t.test(E1.1AS,E1.1NU,alternative="less",var.equal=T)
The graphics command par(mfrow=c(1,2)) allows for the two qq-plots to be displayed at the same time in the window in a 1 row/2 column display. The (1,1) then returns it to a single graph for the next thing that is plotted.
var.equal=T specifies use of the equal variances assumption. The default is variances not equal (or use var.equal=F). Again, the confidence interval obtained using this code is the one-sided confidence bound; omit the alternative option (or use "two.sided") to get the two-sided confidence interval.
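For example, the two-sided version of the same comparison, which reports the usual 95% interval:
t.test(E1.1AS,E1.1NU,var.equal=T)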
Paired t-test and confidence interval: Test whether the mean for E3 is greater than the mean of E1 for first year nursing students.
#The correct paired test
qqnorm(E3.1NU-E1.1NU);qqline(E3.1NU-E1.1NU)
t.test(E3.1NU,E1.1NU,alternative="greater",paired=T)
The paired=T option makes it do the paired t-test. The output is, of course, the same as conducting the one-sample t.test on the differences.
t.test(E3.1NU-E1.1NU,alternative="greater")
Chi-square test and interval for one variance: Test whether the standard deviation of E1 is greater than 10 for the first year nursing students.
R doesn't have this test built in (which isn't that horrible since we should probably never do it... but it does make it odd that they have the F-test for two variances then). In any case it's pretty easy to write a function to do it. To analyze the data, cut and paste the function in, and then run it on your data. Or, it can be read in using source("c:/RforQM/TWRfns.txt").
chisquare.var<-function(y,sigma2=1,alpha=0.05){
n<-length(y)
chisquare<-(n-1)*var(y)/sigma2
pval.low<-pchisq(chisquare,df=n-1)
pval.hi<-1-pchisq(chisquare,df=n-1)
pval.not<-2*min(pval.low,pval.hi)
cilow<-(n-1)*var(y)/qchisq(1-alpha/2,df=n-1)
cihi<-(n-1)*var(y)/qchisq(alpha/2,df=n-1)
list(chisquare=chisquare,pval.for.less.than=pval.low,
pval.for.greater.than=pval.hi,pval.for.not.equal=pval.not,
ci.for.variance=c(cilow,cihi),ci.for.sd=c(sqrt(cilow),sqrt(cihi)))}
qqnorm(E1.1NU);qqline(E1.1NU)
chisquare.var(E1.1NU,sigma2=10^2)
Note that the confidence intervals here are sub-optimal in terms of length because they place an equal amount of area in each end.
F test for two variances: Test whether the variance of E1 for the first year nursing students is equal to the variance of E1 for the first year arts and sciences students.
par(mfrow=c(1,2))
qqnorm(E1.1NU);qqline(E1.1NU)
qqnorm(E1.1AS);qqline(E1.1AS)
par(mfrow=c(1,1))
var.test(E1.1NU,E1.1AS)
The graphics command par(mfrow=c(1,2)) allows for the two qq-plots to be displayed at the same time in the window in a 1 row/2 column display. The (1,1) then returns it to a single graph for the next thing that is plotted.
Two-sample Modified Levene's Test: Test whether the variance of E1 for the first year nursing students is equal to the variance of E1 for the first year arts and sciences students.
R doesn't contain a built in function for the Modified Levene's test, but it can be carried out simply by running the two sample t.test on the right values.
t.test(abs(E1.1NU-median(E1.1NU)),abs(E1.1AS-median(E1.1AS)),var.equal=T)
Of course, the means this is testing are the means of the absolute deviations from the median. If that measure of spread differs between the two populations, then the variances should differ as well.
The packages car and lawstat have built in modified Levene's test functions, but they require having a response variable and a group variable (like in ANOVA). An easy function for the two sample case would be:
levene2<-function(data1,data2){
print("p-value for testing null hypothesis of equal variances")
t.test(abs(data1-median(data1)),abs(data2-median(data2)),var.equal=T)$p.value}
levene2(E1.1NU,E1.1AS)
One-way ANOVA and Multiple Comparisons:
One-way ANOVA: Test whether the means of E1 are equal for the seven populations of students.
The basic function for conducting an ANOVA (or any linear model) is lm:
lm(E1~grp)
The ~ indicates that you are specifying a model equation. The variable to the left of the tilde is the response variable, and the variable to the right is the predictor variable. (In multiple regression and more complicated ANOVA there can be multiple predictor variables).
Unfortunately the output from lm seems pretty meager - it is just the estimates of the parameters in the model equation. Using:
attributes(lm(E1~grp))
shows that there is a lot more behind the scenes. The functions anova, summary, and plot can be used to extract much of this information.
E1fit<-lm(E1~grp)
anova(E1fit)
summary(E1fit)
par(mfrow=c(1,2))
plot(E1fit,1)
plot(E1fit,2)
par(mfrow=c(1,1))
The 1 and 2 in the plot function request the first two graphs that are automatically produced when plotting the result of lm - the first is the residual vs. predicted plot, and the second is the qq-plot of the residuals. Leaving out the number will produce four plots, but only one at a time (you have to click on the graphics window to advance to the next one).
In some cases it is necessary to enter the data on your own, and then it is important to be sure that the group variable is a "factor". For example, imagine that you had the vector E1 entered, but did not have the vector of group memberships yet. We know however that the test scores go with different groups in sets of 10 (observations 1-10 are 1NU, 11-20 are 1PM, etc...). One way we could do a set of labels is:
grplabs<-c(rep(1,10),rep(2,10),rep(3,10),
rep(4,10),rep(5,10),rep(6,10),rep(7,10))
grplabs
The rep repeats the value in the first spot the number of times indicated in the second. Trying
lm(E1~grplabs)
gives a very different result from before though. Because the predictor variable is numeric, by default it attempted to do a regression instead of an ANOVA. Using:
grplabs<-as.factor(grplabs)
lm(E1~grplabs)
solves that problem. If we had entered the names of the various groups in quotation marks, instead of 1-7, then grplabs would have been character instead of numeric, and it would have worked without being converted to a factor (although some other functions might give a warning message).
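A slightly shorter way to build the same labels, assuming the same ten-observations-per-group layout, uses the each argument of rep:
grplabs<-as.factor(rep(1:7,each=10))
lm(E1~grplabs)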
Modified Levene Test: Test whether the variances of E1 are equal for the seven populations of students.
The modified Levene test is available in the package car, so if that package has been installed it can be conducted using:
library(car)
leveneTest(E1,grp)
All Pairwise Comparisons - Tukey's HSD and Scheffe: Simultaneously construct confidence intervals around the differences in the mean of E2 between each pair of groups, or test the hypotheses of equality of the mean E2 score between each pair of groups.
R has a built in function TukeyHSD that will construct the intervals in the balanced case, and includes an adjustment for the slightly unbalanced cases.
TukeyHSD(aov(E2~grp))
The structure is a little inelegant (it uses aov, an alternative to anova, on the original lm statement). The output is also not in the prettiest form... although that might be the only form that works if the design is unbalanced.
As an alternative the function allpairs below will make the "prettier" display for either Tukey's HSD or Scheffe's method in the case where the different groups/treatments all have the same sample size. To run it, simply copy in the function as is, and then run it with the appropriate data.
#The following function performs all pairwise comparisons using
# either Tukey's HSD ("Tukey") or Scheffe's method ("Scheffe").
# The function only needs to be copied in once.
#NOTE: This function requires that all treatments have
# equal sample size.
allpairs<-function(y,treat,method="Tukey",alpha=0.05){
  dat.reord<-order(treat)
  treat<-treat[dat.reord]
  y<-y[dat.reord]
  s2w<-anova(lm(y~treat))[2,3]
  t<-length(table(treat))
  n<-length(y)/t
  df<-n*t-t
  qval<-qtukey(1-alpha,t,df)
  if (method=="Tukey"){stat<-qval*sqrt(s2w/n)}
  if (method=="Scheffe"){stat<-sqrt(2*s2w/n*(t-1)*qf(1-alpha,t-1,df))}
  chars<-c("A ","B ","C ","D ","E ","F ","G ","H ",
    "I ","J ","L ","M ","N ","O ","P ","Q ")
  means<-tapply(y,treat,mean)
  ord.means<-order(-means)
  treat<-treat[ord.means]
  means<-means[ord.means]
  grp<-1
  current<-1
  last<-0
  lastsofar<-0
  charmat<-NULL
  while(last<t){
    newchar<-rep(" ",t)
    for (i in current:t){
      if (abs(means[current]-means[i])<stat){
        newchar[i]<-chars[grp]
        last<-i }}
    current<-current+1
    if (last>lastsofar){
      charmat<-cbind(charmat,newchar)
      grp<-grp+1
      lastsofar<-last}}
  charmat<-apply(charmat,1,"paste",sep="",collapse="")
  list(Method=paste(method,", alpha=",as.character(alpha),sep="",collapse=""),
    Critical.Val=stat,
    Display=data.frame(Grp=charmat,Mean=means)) }
allpairs(E2,grp)
allpairs(E2,grp,method="Scheffe")
The built-in function pairwise.t.test will conduct all of the pairwise tests using the step-down Bonferroni (Holm) procedure by default (a minimal call is shown below). In the balanced case that will be very sub-optimal compared to Tukey's HSD, and when the decision of which comparisons to test is made after examining the data it won't have the protection of Scheffe's method.
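A minimal call, using the E2 and grp objects defined earlier (the p.adjust.method argument controls the correction and defaults to "holm"):
pairwise.t.test(E2,grp)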
Comparison to a Control - Dunnett's method: Simultaneously construct confidence intervals around the differences for the E2 means between the 1st year nursing students (the control) and all of the others, or test the corresponding hypotheses of equality.
There are several packages in R that contain ways of doing Dunnett's method... but they are all rather opaque. The following function carries it out when the data are balanced. Again, simply copy the entire function in, and then run it on your data. The control argument gives which of the treatments/groups is the control (its position in the order shown by running levels() on the group variable).
#The following function performs Dunnett's comparison with a control.
# It requires that the MCPMod package be installed and that
# you have run library(MCPMod). The default alternative is the
# two-sided hypothesis. "greater" tests the alternate hypothesis that
# the other treatments have a larger mean than the control. "less"
# tests for smaller means. The function only needs to be copied in once.
#NOTE: This function requires that all treatments have equal sample size.
dunnett<-function(y,treat,control=1,alternative="two.sided",alpha=0.05){
  dat.reord<-order(treat)
  treat<-treat[dat.reord]
  y<-y[dat.reord]
  s2w<-anova(lm(y~treat))[2,3]
  t<-length(table(treat))
  n<-length(y)/t
  if (alternative=="two.sided"){alt<-TRUE}
  if (alternative!="two.sided"){alt<-FALSE}
  dval<-critVal(rbind(-1,diag(t-1)),rep(n,t),alpha=alpha,twoSide=alt)
  D<-dval*sqrt(2*s2w/n)
  comp<-NULL
  yimyc<-NULL
  sig<-NULL
  count<-0
  for (i in ((1:t)[-control])){
    count<-count+1
    comp<-rbind(comp,paste(as.character(treat[i*n]),"-",as.character(treat[control*n])))
    yimyc<-rbind(yimyc,mean(y[treat==treat[i*n]])-mean(y[treat==treat[control*n]]))
    sigt<-""
    if (((yimyc[count,1])>=D)&(alternative!="less")){sigt<-"***"}
    if (((yimyc[count,1])<=(-D))&(alternative!="greater")){sigt<-"***"}
    sig<-rbind(sig,sigt) }
  out.order<-order(-yimyc)
  list(Method=paste("Dunnett, alternative=",alternative,","," alpha=",
    as.character(alpha),sep="",collapse=""),
    Critical.D=D,Differences=data.frame(Comparison=comp[out.order],
    Observed.Diff=yimyc[out.order],Significant=sig[out.order]))}
levels(grp) #notice that 1st year nursing students are the 3rd group
dunnett(E2,grp,control=3)
Specific Contrasts - Step-wise Bonferroni and Scheffe: Simultaneously test whether the mean E2 score of 1st year AS students is different from the mean of the other 1st year students, whether the mean E2 score of 1st year AS students is different from the mean score of the other years of AS students, and whether the mean E2 score of the 1st year Mass Communications students is different from that of the 1st year Professional Management students.
There are a number of (fairly opaque) functions in R to do contrasts. The function below works fairly nicely in the case where all the treatments are balanced. To run the function, simply copy it in once, and then enter the lines to test your particular set of contrasts.
#The following function estimates specific contrasts and adjusts them
# by either using the step-down Bonferroni procedure or the Scheffe
# adjustment. For the step-down Bonferroni the final p-value reported
# is already adjusted and just needs to be compared to the alpha level.
# For Scheffe, both the final p-value and a confidence interval (default
# of 95%) are reported. The contrast matrix must be entered in a specific
# format for the function to work (see the example below). The function
# only needs to be copied in once.
#NOTE: This function requires that all treatments have equal sample size.
contrasts<-function(y,treat,control.mat,method="StepBon",conf.level=0.95,digits=4){
  dat.reord<-order(treat)
  treat<-treat[dat.reord]
  y<-y[dat.reord]
  s2w<-anova(lm(y~treat))[2,3]
  t<-length(table(treat))
  n<-length(y)/t
  ncontrasts<-nrow(control.mat)
  contrastmat<-matrix(as.numeric(control.mat[,2:(t+1)]),nrow=nrow(control.mat))
  colnames(contrastmat)<-levels(treat)
  divisors<-as.numeric(control.mat[,(t+3)])
  contrastd<-contrastmat/divisors
  cnames<-control.mat[,1]
  means<-tapply(y,treat,mean)
  L<-contrastd%*%means
  seL<-sqrt((s2w/n)*apply(contrastd^2,1,sum))
  t.stat<-L/seL
  Unadj.p<-2*pt(-abs(t.stat),df=n*t-t)
  baseout<-data.frame(Contrast=cnames,contrastmat,Div=divisors)
  meth=method
  if (method=="StepBon"){
    StepBon.p<-Unadj.p*rank(-Unadj.p)
    ord.un<-order(Unadj.p)
    for (i in 2:ncontrasts){
      if (StepBon.p[ord.un[i]]<=StepBon.p[ord.un[i-1]]){StepBon.p[ord.un[i]]<-StepBon.p[ord.un[i-1]]}
      if (StepBon.p[ord.un[i]]>1){StepBon.p[ord.un[i]]<-1}}
    out<-data.frame(Contrast=cnames,l=round(L,digits),
      se=round(seL,digits),t=round(t.stat,digits),raw.p=round(Unadj.p,digits),
      stepBon.p=round(StepBon.p,digits))}
  if (method=="Scheffe"){
    S<-seL*sqrt((t-1)*qf(conf.level,t-1,n*t-t))
    Scheffe.p<-1-pf((abs(L)/(seL*sqrt(t-1)))^2,t-1,n*t-t)
    CL.low<-L-S
    CL.hi<-L+S
    out<-data.frame(Contrast=cnames,l=round(L,digits),
      se=round(seL,digits),t=round(t.stat,digits),raw.p=round(Unadj.p,digits),
      Scheffe.p=round(Scheffe.p,digits),S=round(S,4),CL.low=round(CL.low,4),
      CL.hi=round(CL.hi,4))
    meth=paste(method,", conf.level=",as.character(conf.level),sep="",collapse="")}
  list(Method=meth,Definitions=baseout,Results=out) }
levels(grp) #Note the levels order is "1AS" "1MC" "1NU" "1PM" "2AS" "2MC" "3AS"
#Setting up the matrix of contrasts
#----------------------------------
#first line will always look like this
control.mat<-matrix(c(
#'name of contrast',coefficient list,'divisor=',divisor value,
'1AS vs. 1Other',3,-1,-1,-1,0,0,0,'divisor=',3,
'1AS vs. 2+3AS2',2,0,0,0,-1,0,-1,'divisor=',2,
'1MC vs. 1PM ',0,1,0,-1,0,0,0,'divisor=',1)
#the end of the last row will always be this, although your nrow
#needs to match the number of contrasts
,byrow=T,nrow=3)
contrasts(E2,grp,control.mat,method="StepBon")
contrasts(E2,grp,control.mat,method="Scheffe")
Regression, Factorial ANOVA, and ANCOVA:
Simple Linear Regression: Predict the E2 scores from the E1 scores.
R has a large number of functions for analyzing regression data, the most basic ones being the same as those for one-way ANOVA.
#The scatterplot with the fitted line
plot(E1,E2)
abline(lm(E2~E1))
#The basic output
Regfit<-lm(E2~E1)
summary(Regfit)
#The two standard residual plots
par(mfrow=c(1,2))
plot(Regfit,1)
plot(Regfit,2)
par(mfrow=c(1,1))
#The plots of the residuals vs. other variables to
# check independence. Looks like the errors might
# have some dependence due to college.
residuals<-Regfit$residuals
par(mfrow=c(2,2))
plot(E3,residuals);lines(c(-1e10,1e10),c(0,0))
plot(Hmwk,residuals);lines(c(-1e10,1e10),c(0,0))
plot(College,residuals);lines(c(-1e10,1e10),c(0,0))
plot(Year,residuals);lines(c(-1e10,1e10),c(0,0))
par(mfrow=c(1,1))
The function SASreg below will produce output similar to SAS using just one function. It requires that you have installed and loaded the car package. To use it, simply copy the function in once, and then apply it to your data set.
SASreg<-function(model){
  regout<-lm(model)
  baseoutput<-anova(regout)
  k<-nrow(baseoutput)-1
  Summary<-round(c(summary(lm(model))$sigma,
    summary(lm(model))$r.squared,
    summary(lm(model))$adj.r.squared),4)
  names(Summary)<-c("Root MSE","R square","Adj R-Squ")
  ANOVA<-rbind(apply(baseoutput[1:k,],2,sum),
    baseoutput[k+1,],
    apply(baseoutput[1:(k+1),],2,sum))
  rownames(ANOVA)<-c("Model","Error","C Total")
  attributes(ANOVA)$heading<-attributes(ANOVA)$heading[1]
  ANOVA[1,3]<-ANOVA[1,2]/ANOVA[1,1]
  ANOVA[1,4]<-ANOVA[1,3]/ANOVA[2,3]
  ANOVA[1,5]<-1-pf(ANOVA[1,4],ANOVA[1,1],ANOVA[2,1])
  attributes(ANOVA)$heading<-NULL
  TypeI<-baseoutput[1:k,]
  attributes(TypeI)$heading<-NULL
  TypeIII<-Anova(regout,type=3)[2:(k+1),c(2,1,3,4)]
  attributes(TypeIII)$heading<-NULL
  if (k==1){vifs<-1} else {vifs<-t(vif(regout))}
  ParEsts<-round(cbind(summary(regout)$coefficients,c(0,vifs)),4)
  colnames(ParEsts)[5]<-"VIF"
  par(mfrow=c(1,2))
  plot(regout$fitted.values,regout$residuals,xlab="Fitted",ylab="Residuals")
  lines(c(min(regout$fitted.values)-1,max(regout$fitted.values)+1),c(0,0))
  qqnorm(regout$residuals)
  qqline(regout$residuals)
  par(mfrow=c(1,1))
  list(Model.Equation=model,
    Coefficients=regout$coefficients,
    Summary=Summary,
    Analysis.of.Variance=ANOVA,
    Type.I.Tests=TypeI,
    Type.III.Tests=TypeIII,
    Parameter.Estimates=ParEsts)}
SASreg(E2~E1)
This output is perhaps a bit overkill for simple linear regression, but is very useful when performing multiple regression (or even ANCOVA or factorial ANOVA).
Box-Cox Transformation: To get the Box-Cox transformation for this data set we could use the boxcox function from the MASS package (which is automatically included with R and just needs library(MASS) to load):
#Assumes you've done library(MASS)
boxcox(lm(E2~E1))
#This plot doesn't have the peak!
#So try from 0 to 5 with spacing every .01 to see if it's better
boxcox(lm(E2~E1),lambda=seq(0,5,.01))
It looks like 1 is just outside the range of values that make the 95% cut-off; of the values that do, the "nicest" one close to the peak is 2. So the transformation y-squared is recommended.
Multiple Linear Regression: Predict the performance on Exam3 from the performance of the first two exams and the homework.
This can be done by mimicking the procedure for simple linear regression. In this case the model statement must be of the form E3~Hmwk+E1+E2. An interaction could be added by something like +E1:E2 if it were needed.
SASreg(E3~Hmwk+E1+E2)
The plots to help assess independence (the residuals versus other variables not included in the model) and to check for needing higher order terms (the residuals versus the independent variables) could be generated as follows:
residuals<-lm(E3~Hmwk+E1+E2)$residuals
#Sign of non-constant variance based on College?
par(mfrow=c(1,2))
plot(College,residuals)
plot(Year,residuals)
par(mfrow=c(1,1))
#Plotting against group might be better
plot(grp,residuals)
#No signs of needing higher order terms
par(mfrow=c(2,2))
plot(Hmwk,residuals)
plot(E1,residuals)
plot(E2,residuals)
par(mfrow=c(1,1))
Variable Selection: Assuming we trusted the model fit, is there a subset of Hmwk, E1, and E2 that predicts E3 just as well as all three of them?
R has some packages that do all subsets variable selection, but the output is somewhat inelegant. The Cp function below is a bit prettier. It requires the leaps library, and that package must be installed for it to run. It also requires that you put all of the predictor variables together in a matrix called X. To run it, first copy over the function (and use library(leaps)), construct the X matrix using cbind, and then run the function.
Cp<-function(X,Y){
  baseout<-summary(regsubsets(X,Y,nbest=10))
  inmat<-baseout$which[,-1]
  n<-nrow(inmat)
  namemat<-matrix(rep(colnames(X),n),nrow=n,byrow=T)
  namemat[!inmat]<-""
  namemat<-cbind(rep(" ",n),namemat)
  nvars<-apply(inmat,1,sum)
  sets<-apply(namemat,1,paste,collapse=" ")
  for (i in 1:ncol(X)){
    sets<-gsub("  "," ",sets)}
  out<-as.table(cbind(sets,round(baseout$cp,4),
    round(baseout$rsq,4),round(baseout$adjr2,4)))
  colnames(out)<-c("Variables","Cp","R square","adj-R2")
  rownames(out)<-nvars
  out}
X<-cbind(Hmwk,E1,E2)
Cp(X,E3)
It looks as if using Hmwk and just one of the other two exams just misses the general guideline for Mallows' Cp, and has slightly worse adjusted R-squared values.
Outlier Diagnostics: Are there examinees that are either extreme in terms of their predictor variables, have E3 badly predicted by the model, or significantly change the model?
R has the built in functions hatvalues, rstudent, dffits, and cooks.distance to help with outlier diagnostics. The function outlier below puts them into a single matrix.
outlier<-function(model){
  baseout<-lm(model)
  outs<-cbind(hatvalues(baseout),rstudent(baseout),
    dffits(baseout),cooks.distance(baseout))
  outs<-round(outs,4)
  colnames(outs)<-c("hii","ti","DFFITS","Cooks.D")
  outs}
outlier(E3~Hmwk+E1+E2)
Prediction Intervals and Confidence Intervals for the Regression Line: What is the confidence interval for the mean E3 scores for examinees with Hmwk=95, E1=70, and E2=85? What would the prediction interval be?
The prediction intervals and confidence intervals for the regression surface can be obtained using the built in predict function. The following give those intervals for data points matching the original data (again using the model we defined above):
predict(lm(E3~Hmwk+E1+E2),interval="confidence")
predict(lm(E3~Hmwk+E1+E2),interval="predict")
For Hmwk=95, E1=70, and E2=85:
predict(lm(E3~Hmwk+E1+E2),data.frame(Hmwk=95,E1=70,E2=85),interval="confidence")
predict(lm(E3~Hmwk+E1+E2),data.frame(Hmwk=95,E1=70,E2=85),interval="predict")
Factorial ANOVA: Conduct a 2-way ANOVA for Hmwk based on Year 1 vs. 2 and AS vs. MC.
The data for this problem can be constructed using:
HmwkA<-Hmwk[((College=="AS")|(College=="MC"))&((Year==1)|(Year==2))]
CollegeA<-College[((College=="AS")|(College=="MC"))&((Year==1)|(Year==2))]
CollegeA<-as.factor(as.character(CollegeA))
YearA<-Year[((College=="AS")|(College=="MC"))&((Year==1)|(Year==2))]
YearA<-as.factor(as.character(YearA))
The basic output can again be constructed either using the built in functions or SASreg. The interaction is added using a colon if the terms are listed separately, or all the terms will be crossed using an asterisk.
summary(lm(HmwkA~CollegeA+YearA+CollegeA:YearA))
par(mfrow=c(1,2))
plot(lm(HmwkA~CollegeA+YearA+CollegeA:YearA),1)
plot(lm(HmwkA~CollegeA+YearA+CollegeA:YearA),2)
par(mfrow=c(1,1))
SASreg(HmwkA~CollegeA+YearA+CollegeA:YearA)
SASreg(HmwkA~CollegeA*YearA)
Profile plots can be constructed using the built in function interaction.plot. The first argument is the variable on the x-axis, and the third is the response variable.
interaction.plot(CollegeA,YearA,HmwkA)
interaction.plot(YearA,CollegeA,HmwkA)
ANCOVA: Predict E3 from Hmwk using grp as a covariate.
Again this can be performed using either the built in functions or SASreg.
summary(lm(E3~Hmwk+grp))
par(mfrow=c(1,2))
plot(lm(E3~Hmwk+grp),1)
plot(lm(E3~Hmwk+grp),2)
par(mfrow=c(1,1))
SASreg(lm(E3~Hmwk+grp))
Note that the VIFs should not be part of the output in SASreg when categorical variables are present, and this is currently a bug.
Checking whether the lines are parallel can be done simply by checking the interaction Hmwk:grp.
SASreg(lm(E3~Hmwk*grp))
Functions in base R for other commonly used methods include:
For Nonparametrics: wilcox.test, friedman.test, kruskal.test
For Categorical Data: binom.test, chisq.test,
glm( ,family=binomial("logit")), fisher.test, mantelhaen.test
Part 3: Manipulating Data:
Loading and Saving Data:
Previously, we read in the data set sctable:
sctable<-read.table("c:/RforQM/scdata.txt",head=T)
A table can be saved using:
write.table(sctable[1:3,],"c:/RforQM/scdata2.txt",quote=F)
If quote=F is not used, then any columns of characters will be saved with quotation marks around them. By default, the table headings were saved with this file (go check!). It is always good to view your file before reading it in to check whether it has column headings or not.
Other commands for reading and writing data are read.delim, read.csv, write.csv, and read.fwf
mdatab1<-read.table("c:/RforQM/mdatab.txt",head=F)
mdatab1[1:2,]
mdatab2<-read.fwf("c:/RforQM/mdatab.txt",widths=rep(1,32))
mdatab2[1:2,]
If your data doesn't follow one of these easily used formats, then the function scan can also be useful.
Data Frames versus Matrices:
While tables are convenient for storing and viewing data sets (they can have multiple types of variables in them), many of the statistical functions require the data to be in matrix form. It is often easiest to create these matrices immediately and avoid the error messages later.
scdata<-as.matrix(sctable[,3:27])
counties<-as.matrix(sctable[,1])
regions<-as.matrix(sctable[,2])
Subsets of both tables and matrices are referred to in the same manner, by using brackets, with the selected rows first, followed by the selected columns.
sctable[c(34,38),c(18:21)]
scdata[c(34,38),c(16:19)]
Producing these (apparently) same results required different columns because the matrix was missing the two columns of characters. It is often useful to examine a matrix or table that is too large to fit on the screen by selecting either the first few rows, or by checking its dimension:
scdata[1:2,]
dim(scdata)
Note that there are often several ways to refer to the variables that you want to work with.
sctable$Income
sctable[,"Income"]
sctable[,11]
scdata$Income
scdata[,"Income"]
scdata[,9]
Notice that the fourth option did not work. Checking the attributes of the data frame and the matrix demonstrates why:
attributes(sctable)
attributes(scdata)
The former has each column name listed under $names (and so you can access them using $), while the matrix does not.
These abbreviations can also be used when asking for just a subset of rows and columns:
scdata[regions=="Upstate",c(8:10)]
It is possible to use logical operators to ask for only particular rows or columns of a matrix or data frame, or particular elements of a vector.
sctable[sctable$Income>40000,]
returns only those rows (the logical statement is before the comma) whose incomes are greater than 40000. Other logical operators include the following (an example combining several appears after the table):
!= | is not equal to |
< | less than |
<= | less than or equal to |
>= | greater than or equal to |
& | and |
| | or |
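For example, the & operator can combine conditions (the 75 cut-off here is just an arbitrary illustration, and columns 1, 2, and 11 are the county, region, and income columns used below):
sctable[(sctable$Income>40000)&(sctable$HSGrad>=75),c(1,2,11)]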
The function order (and its relatives rank and sort) are often useful as well:
sort(sctable$Income)
sctable[order(sctable$Income),c(1,2,11)]
While the matrix scdata inherited column names from the table used to create it, it currently lacks names for the rows.
rownames(scdata)<-sctable[,1]
upinc<-scdata[regions=="Upstate",9]
midinc<-scdata[regions=="Midlands",9]
upinc
midinc
In addition to assigning names to the rows and columns, new values can also be assigned over the previous value within the matrix:
Xmat
Xmat[1,2]<-99
Xmat
Applying Functions to a Matrix:
Many statistical procedures involve finding the value of a statistic separately for each row or column of a matrix. One way to do this is to use a loop. If this method is used it is necessary to first create a vector for the calculated values to be stored in:
colmeans<-rep(0,ncol(scdata))
for (i in 1:ncol(scdata)){
colmeans[i]<-mean(scdata[,i])}
colmeans
The rep function repeats the first argument (0 in this case) the number of times equal to the second argument (the number of columns of scdata). Similar to ncol, the function nrow would return the number of rows in a matrix, and length returns the number of elements in a vector.
An easier way to arrive at the same result is to use the apply function:
apply(scdata,2,mean)
apply is read in reverse - the above code says to apply the mean to the columns (rows=1, columns=2) of scdata. Apply can also be used with functions that return more than a single value, for example:
apply(scdata,2,summary)
In this case, the output of the summary function is a table, and so apply simply bound all of the tables from each of the columns together. In other cases the output might be in the form of a list:
apply(scdata,2,t.test)
Lists:
Lists are perhaps the most free-form of R's various objects:
valuelist<-list(means=apply(scdata,2,mean),t.test.1=t.test(scdata[1,]))
valuelist
valuelist$t.test.1
There are a variety of ways to add new elements to a list. The safest is possibly to simply put it in with the name you want:
valuelist$t.test.2<-t.test(scdata[2,])
valuelist
Using the append command can often have unintended consequences:
append(valuelist,t.test(scdata[3,]))
This happened because the output of t.test (like many functions) was already a list, and so append simply concatenated the two lists.
An example:
The following example ties several of the above ideas together and also introduces the which function. Imagine that we have a set of grades from a course that we would like to convert to letter grades using a particular weighting and letter-grade cut-offs.
students<-read.table("c:/RforQM/Coursedata.txt",head=T)
head(students)
tail(students)
In this particular case we want the weighting to be:
20% to Hmwk(col 3)
25% to E1(col 4)
25% to E2(col 5)
30% to E3(col 6)
But those weights could change later. A function can be written to take the data and the weights and calculate the weighted average.
course.grade=function(datain,wts,col.num){
x<-datain[,col.num]
final <- apply(x, 1, weighted.mean, wts)
return(final)
}
Final<-course.grade(datain=students, wts=c(.20,.25,.25,.30), col.num=3:6)
We can then add that score to the students data frame, and add a final column with the actual letter grades on a 90-80-70-60 scale.
students<-data.frame(students,Final)
head(students)
#Add the letter grades.
students[which(students$Final < 60),"Grade"] = "F"
students[which(students$Final >= 60 & students$Final < 70),"Grade"] = "D"
students[which(students$Final >= 70 & students$Final < 80),"Grade"] = "C"
students[which(students$Final >= 80 & students$Final < 90),"Grade"] = "B"
students[which(students$Final >= 90 ),"Grade"] = "A"
#Check that it worked
head(students)
Part 4: Graphics
The most basic of the graphics functions are plot, lines, text, and par. Almost everything can be done with these four functions and their many options. You can save the graphics by right-clicking on them or by using the jpeg function (among others); a sketch of the latter appears near the end of this section. Say, for example, we want to plot High School Graduation Rate as a function of Income.
plot(sctable$Income,sctable$HSGrad)
A few things we might want to change here are the x and y labels and the main title...
plot(sctable$Income,sctable$HSGrad,xlab="Average Income in Dollars",
ylab="High School Graduate Percentage",main="Graduate Rate versus Income by County")
Changing the symbols being plotted and their size is done by using the options pch and cex:
plot(sctable$Income,sctable$HSGrad,xlab="Average Income in Dollars",
ylab="High School Graduate Percentage",main="Graduate Rate versus Income by County",
pch="*",cex=5)
The details of these and many other functions can be found using help(par). To add the names of the counties, we could use the text command:
text(sctable$Income,sctable$HSGrad,labels=counties)
Notice that this has the previous symbol underneath the (too-large) words. We can remove that using type="n" to not plot the points, and then use cex in the plot command to make the words smaller.
plot(sctable$Income,sctable$HSGrad,xlab="Average Income in Dollars",
ylab="High School Graduate Percentage",main="Graduate Rate versus Income by County",
type="n")
text(sctable$Income,sctable$HSGrad,labels=counties,cex=0.8)
It isn't really appropriate for this plot at all, but we could add some lines over it if we wanted:
lines(sctable$Income,sctable$HSGrad)
Notice that it connected the points in the order the counties appear (alphabetically), not in the order of the values!
plot(sctable$Income,sctable$HSGrad,xlab="Average Income in Dollars",
ylab="High School Graduate Percentage",main="Graduate Rate versus Income by County",
type="n")
text(sctable$Income,sctable$HSGrad,labels=counties,cex=0.8)
lines(sort(sctable$Income),sort(sctable$HSGrad))
You shouldn't feel like you have to add text and lines only at data points:
lines(c(35000,45000),c(65,65),lty=3,lwd=5)
text(40000,66,"A line with some more options")
The following demonstrates how we could color code this plot by region of the state:
plot(sctable$Income,sctable$HSGrad,xlab="Average Income in Dollars",
ylab="High School Graduate Percentage",main="Graduate Rate versus Income by County",
type="n")
text(sctable$Income[regions=="Upstate"],sctable$HSGrad[regions=="Upstate"],
labels=counties[regions=="Upstate"],col="orangered1")
text(sctable$Income[regions=="Midlands"],sctable$HSGrad[regions=="Midlands"],
labels=counties[regions=="Midlands"],col="firebrick4")
legend("bottomright", c("Upstate", "Midlands"),
col=c("orangered1", "firebrick4"),lty=c(1,1), lwd=c(1,1),
cex=.7)
A list of colors can be found in the file col.html on the CD. A few other useful tricks are shown below:
Displaying the graphs in 2 row by 1 column format:
par(mfrow=c(2,1))
Make the size of the plotting symbols proportional to another variable and don't show the axes:
plot(sctable$Income,sctable$HSGrad,xlab="Average Income in Dollars",
ylab="High School Graduate Percentage",main="Graduate Rate versus Income by County",
cex=3*sctable$PopDens/max(sctable$PopDens),xaxt="n",yaxt="n")
Make the axes a different size:
plot(sctable$Income,sctable$HSGrad,xlab="Average Income in Dollars",
ylab="High School Graduate Percentage",main="Graduate Rate versus Income by County",
xlim=c(0,100000),ylim=c(0,100))
And plot the previous graph on top of this latest one to compare them:
par(new=T)
plot(sctable$Income,sctable$HSGrad,xlab="Average Income in Dollars",
ylab="High School Graduate Percentage",main="Graduate Rate versus Income by County",
cex=3*sctable$PopDens/max(sctable$PopDens),xaxt="n",yaxt="n")
par(mfrow=c(1,1))
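As mentioned at the start of this section, plots can also be written directly to a file. A minimal sketch using jpeg (the file name here is just an example) is:
jpeg("c:/RforQM/gradrate.jpg")
plot(sctable$Income,sctable$HSGrad)
dev.off()
The dev.off() closes the file so that it can be viewed; png and pdf work the same way.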
A variety of specialized statistical functions also use plot appropriately, such as the cluster analysis function hclust:
plot(hclust(dist(apply(scdata[,1:19],2,function(x){(x-mean(x))/sd(x)}),
method="maximum"),method="complete"),labels=counties)
Other, more specialized graphical functions include: boxplot, hist, and pie.
There are also several packages that will do three-dimensional graphics, including the lattice package that is automatically installed:
library(lattice)
cloud(sctable$Income~sctable$HSGrad*sctable$PopDens)
Part 5: Psychometric Packages
There are a number of packages that can be used for psychometric analysis. The following code uses a small sample of what is out there. A good reference with a list of numerous packages can be found at https://cran.r-project.org/web/views/Psychometrics.html. This section contains some examples from the CTT, psychometric, psych, difR, and mirt packages.
The first several examples use the data set samptest.txt. It also gives a chance to see how to score an exam using the CTT package. The next lines enter the answer key, format it for later use, and read in the data set.
#here is sample test data with 36 items and 2 demographic variables (ethnicity and gender)
key="ADCABCACDCDCBBCBDCADADCABDBABADBCADB"
new.key<-unlist(strsplit(key, split=""))
testdat<-read.fwf("C:/RforQM/samptest.txt", widths=c(rep(1,38)))
dim(testdat);head(testdat)
The CTT package contains basic data functions like distractor analysis, and scoring the data.
library(CTT)
distractor.analysis(testdat[,1:36], key=new.key, p.table = T)
#score gives both the sum scores and the response matrix
scoredat <- score(testdat[,1:36],new.key, output.scored=TRUE)
#get 0/1 responses only
responses<-scoredat$scored
#get the total scores only
sumscores<-scoredat$score
It also includes some basic classical analyses:
reliability(responses)
The psychometric and psych packages overlap somewhat, but are both larger.
library(psychometric)
alpha(responses)
alpha.CI(alpha(responses), N=dim(responses)[1], k=dim(responses)[2])
library(psych)
polyserial(as.matrix(sumscores),responses[,1])
It is important to check that the functions in each package actually do what you want. For example, when measuring item discrimination using the biserial correlation of each item with the rest score, it's necessary to check whether the polyserial function in the psych package agrees with texts like Lord and Novick, or Crocker and Algina.
biserials<-NULL
for (i in 1:ncol(responses)){
biserials<-c(biserials,polyserial(as.matrix(sumscores-responses[,i]),as.matrix(responses[,i])))}
biserials
Among the more specialized packages are those for procedures such as DIF detection in the difR package.
#Looking at difR
library(difR)
difdat=data.frame(responses,testdat[,38])
dichoDif(difdat, group = 37, focal.name = "F", method = c("MH","Logistic"), criterion="Wald")
There are also packages for estimating IRT models, including the mirt package. A larger data set is used for this example.
mdatab2<-read.fwf("c:/RforQM/mdatab.txt",widths=rep(1,32))
dim(mdatab2)
head(mdatab2)
#mirt package
library(mirt)
#estimating IRT model parameters
(mod1.rasch <- mirt(mdatab2, 1, itemtype = 'Rasch'))
(mod1.3PL <- mirt(mdatab2, 1, itemtype = '3PL'))
coef(mod1.rasch)
coef(mod1.3PL)
The function has several different fitting methods, and allows specification of prior distributions. The package also has a simdata function for simulating item response data (a small sketch follows). As with the classical test theory item discrimination example above, it's necessary to check how the items are parameterized (do they use the 1.7 scaling constant, for example).
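A small sketch of simdata (the slope and intercept values here are arbitrary, and you should check help(simdata) for the parameterization and itemtype codes your version of mirt expects):
a<-matrix(1,32,1)           #slopes for 32 items
d<-matrix(rnorm(32),32,1)   #intercepts
simmed<-simdata(a,d,N=500,itemtype='2PL')
dim(simmed)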
Part 6: Function Writing
The heart of R's popularity with statisticians is the intuitive way that new functions can be created. The core of defining a function is simply name<-function(arguments){ }. Whatever appears on the last line of code within the braces is the value that is returned to the user.
Consider wanting to create a function to conduct the chi-square test for whether a variance is larger than a specified value. [For a random sample from a normal population, (n-1) times the ratio of the sample variance to population variance has a chi-square distribution with (n-1) degrees of freedom -- this hypothesis should be rejected if it is greater than the critical value.] The input should be the data and the null hypothesis value. We would like the test to report back the test statistic and the p-value.
onevar.test<-function(x,var0){
n<-length(x)
teststat<-(n-1)*var(x)/var0
pval<-1-pchisq(teststat,df=n-1)
list(test.statistic=teststat,p.value=pval)}
vardata<-rnorm(20,mean=0,sd=10)
onevar.test(vardata,var0=10^2)
onevar.test(vardata,var0=5^2)
Functions can also be used to construct graphical displays (often greatly reducing the amount of time spent in constructing repetitive reports). The following function takes a vector of means and their corresponding vector of standard deviations, and plots all of the probability density functions on the same graph.
plot.norm<-function(mvec=0, sdvec=1){
maxy<-dnorm(0,sd=min(sdvec))  #peak height of the narrowest curve
minx<-min(mvec)-3.2*max(sdvec)
maxx<-max(mvec)+3.2*max(sdvec)
x<-seq(minx,maxx,.1)
color<-1:length(mvec)
plot(x, dnorm(x),type='n',xlab="x", ylab="probability", xlim=c(minx,maxx), ylim=c(0,maxy))
title("Plots of Normal Curves")
for (i in 1:length(mvec)){
lines(x,dnorm(x,mean=mvec[i],sd=sdvec[i]), col=color[i])}
}
mvec<--2:2
sdvec<-c(2,1,.5,1,2)
plot.norm(mvec=mvec,sdvec=sdvec)
plot.norm()
Part 7: Sample Simulations:
Verifying Sampling Distributions:
We know that if the sample is taken from a normal distribution, that the sampling distribution of t=(xbar-mu)/(s/sqrt(n)) is a t-distribution with n-1 degrees of freedom.
In the code below we first take 10,000 samples of size 20 from a standard normal distribution (which of course has mean 0). We can then calculate the 10,000 t-statistics, one for each sample. Finally we can compare the histogram of those values to the reference t-distribution with 19 degrees of freedom.
z<-matrix(rnorm(10000*20),ncol=20)
mu<-0
xbar<-apply(z,1,mean)
s<-apply(z,1,sd)
t<-(xbar-mu)/(s/sqrt(20))
hist(t,freq=F,ylim=c(0,0.45),nclass=50)
vals<-(-100000:100000)/1000
f<-dt(vals,df=19)
lines(vals,f,col="red",lwd=2)
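A quick numerical check (a sketch): the fraction of simulated t statistics beyond the two-sided 5% critical values should be close to 0.05.
mean(abs(t)>qt(.975,df=19))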
On the other hand, a chi-square distribution with 5 degrees of freedom is very skewed right, with mean 5 and variance 10.
chisq5<-rchisq(1000,df=5)
par(mfrow=c(1,2))
hist(chisq5)
qqnorm(chisq5);qqline(chisq5)
par(mfrow=c(1,1))
mean(chisq5)
var(chisq5)
Repeating the above simulation with the chi-square instead of the normal, the distribution of (xbar-mu)/(s/sqrt(n)) is still fairly close to the reference t distribution!
z<-matrix(rchisq(10000*20,df=5),ncol=20)
mu<-5
xbar<-apply(z,1,mean)
s<-apply(z,1,sd)
t<-(xbar-mu)/(s/sqrt(20))
hist(t,freq=F,ylim=c(0,0.45),nclass=50)
vals<-(-100000:100000)/1000
f<-dt(vals,df=19)
lines(vals,f,col="red",lwd=2)
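The same tail-probability check quantifies "fairly close"; with the chi-square parent the rejection rate is usually still in the neighborhood of 0.05, though typically not quite as close as in the normal case.
mean(abs(t)>qt(.975,df=19))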
A second result is that if the data come from a normal distribution, then (n-1)s^2/sigma^2 has a chi-square sampling distribution with n-1 degrees of freedom. When the population is normal, we can see that the relationship holds.
z<-matrix(rnorm(10000*20),ncol=20)
sig2<-1
s2<-apply(z,1,var)
X2<-(20-1)*s2/sig2
hist(X2,freq=F,ylim=c(0,0.08),nclass=50)
vals<-(-100000:100000)/1000
f<-dchisq(vals,df=19)
lines(vals,f,col="red",lwd=2)
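A chi-square distribution with 19 degrees of freedom has mean 19 and variance 38, so a quick check of the simulated statistics is:
mean(X2)
var(X2)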
Not so much when the population is chi-square.
z<-matrix(rchisq(10000*20,df=5),ncol=20)
sig2<-10
s2<-apply(z,1,var)
X2<-(20-1)*s2/sig2
hist(X2,freq=F,ylim=c(0,0.08),nclass=50)
vals<-(-100000:100000)/1000
f<-dchisq(vals,df=19)
lines(vals,f,col="red",lwd=2)
Testing equality of two variances:
When both populations are normally distributed, the F-test for two variances should perform very well. In this simulation we can examine whether the test maintains its nominal alpha-level. First we simulate 10,000 pairs of samples of size 10. We conduct the F-test on each of these 10,000 data sets and record the 10,000 p-values. Note that this is slightly awkward because apply works on one matrix at a time, while var.test needs two samples. To facilitate it, a helper function (var.sim) takes a row of the column-bound data, splits it back into the two samples, does the F-test, and returns just the p-value. In any case, the test works fairly well when the populations are normal; the distribution of p-values looks uniform and roughly 5% are less than 0.05.
n<-10
nsims<-10000
x1<-matrix(rnorm(n*nsims),ncol=n)
x2<-matrix(rnorm(n*nsims),ncol=n)
fulldat<-cbind(x1,x2)
var.sim<-function(fdat,n1,n2){
var.test(fdat[1:n1],fdat[(n1+1):(n1+n2)],alternative="greater")$p.val}
pvals<-apply(fulldat,1,var.sim,n1=n,n2=n)
hist(pvals)
sum(pvals<=0.05)/nsims
This is not the case when the populations follow other distributions, here a (very heavy-tailed) t distribution with 3 degrees of freedom:
n<-10
nsims<-10000
x1<-matrix(rt(n*nsims,df=3),ncol=n)
x2<-matrix(rt(n*nsims,df=3),ncol=n)
fulldat<-cbind(x1,x2)
var.sim<-function(fdat,n1,n2){
var.test(fdat[1:n1],fdat[(n1+1):(n1+n2)],alternative="greater")$p.val}
pvals<-apply(fulldat,1,var.sim,n1=n,n2=n)
hist(pvals)
sum(pvals<=0.05)/nsims
Multiple Comparisons Simulation:
In many cases statistical results concerning type I error rates (α) and power are too difficult to calculate exactly (either beyond the level of the class one is teaching/taking, or beyond the means of current computational power). In such cases a simulation study can provide an "easy" approximation to the results.
One example of this is the issue of multiple comparisons - conducting a large number of hypothesis tests in the search for a statistically significant result. Consider the case of researching what influences students to perform well on a state assessment by using 100 students about whom we have a great deal of information - say 50 individual demographic variables. We can imagine a researcher who naively conducts the 50 different hypothesis tests (one for each demographic variable) to see if the test scores are predicted by any of the individual demographic variables. To examine how such multiple comparisons impact type I errors, we want to examine the situation where the demographic variables are actually unrelated to the test scores in the population.
To demonstrate this, let's make up some random test scores and random possible covariates. For the sake of convenience we will arbitrarily assume the various variables are normally distributed, but other distributions could be used as well.
First we generate the 100 test scores with a mean of 50 and standard deviation of 10:
testscores<-rnorm(100,50,10)
One method of simulating the 50 demographic variables would be to create separate variables for each one:
demo1<-rnorm(100,50,10)
demo2<-rnorm(100,50,10)
# etc...
This would be painfully time consuming both to generate and to analyze later. Instead we can store the various simulated demographic variables in a matrix (which could also allow us to use the apply function later). One way of doing this is to use a loop:
demo<-matrix(0,ncol=50,nrow=100)
for (i in 1:50){
demo[,i]<-rnorm(100,50,10)}
Since all 5000 (50*100) variables have the same distribution, we could also have simply done:
demo<-matrix(rnorm(100*50,50,10),nrow=100)
demo
There are several ways to look for a statistically significant relationship between the testscores and demographic variables. One of them is to use a test for correlations, which is done in cor.test. For a single demographic variable this could be done using:
cor.test(testscores,demo[,1])
To judge statistical significance, though, we only need the p-value. Checking the attributes of the output, we can find the code required to retrieve just that.
cor.out<-cor.test(testscores,demo[,1])
attributes(cor.out)
cor.out$p.value
An easy-to-follow way of calculating these p-values for every column of demo would be:
pvals<-rep(0,50)
for (i in 1:50){
pvals[i]<-cor.test(testscores,demo[,i])$p.value}
pvals
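Since the demographic variables are stored in a matrix, the same p-values can also be obtained without a loop, using the apply approach mentioned earlier (a sketch of the alternative):
pvals<-apply(demo,2,function(d) cor.test(testscores,d)$p.value)
pvals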
Counting how many of these are less than a given α can be done then by using:
sum(pvals<=0.05)
Of course finding a few significant variables in a single data set shouldn't necessarily set off any alarm bells... after all, all significance tests have a chance of a type I error. Maybe this was the unlucky 5%.
But now that the basic steps are worked out, they can be put together in a function that allows the user to specify the number of students, the number of demographic variables, the alpha-level and the number of simulations to be run.
rejection.count<-function(nsims,nstudents,ndemo,alpha){
rejectcount<-rep(0,nsims)
for (j in 1:nsims){
testscores<-rnorm(nstudents,50,10)
demo<-matrix(rnorm(nstudents*ndemo,50,10),nrow=nstudents)
pvals<-rep(0,ndemo)
for (i in 1:ndemo){
pvals[i]<-cor.test(testscores,demo[,i])$p.value}
#number of "significant" demographic variables in this simulated study
rejectcount[j]<-sum(pvals<=alpha)
}
rejectcount
}
results<-rejection.count(100,100,50,.05)
table(results)
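One way to summarize the table is the proportion of simulated studies that found at least one "significant" demographic variable:
mean(results>0)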
It is now easy to see that well over 5% of the different data sets resulted in at least one rejection - almost every study that conducts a large number of hypothesis tests, and doesn't adjust for that fact, will find something "significant"!
Changing α to .05/50 (Bonferroni's correction) leads to very different results in terms of this overall type I error rate. A follow-up study could investigate the effect that it would have on the power though...
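As a sketch of that comparison, rerunning the simulation with the Bonferroni-adjusted α shows how sharply the per-study rejection rate drops:
results.bonf<-rejection.count(100,100,50,.05/50)
table(results.bonf)
mean(results.bonf>0)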