STAT 702

The Basics of R

Ch.1 #17: Plotting Acceptance Probabilities

Ch.2 #67: Plotting PDFs

Lecture 23: Expected Number of Tests Needed

Lecture 23: Benefit of Stratifying

STAT 703

Lecture 3: MoM, QQ Plots, and Parametric Bootstrap

Lecture 4: Maximum Likelihood Estimation

Homework 2: MLEs for Logistic-type Regression

The Basics of R:

R, like the commercial S-Plus, is based on the programming language S. It can be downloaded for free from www.r-project.org [At home, after reading the instructions, choose: CRAN, then a mirror site in the US, then Windows (95 and later), then base, then R-2.3.1-win32.exe]. This page also has links to FAQs and other information about R.

One of the nice things about R (besides the fact that it's powerful and free) is that you can save everything you've done when you finish each session. If you are using a network machine where you don't have access to the C drive, you can manually load and save the .RData file on another drive each time you want to start and end the program. You would, of course, have to save the .RData file on disk if you wanted to use it both at home and at school.
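As a rough sketch (the drive letter and file name below are only placeholders, not paths from class), the manual save and load would look like:

```r
# Save every object in the current workspace to a file on another drive.
# The path "E:/stat702.RData" is just an example location.
save.image("E:/stat702.RData")

# At the start of the next session, bring all of those objects back:
load("E:/stat702.RData")
```

`save.image()` is the same thing R offers to do when you quit, and after `load()` the function `objects()` will list the restored objects again.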

At the heart of R are the various objects that you enter. At any
time (when you are in R)
you can type in the function `objects()` to see which ones you have
entered previously.
Presumably there aren't any there right now.
R also has many built in objects, most of which are functions. Because these
objects are always there, it will not list them when you type `objects()`.

A list of the various functions for R can be found in the manuals that are
installed with the program. The various manuals can be found under the
`Manuals` submenu of the `Help` menu. In particular, the *R Reference Manual* lists not only all the functions in the base package of R,
but also some of those in the supplementary packages as well.
Because this manual
is nearly 1000 pages long, you probably shouldn't print out a copy!!!!

The various supplementary packages are simply groups of functions that others
wrote and decided to make available to the public. A list of those that
you downloaded can be found under the `Packages` menu by choosing
`Load Package`. The one that we will use most often is the
`MASS` package.
It can be loaded in by typing
`library(MASS)`.
Note that some of the non-standard libraries need to be downloaded and
unzipped somewhere first if you don't have access to your C drive.
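As a sketch, the `lib.loc` argument of `library()` can point at such a folder (the folder name below is only an example, not one from class):

```r
# Load MASS from the standard library location:
library(MASS)

# If a package had to be unzipped into a personal folder instead,
# lib.loc tells library() where to look (the path is only an example):
# library(MASS, lib.loc = "E:/Rlibs")
```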

Once you know the name of a function, you can get help about it simply
by typing `help(functionname)`. In R, parentheses indicate either
mathematical grouping, like they usually do, or that you are applying a
function. Square brackets `[ ]` indicate that you are selecting an element of
a vector.

This brief example shows some of R's flexibility.

`x<-c(5,4,3,2)`

The `<-` is the command that assigns whatever is on the right to
the name that is on the left. The `c` indicates that an array
or vector is going to be entered. Simply typing `x` now will
return those four values. Typing

`x[3]`

will return the third value,

`mean(x)`

will return the mean, and

`objects()`

will show
you that you now have an object called `x`.

R also has a variety of graphical functions and statistical functions built in.

`hist(x)`

will construct a histogram, and

`t.test(x,alternative="greater",mu=5)`

will test the null hypothesis mu=5 vs. the alternative that mu > 5. (If you happen to mistype a command and get an error, just hit the up arrow to bring it up again for editing).

Besides just getting the result of a function, we can store the results somewhere. For example

`x.t<-t.test(x,alternative="greater",mu=5)`

will store the result of the t-test in x.t.

`attributes(x.t)`

will
then show you the various properties of `x.t`.
Thus,

`x.t$p.value`

will return the p-value.

R can also be used to write your own functions. For example, if
we wanted a function that returned sin(x)+x^2, we could write:

`
weird<-function(x){
y<-sin(x)+x^2
return(y)
}
`

Try `weird(5)` and `weird(x)`. What is it doing?
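One way to see what is happening (a sketch): since `sin()` and `^` are vectorized, applying `weird()` to a vector applies it to every element at once.

```r
weird<-function(x){
  y<-sin(x)+x^2
  return(y)
}

x<-c(5,4,3,2)

length(weird(5))  # 1: one input gives one output
length(weird(x))  # 4: the function acts elementwise on the whole vector
```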

One of the things we will be doing on some homework assignments is using R to generate random numbers following various discrete and continuous distributions, and seeing how they behave. Try:

`
ndat<-rnorm(200,5,10)
chidat<-rchisq(200,4)
par(mfrow=c(2,1))
hist(ndat)
hist(chidat)
`

Ch. 1 #17: Plotting Acceptance Probabilities

In R it is easy to construct a scatterplot of two sets of points against each other. Consider:

X: 12 -12 18
Y: 16  6 21

The vectors would be entered using the code:

`
x<-c(12,-12,18)
y<-c(16,6,21)
`

A basic plot would then be made using

`plot(x,y)`

A few options are setting the limits of the x and y axes, the axis labels, and the plot type:

`plot(x,y,xlim=c(-20,20),ylim=c(0,25),xlab="variable 1",ylab="variable 2",type="l")`

Because the x values aren't in increasing order, the connecting lines are drawn back and forth across the plot. To fix this we can sort the x values and put the y values in the matching order:

`
x1<-sort(x)
y1<-y[order(x)]
plot(x1,y1,xlim=c(-20,20),ylim=c(0,25),xlab="variable 1",ylab="variable 2",type="l")
`

Now the homework assignment gives that the x values (the p's) are: 0, 0.05, 0.10, 0.15, 0.2, and 0.25. But now we need to find the y values!

__Treating it as a binomial:__
The following code will calculate the probability we examined in class
where we wanted to know the chance of seeing exactly 10 defectives out of
20, where the probability of a defective is 0.02.

`choose(20,10)*(0.02)^10*(1-0.02)^(20-10)`

This is the same as the built in function `dbinom`, so

`dbinom(10,20,.02)`

should return the same value. If we wanted to get this value for a bunch of different probabilities we could try something like:

`dbinom(10,20,c(0,0.05,0.10,0.15,0.2,0.25))`

or

`
ps<-c(0,0.05,0.10,0.15,0.2,0.25)
ps
dbinom(10,20,ps)
`

We could also get the probability of exactly 10 or 11 by adding two of these vectors together:

`
psfor10<-dbinom(10,20,c(0,0.05,0.10,0.15,0.2,0.25))
psfor11<-dbinom(11,20,c(0,0.05,0.10,0.15,0.2,0.25))
psfor10or11<-psfor10+psfor11
psfor10or11
`

or to get the probability that NOT exactly 10 or 11:

`1-psfor10or11`

And we could plot it...

`plot(ps,1-psfor10or11,xlim=c(0,0.25),ylim=c(0,1),xlab="p",ylab="prob",type="l")`

Not a very exciting plot for this example!

__Treating it as a Hypergeometric:__
It is also possible (and closer to the problem's actual phrasing) to treat
this problem as a hypergeometric instead of a binomial. As an example,
consider the problem with the 0.02 defective rate. Imagine that there
were originally 2000 parts of which 40 were defective (40/2000=0.02).
The box of 20 we are looking at had 20 randomly chosen from that 2,000.

This would then be a hypergeometric with population size 2000, number of "successes" (defective in this case) 40, and number chosen 20. The probability of getting exactly 10 defectives out of the twenty would then be:

`choose(40,10)*choose(2000-40,20-10)/choose(2000,20)`

from the formula at the top of page 13 or the middle of page 40.

This is the same as the built in function `dhyper`.
`dhyper` must be given four values:
the number of "successes" chosen, the number of possible
"successes" overall, the number of possible "failures" overall,
and the total sample size we chose. So

`dhyper(10,40,1960,20)`

should return the same value. If we wanted to get this value for a bunch of different probabilities we just need to adjust them to be the number out of 2000 instead of just percents.

`
ps<-c(0,0.05,0.10,0.15,0.2,0.25)
numbad<-2000*ps
numgood<-2000-numbad
dhyper(10,numbad,numgood,20)
`

We could also get the probability of exactly 10 or 11 by adding two of these vectors together:

`
psfor10<-dhyper(10,numbad,numgood,20)
psfor11<-dhyper(11,numbad,numgood,20)
psfor10or11<-psfor10+psfor11
psfor10or11
`

or to get the probability that NOT exactly 10 or 11:

`1-psfor10or11`

And we could plot it...

`plot(ps,1-psfor10or11,xlim=c(0,0.25),ylim=c(0,1),xlab="p",ylab="prob",type="l")`

Not a very exciting plot for this example!

Ch. 2 #67: Plotting PDFs

The code below is from in class on September 23rd, where we plotted the pdf for a variety of gamma distributions.

`
par(mfrow=c(2,2))
t<-(0:2500)/100
ga_t<-dgamma(t,shape=1,rate=1)
gb_t<-dgamma(t,shape=2.5,rate=1)
gc_t<-dgamma(t,shape=5,rate=1)
gd_t<-dgamma(t,shape=10,rate=1)
plot(t,ga_t,xlim=c(0,25),type="l")
plot(t,gb_t,xlim=c(0,25),type="l")
plot(t,gc_t,xlim=c(0,25),type="l")
plot(t,gd_t,xlim=c(0,25),type="l")
t<-(0:2500)/100
ge_t<-dgamma(t,shape=2.5,rate=0.25)
gf_t<-dgamma(t,shape=2.5,rate=0.5)
gg_t<-dgamma(t,shape=2.5,rate=2)
gh_t<-dgamma(t,shape=2.5,rate=4)
plot(t,ge_t,xlim=c(0,25),type="l")
plot(t,gf_t,xlim=c(0,25),type="l")
plot(t,gg_t,xlim=c(0,25),type="l")
plot(t,gh_t,xlim=c(0,25),type="l")
`

The first four plots vary the shape parameter with the rate fixed at 1, and the last four vary the rate with the shape fixed at 2.5.

See `help(dweibull)` for information on the Weibull
distribution... and make sure to match up the parameters!

Lecture 23: Expected Number of Tests Needed

The following code is from Lecture 23 and is designed to calculate
the expected number of tests for various numbers of groups (see Example C
on page 121). `n` is the overall number of samples, `m` is
the number of groups it is divided into, and `p` is the probability
of any individual being "infected".

`
expntests<-function(n,m,p){
m*((1-p)^(n/m)+(n/m+1)*(1-(1-p)^(n/m)))
}
n<-5000
p<-.001
ms<-120:160
cbind(120:160,expntests(n=n,m=ms,p=p))
`

Lecture 23: Benefit of Stratifying

The following code is from Lecture 23 and is designed to show the
benefit of stratified sampling over simple random sampling. Consider
sampling from a population of size `N` with strata of size `Ni`.
The population overall has proportion `p` of the desired trait, while
each stratum has proportion `pi`. Compare the variance
of an overall sample of size `n` to stratified samples of size
`ni` from each stratum.

For example consider a population of size 150 with prevalence 0.5 that could be broken into strata of size 50 with prevalences 0.1, 0.5, and 0.9. Compare taking a sample of size 30 from the overall population to taking samples of size 10 from each strata.

`
N<-150
Ni<-c(50,50,50)
p<-0.5
pi<-c(0.1,0.5,0.9)
n<-30
ni<-c(10,10,10)
v<-(p*(1-p)*(N-n)/(N-1))/n
vstrat<-sum((Ni/N)^2*pi*(1-pi)*(Ni-ni)/(Ni-1)/ni)
v
vstrat
`

In this case we see that the variance using the stratified sampling (0.0039) is less than the variance using a simple random sample (0.0067). Changing the various values can show that the benefit disappears if the strata levels aren't actually different in terms of their prevalences.

Lecture 3: MoM, QQ Plots, and Parametric Bootstrap

The following code is from class on January 18th. The first step was to
load the data set into R, and to then pick off the first column. I called
it `pars` as short for parameters, and saved it as the vector `a`.
I sorted it as well to make it easier to look at. `c` is the
name of a built-in function, so you shouldn't call a vector that (maybe `cc`?).

`
pars<-read.table("http://www.stat.sc.edu/~habing/courses/703/itests.txt",head=TRUE)
a<-sort(pars[,1])
`

This code calculates the mean and estimated variance from the data. Notice that the formulas we are using divide by n and not n-1, so we need to adjust what R is calculating. I then calculated the method of moments estimators (I also had it find the number of observations since I would need that later).

`
n<-length(a)
xbar<-mean(a)
sigma2hat<-(n-1)/n*var(a)
lhat<-xbar/sigma2hat
ahat<-xbar^2/sigma2hat
`

The q-q plot is made using the `qgamma` (quantiles of the gamma)
function. Other distributions have similarly named functions. Recall from
last semester that our lambda is the rate parameter (=1/scale) in R. The
line on the q-q plot is drawn using the quantiles of the gamma.

`
plot(sort(a),qgamma((1:n)/(n+1),shape=ahat,rate=lhat))
lines(qgamma((1:n)/(n+1),shape=ahat,rate=lhat),qgamma((1:n)/(n+1),shape=ahat,rate=lhat))
`

Finally, here is the code for estimating the sampling distribution
from 100,000 bootstrap samples
for this example. Try `summary(lhat.dist)` or `sd(lhat.dist)`
or `hist(lhat.dist)` to see the properties of lambda's sampling
distribution.

`
nsamples<-100000
x<-rgamma(n*nsamples,shape=ahat,rate=lhat)
x<-matrix(x,ncol=n)
xbar.dist<-apply(x,1,mean)
s2h.dist<-(n-1)/n*apply(x,1,var)
lhat.dist<-xbar.dist/s2h.dist
ahat.dist<-xbar.dist^2/s2h.dist
`

Lecture 4: Maximum Likelihood Estimation

The code below uses R to find the maximum likelihood estimates
for the example above. The idea is to use
the built in function for finding extrema called `optim`, which
finds minima, and to enter the log-likelihood function manually.
Of course we want the maximum, and not the minimum, so we need
to have `ngamloglike` be the negative of the log-likelihood.

`optim` is set up so that the first argument of the function being
minimized (`pars` here) is what we want estimated, and the second
argument `data` is what we know. You should be able to check that the
following code contains the function for the negative log-likelihood
from page 256, and then asks R to use `optim`, starting with 6 for
the initial `a` guess, and 6 for the initial `l` guess.

`
ngamloglike<-function(pars,data){
a<-pars[1]
l<-pars[2]
n<-length(data)
-1*
(n*a*log(l)+(a-1)*sum(log(data))-
l*sum(data)-n*lgamma(a))
}
optim(c(6,6),ngamloglike,data=a)
`

Notice that the estimates are alpha-hat=7.338857 and lambda-hat=6.854916.

We can also apply the bootstrap to these estimates to see if it appears we really have recovered something near the maximum of the sampling distribution.

`
amle<-7.338857
lmle<-6.854916
nsamples<-100000
x<-rgamma(n*nsamples,shape=amle,rate=lmle)
x<-matrix(x,ncol=n)
xbar.dist<-apply(x,1,mean)
s2h.dist<-(n-1)/n*apply(x,1,var)
lmle.dist<-xbar.dist/s2h.dist
amle.dist<-xbar.dist^2/s2h.dist
hist(lmle.dist,nclass=50,xlim=c(0,25))
lines(c(mean(lmle.dist),mean(lmle.dist)),c(-1000,30000),lwd=4,lty=5)
text(14,8000,"Mean of Estimates")
lines(c(lmle,lmle),c(-1000,30000),lwd=4)
text(3,8000,"True Value")
`

Homework 2: MLEs for Logistic-type Regression

The following code reads in the data concerning the temperature and the failure of the space shuttle O-rings before the Challenger disaster. The first column contains the temperature, and the second whether an O-ring failed or not -- 1 indicates it worked successfully and 0 indicates a failure.

`
oring<-read.table("http://www.stat.sc.edu/~habing/courses/703/oring.txt",head=TRUE)
oring<-as.matrix(oring)
`

The function below calculates the negative log-likelihood for this model:

`
lr.nloglik<-function(pars,data){
a<-pars[1]
b<-pars[2]
s<-1/b
m<--a*s
x<-data[,1]
y<-data[,2]
p<-plogis(x,m,s)
-sum(y*log(p)+(1-y)*log(1-p))
}
optim(c(0,.1),lr.nloglik,data=oring)
`

This can be compared to the results gotten using the built in `glm`
function. The general format is `glm(y~x,modeltype)` where we are
using `x` to predict `y` and modeltype is appropriate for
the kind of model we are fitting (`binomial` in this case).

`
glm(oring[,2]~oring[,1],binomial)
`
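To put the two sets of estimates side by side (a sketch that assumes the `oring` matrix and `lr.nloglik` function defined above are in your workspace; the names `fit.optim` and `fit.glm` are just for illustration): `optim` returns its estimates in the `$par` component, while `glm` returns them through `coef()`.

```r
# Fit the model both ways (oring and lr.nloglik are defined above).
fit.optim <- optim(c(0, .1), lr.nloglik, data = oring)
fit.glm <- glm(oring[,2] ~ oring[,1], binomial)

# The two rows should be nearly identical: the intercept matches a
# and the slope matches b from the hand-coded likelihood.
rbind(optim = fit.optim$par, glm = unname(coef(fit.glm)))
```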