STAT J530 Student Questions with Answers

- The concept of a fixed zero is not so clear to me; can you please expand on that, maybe with a few examples? For example, for an interval variable we do not have a fixed zero, but a relationship exists that could be used to set a zero level (absolute zero in K corresponds to particular values in F and C).

Yes, you're right, absolute zero in K can be converted to F or C, but the fact remains that the number 0 doesn't have any special meaning in F or C. It's arbitrary, rather than inherently meaningful the way zero kelvins or zero inches of length is. It's hard to think of many examples of interval measurements in the real world. Decibel levels may be one. You can say that a sound of 80 dB is 60 dB louder than a sound of 20 dB, but it doesn't make sense to say that it is four times louder, because 0 dB is arbitrary (it doesn't have an inherent meaning). This isn't entirely accurate, because decibels are measured on a logarithmic scale. But it's fair to say that the position of "zero decibels" is arbitrary.

- For a ratio variable, let's say we have a group of people who went to Vegas: we have break-evens, winners (+), and losers (-), so there is a fixed zero, the break-even point. But how is this a ratio variable? Please expand.

It makes sense to take ratios here, because if Jim made a profit of $8 and Sam made a profit of $2, it is OK to say that Jim's profit is four times as large as Sam's. Zero has an inherent meaning here.

- While I understand the missing data slide very well, I have a question on multiple data points. What if we have the same response measured under multiple factor combinations, or what if we have the same factor combination and multiple responses? Does that mean we are not looking at a multivariate data set? Or is that a sign of uncertainty?

I think this wouldn't be a multivariate data set in the traditional sense. While you could possibly define variables to create a multivariate data matrix, if you analyzed such data using multivariate analysis methods, you would probably lose the information about what the factor combinations mean. You wouldn't be able to examine interactions between factors, for example.

- I like R, but one challenge with it is this: what if I have my data in an Excel sheet? Can you demonstrate how we form the data structure from an external file (such as Excel)? I tried some tutorials and have not been very successful.

I don't do this myself -- I usually copy the data from Excel into a text editor and go from there. But I've heard that a good function for reading Excel data into R is the read.xls function in the gdata package. Type:
install.packages("gdata")
library(gdata)
help(read.xls)

for more details.
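For instance, a rough sketch of how it might be used (the file name and sheet number here are only placeholders, not files from the course):
my.data <- read.xls("mydata.xls", sheet = 1, header = TRUE)   # reads the first sheet into a data frame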

- When we used the correlation matrix, it basically told us the relationships between each pair of variables; for example, we found out that the age of the husband and the age of the wife are highly (+) correlated. This tells us the generic trends in the data, so why do we need the Euclidean distances? The Euclidean distance matrix in this case is 10 x 10 and symmetric (as expected), but what does it really tell us? Is there any relationship between the correlation data we have and the distances? Can you please expand on this and give us an idea of the importance of this matrix?


The Euclidean distances (in the distance matrix) tell us how different the pairs of individuals are from each other. The correlations tell us how different or similar the variables are from each other. Sometimes we speak of distances between variables (which are then usually some function of the correlation values), but this is rare. Typically we deal with distances between individuals. So to summarize, the distance matrix is n-by-n and shows pairwise dissimilarities between individuals, while the correlation matrix is q-by-q and shows pairwise similarities between variables.
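A small illustration in R, with made-up data, just to show the two shapes:
x.mat <- matrix(rnorm(50), nrow = 10, ncol = 5)   # 10 individuals measured on 5 variables
as.matrix(dist(x.mat))                            # 10-by-10 matrix of Euclidean distances between individuals
cor(x.mat)                                        # 5-by-5 matrix of correlations between variables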

The kde2d function in the MASS package provides the same functionality as Everitt's home-brewed function for calculating bivariate density. I like to use functions from standard packages whenever possible and I will assume that it is OK with you that I do this.

In general, I don't mind that as long as you are familiar with the other function and know that it's accomplishing (essentially) the same thing. For people who are new to R, I would recommend sticking with the functions that we have talked about in class. But for experienced R users, feel free to look around R for other functions that do similar/identical tasks.
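For instance, a minimal sketch with made-up data, just to show the mechanics of kde2d:
library(MASS)
x <- rnorm(100); y <- rnorm(100)      # made-up bivariate data
dens <- kde2d(x, y, n = 50)           # bivariate kernel density estimate on a 50-by-50 grid
contour(dens)                         # contour plot of the estimated density
persp(dens$x, dens$y, dens$z)         # or a perspective (3-D surface) plot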

Can I open chap2airpoll.dat directly in R or do I have to copy and paste the text?

You can copy and paste it into R as we have done in some of the examples in class (like the husband-wife data set), or you can use the read.table function to read it into R. For example, something like:
my.data <- read.table("http://www.stat.sc.edu/~hitchcock/usair.txt", header=T)
should work since the first line is simply a header.

I am sorry to bother you again, but I still can't get R to read a local file. I tried to modify your example:
>bumpbird <- read.table("Z:/stat_530/bumpusbird.txt", header=T)
to:
> chap2airpol <- read.table("E:\Multivariate_Stats_0908\Data\chap2airpol.txt", header=T)
I get:
Error: '\M' is an unrecognized escape in character string starting "E:\M"

I can attach the file via the web, but this doesn't help me much if I want to attach a local file.


One error I see is that R wants the slashes in the directory specification to go the opposite way from the way you have them, i.e.:
Use / rather than \ in the file path.
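With your path, that would be:
chap2airpol <- read.table("E:/Multivariate_Stats_0908/Data/chap2airpol.txt", header=T)
(Doubling the backslashes, as in "E:\\Multivariate_Stats_0908\\Data\\chap2airpol.txt", also works, since \\ is R's escape sequence for a single backslash.)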

Another easy way to get around this is to go to File -> Change dir... in the R pull-down menu. Navigate to the directory where your file is located and click OK. Then you can specify just the name of the file, like:
read.table("chap2airpol.txt", header=T)

I'm a little confused about problem 2.6. Is it asking us to construct chiplots for each pair of variables, or am I just supposed to analyze how effective it is in showing independence?

You can construct chiplots for each pair of variables. You don't need to print all of them out for your homework, though -- there are a lot of them. Just summarize in a paragraph what your conclusions are from all the chiplots, and you could include a few (not all!) of the chiplots as support for your conclusions.
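If it helps, here is one possible way to loop over all the pairs. This is only a sketch: it assumes the data are in a data frame called my.data and that the textbook's chiplot function, which takes two numeric vectors, has already been entered into R.
for (i in 1:(ncol(my.data) - 1)) {
   for (j in (i + 1):ncol(my.data)) {
      chiplot(my.data[, i], my.data[, j])    # chi-plot for the (i, j) pair of variables
   }
}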

I am a little bit unsure of myself in reading contour plots as in 2.2 even after revisiting your lecture demonstration. What sort of information can be gathered about the relationships between pairs in contour plots?

In general, a contour plot tells you about the shape of a bivariate density. Is it symmetric? Is it skewed? Is there an association between the two variables? Is there more than one peak?

Is there a way to keep all images in R open when there are multiple images?

Yes, type and enter

windows()

in between the graphics commands. It will open a new graphics window and keep the old one open.
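For example (the plotting commands and data here are just placeholders):
hist(x1)      # first plot appears in the current graphics window
windows()     # opens a new, empty graphics window
hist(x2)      # second plot goes in the new window; the first one stays open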

I sometimes go back to certain parts of the lecture so as to see how certain operations are performed. I've had the option of fast-forwarding in some instances, but in others I can't. Do you have any idea of how I could control this option so I could fast-forward?

I talked to the director about this and she said that usually dragging the little arrow (that indicates the elapsed time of the recording) across from left to right allows you to fast-forward. BUT sometimes you have to click the "PAUSE" icon before dragging the arrow.

Questions for 2.1:
Should the variables be scaled or standardized? I think the answer is yes, since the variables seem unrelated, unlike the state.x77 data, which are all related to population.


By default, the "stars" function will automatically scale the variables so that all values fall within [0, 1]. The "faces" function also scales the variables by default. So you don't need to do separate scaling.
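For example, these two calls give the same picture, since scale = TRUE is the default (my.data is a placeholder name):
stars(my.data)                 # columns automatically rescaled to [0, 1]
stars(my.data, scale = TRUE)   # the same thing, with the scaling made explicit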

[Related to R:] Is there a good source for some basic code syntax or online tutorial that I can take advantage of? Thanks for any guidance you can give!

I understand that R can have a steep learning curve at first. If it's any consolation, I think the best way to learn a programming language is to spend some time struggling with it. That's part of the reason I assigned some "R programming" type problems on the first HW, as a learning tool.

As for online tutorials on R, one good source is a handbook that Dr. Don Edwards wrote that we use in the STAT 517 class here. Go to:
http://www.stat.sc.edu/~hitchcock/DE_Basics_R.doc
to find this handbook (as a Word document).

Also, you can go to:
http://www.stat.sc.edu/~hitchcock/stat517fall2008.html
Scroll down to:
Example R Code from Class
and click on some of the examples, copy them into R, and just see what happens to get a feel for the way the language works.

I was trying to work through the example on p. 60 of the text and was having trouble finding the lqs library. I think I saw something on the web about lqs being in MASS, so I did the following, but the matrices have different numbers from those found in the text. Do you have any idea why?

I checked and it looks like the cov.mve function is in the MASS package now. So with

library(MASS)
you are able to use cov.mve just as you did.

As for why the numbers are different, notice on page 60 (middle) it says, "Different estimates will result each time this code is used."
If you type
help(cov.mve)
you will see that the method involves random sampling of the data, so the exact numbers are not repeatable.
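If you want your own runs to be repeatable, you can set the random seed first. For instance (the seed value is arbitrary and my.data stands for whatever data matrix you are using):
library(MASS)
set.seed(100)      # fixing the seed makes the random resampling repeatable
cov.mve(my.data)   # now the same estimates come out every time these lines are run together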

When I tried the code on p. 52 of the textbook, I got different results from Table 3.2. The loadings I got for components 1 and 4 have the same absolute values as in the table, but different signs (I mean, component 1 in Table 3.2 is 0.33, 0.612, 0.578, 0.354, ..., but my results are -0.33, -0.612, -0.578, -0.354, ...). All the other parts are the same. My understanding is that the sign will not affect the results, but it may change the interpretation. And I cannot figure out the exact mathematical explanation. Could you please help me out with this problem?

Yes, these are essentially mathematically equivalent solutions. The reason is that we are looking for linear combinations of x_1,...,x_q that have the maximum possible variance. Note that when calculating the variance of a linear combination, the coefficients get squared in the calculation. So whether the coefficients are positive or negative, the variance of the linear combination will be the same.
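A quick numerical check in R (the covariance matrix here is made up):
S <- matrix(c(2, 1, 1, 3), nrow = 2)   # a made-up 2-by-2 covariance matrix
a1 <- eigen(S)$vectors[, 1]            # first eigenvector = coefficients of the first PC
t(a1) %*% S %*% a1                     # variance of the first principal component
t(-a1) %*% S %*% (-a1)                 # same variance when every coefficient's sign is flipped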

[A source of] confusion seems to be the result of reading several different books/papers on PCA. Some of these references refer to the eigenvectors as the principal components, which is the terminology I think you are using. Others (notably Everitt in "A Handbook of Statistical Analyses Using R") refer to the rotated observations as the principal components (I think we have been calling these the scores). Is there a problem in PCA with standardization of terminology, or is the problem with me?

I've been following the terminology of our textbook, which is: the principal components are the linear combinations of X_1, ..., X_q, the coefficients of which are taken from the eigenvectors. But when I say to write the "principal components", I mean to leave the linear combinations written in terms of X_1, ..., X_q.

The principal component scores for an observation are what we get when we plug that observation's x_1, ..., x_q values into each principal component, i.e., into each linear combination. Certainly in the HW problem you can't get the PC scores [when you are not given the raw data]. But you can write the PCs leaving them in terms of the X's.

I suppose you could think of it from a statistical standpoint as: The PC's are essentially random variables (since they are functions of X_1, ..., X_q), while the PC scores are the realized values of those random variables.
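In R's built-in princomp function, this distinction shows up as two separate components of the output (my.data is a placeholder for a numeric data matrix):
pc <- princomp(my.data)
pc$loadings   # the eigenvector coefficients, i.e., the principal components written in terms of X_1, ..., X_q
pc$scores     # the principal component scores, one row per observation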

What methods do you expect us to consider using on the mid-term? For example, do you expect us to consider using MDS?

The intention is that the midterm cover the topics in Chapters 1-4. However, you could use other methods if you feel they are suitable. Mainly focus on the topics from the first four chapters of the textbook, though.

In Chapter 4 we talked about the maximum likelihood estimate of the factors (MLFA) as well as principal factor analysis (PFA). Do we prefer one to the other?

It's mostly a matter of personal preference. I like the ML approach since it has a natural hypothesis test associated with it that helps determine the correct number of factors. On the other hand, ML is technically more restrictive in that it assumes the data follow a multivariate normal distribution, while the PF approach does not make that assumption.

When we plotted both the MLFA and PFA model fits (the life data or the WAIS data), they both seemed OK. Why would you choose one over the other?

I would say either one would be fine in that case.

How does rotating every factor except one affect the interpretation of the results?

I would say it improves interpretation, in that the single unrotated factor can be interpreted directly (usually as some general, all-purpose factor), and then the other factors can be rotated to get them closer to "simple" structure, which aids their interpretability.

Are there any 3D plotting functions in R with better visibility than "cloud"?

Some other options are the plot3d function in the rgl package and the scatterplot3d function in the scatterplot3d package. (Both of those packages need to be installed from the Internet.) You could try those and let me know.
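For example (x, y, and z stand for any three numeric variables from your data):
install.packages(c("scatterplot3d", "rgl"))   # one-time installation from CRAN
library(scatterplot3d)
scatterplot3d(x, y, z)                        # static 3-D scatterplot
library(rgl)
plot3d(x, y, z)                               # interactive plot that you can rotate with the mouse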

In factor analysis we looked mostly at the p-value and not at all at the cumulative variation explained. Is that not important? I would want a model that explains as much of the variation in the data as possible, wouldn't I?

That's important, but not as important as in PCA. Note that when we increase the number of factors to k+1 in FA, the first k factors change (unlike in PCA). So those first k factors might explain a different proportion of the variation than the k factors in a k-factor solution. Interpretation of factors is more crucial than the cumulative variation explained in FA.

How would you interpret [a factor] if there were equal numbers of [loadings] with opposite signs?

With such factors that have differing signs (bipolar factors), interpretation can be trickier. Often factor rotation can alleviate the problem. Anyway, if say, individuals with high scores on x1, x4, x8 tend to get high scores on this factor, and those with low scores on x3, x6, x7 tend to get high scores on this factor, maybe the factor is measuring the discrepancy between x1,x4,x8 and x3,x6,x7, whatever those variables are. [Like if "arm strength" has a positive loading and "leg strength" has a negative loading, the factor would be measuring the difference between arm strength and leg strength.]

We went over this snippet in class:

girls <- matrix(c(
21,21,14,13,8,
8,9,6,8,2,
2,3,4,10,10),
byrow=T, ncol=5, nrow=3,
dimnames = list(c('nbf','bfns','bfs'),c('AG1', 'AG2', 'AG3', 'AG4',
'AG5')))


temp <- corresp(girls,nf=2) # The two-dimensional solution.
biplot(temp)
abline(h=0)
abline(v=0)


I can't figure out where the coordinates for the points on the biplot are coming from. I would have assumed that the coordinates are the scores, which are shown below. However, they don't match up. For example, bfns is not at (-0.5116659, -1.717649325). So, where are the coordinates coming from (i.e., what are they)?

> temp
First canonical correlation(s): 0.37596222 0.08297209


Row scores:
[,1] [,2]
nbf -0.5142218 0.735371157
bfns -0.5116659 -1.717649325
bfs 1.9475881 0.002029264


Column scores:
[,1] [,2]
AG1 -0.9435410 0.663129426
AG2 -0.7706196 -0.003638259
AG3 -0.2747116 -0.001296972
AG4 0.7462718 -1.617752324
AG5 1.9069436 1.487224984


The points that are plotted are scaled versions of those coordinates. They are scaled by the "first canonical correlation" values that are given. (We will talk about canonical correlation in Chapter 8.)

So bfns is at
(-0.5116659*0.37596222, -1.717649325*0.08297209) = (-.19, -.14).

This is true for the row categories, although the column categories seem to be further scaled, so their positions don't exactly follow this pattern. Maybe this is an artifact of having to put the row and column categories on the same plot in a meaningful way?

The help file on corresp explains this a little, but not completely clearly. The book by Venables and Ripley has more details.
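If you want to verify the row coordinates numerically, something along these lines should work. This assumes the corresp object stores the row scores and canonical correlations in components named rscore and cor, which is my understanding of the help file:
temp$rscore %*% diag(temp$cor)   # each column of row scores times the corresponding canonical correlation; the bfns row comes out to about (-0.19, -0.14)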

It is not obvious to me whether or not we should standardize the data in each variable (divide each value of the variable by its range, for example) before we use model-based cluster analysis (Mclust). My first inclination is not to standardize, because the analysis routine is trying to model the variance in the data. Any advice?

I would agree with your assessment. The model-based clustering accounts for the differences in variances across variables in a way that the other methods do not. So I don't think it's necessary to standardize when using model-based clustering. It is sometimes recommended to transform some variables if the original data are not close to multivariate normal, but this is different than standardization.

In class you mentioned something to the effect that k-means cluster analysis was like Mclust with a constant variance (and zero covariance) across all clusters. With two variables this would make circles of constant size for all clusters (please confirm). However, it seems to me that with k-means the variance within a cluster will be constant but the variance from cluster to cluster can change. This will result in circles of varying sizes. Can you add anything to this without giving the end of the movie away?

K-means has structural similarities to model-based clustering with the equal volume, equal shape, spherical model (this was first pointed out in 1992 by Celeux and Govaert). As you point out, "With two variables this would make circles of constant size for all clusters." This doesn't mean that all the clusters that k-means produces will be exactly the same volume, but that is sort of the tendency. The difference is that this is sort of a by-product of k-means, whereas it is the motivating force behind the model-based clustering that assumes this covariance structure.

I know that when I want to exclude a column from an operation I put a negative sign in front of the column number, like table[, -1]. How do I exclude 2 columns?

If you want to remove, say, the first and the third columns from a matrix or data frame called my.table, just put
my.table[, -c(1,3)]
You can remove any number of columns this way by putting the column numbers, separated by commas, inside the c( ).

I used the command
fact(r=pain.corr,method="norm",rotation="none",maxfactors=4)
for example, and I did not get a p-value or a test statistic that I could use to choose the number of factors. What is wrong/missing in this command? Thanks.


The "fact" function will not give you the test statistic or P-value, but the built-in "factanal" function will do this, as long as you specify the number of observations (using the n.obs argument). See the first few examples in the Chapter 4 R code for examples.

For HW problem 6.3, my scatterplot matrix is only 5x5, but 9 variables are recorded in the pottery data set. Does R default to using only the first few columns in the scatterplot matrix when a certain number of variables is reached, or have I made a mistake?

That appears to be a weird default in the plot method for Mclust objects. If you want the full scatterplot matrix, you could always use the pairs function as we did in the non-model-based examples -- just change the name of the "clustering vector" to the clustering vector that you got from Mclust (i.e., the something$classification object).
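For instance, a sketch (the object names are placeholders for your own pottery data frame and Mclust result):
library(mclust)
pot.mclust <- Mclust(pottery.data)                     # model-based clustering of the pottery variables
pairs(pottery.data, col = pot.mclust$classification)   # full scatterplot matrix, colored by cluster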

Question 7.3(iii) resembles a "Grant's Tomb" question. My knee-jerk answer is to specify the prior probabilities in the "prior" argument of the "lda" routine. Is this what you were looking for?

Yes, you should definitely specify the prior probabilities in that "prior" argument. But additionally, you should give a brief rationale for the prior probabilities that you choose for this problem, for the SIDS group and for the healthy group.
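In R, that looks something like the following sketch. The group variable, data frame name, and the 0.2/0.8 values are purely illustrative placeholders, not the priors you should use -- think about what proportions are reasonable for the SIDS and healthy groups:
library(MASS)
lda(group ~ ., data = sids.data, prior = c(0.2, 0.8))   # prior = c(P(first level), P(second level)), in the order of the factor levels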