The csv file of the final scores and exam averages in a Stat 100 class many years ago can be downloaded here. The data are not the same as those on p. 29 of the Stat 200 notes. After downloading the csv file to your workspace, you can use the following command to load the data into R.
score <- read.csv("Stat100-Final_ExamAve_cleaned.csv")
The file has only two columns. The first column is the final score and the second column is the exam average. It has the scores for all of the Stat 100 students who took the final exam.
Suppose someone wants to find out the correlation between the exam average and final score for this particular Stat 100 class. She conducts a survey, choosing 100 students at random and asking them their exam averages and final scores, and uses the survey result to estimate the correlation coefficient for the whole population.
We can simulate this survey using R's built-in function sample.int(): sample.int(N,k) randomly generates k integers between 1 and N without replacement. As with other random number generators, the same set of random numbers is generated whenever you specify the same seed.
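Both properties (no repeated integers, and reproducibility under a fixed seed) can be checked with a toy call. This is only a sketch: the values N_toy = 10, k_toy = 4 and the seed 123 are arbitrary illustration choices, not part of the exercise.

```r
# Toy illustration of sample.int(); N_toy, k_toy and the seed are arbitrary.
N_toy <- 10
k_toy <- 4

set.seed(123)
first_draw <- sample.int(N_toy, k_toy)

set.seed(123)                              # same seed again
second_draw <- sample.int(N_toy, k_toy)

identical(first_draw, second_draw)         # same seed, same draw
anyDuplicated(first_draw) == 0             # no repeats: sampling is without replacement
all(first_draw >= 1 & first_draw <= N_toy) # every value lies between 1 and N_toy
```

The exercise code sets its own seed, so running this toy example first does not affect the answers.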
To perform the simulation, we use sample.int() to generate 100 random numbers between 1 and N, where N=1061 is the total number of students in this Stat 100 class. We then use these 100 numbers to subset the data frame score. Here is the code:
N <- nrow(score) # number of students
RNGversion("3.5.0")
set.seed(3485595)
subset <- sample.int(N,100)
score_subset <- score[subset, ]
Copy and paste the code above to your R console to run it. Please use the seed number given above; otherwise, your answers to the questions below won't match the computer's answers.
Note: R changed its random number generation for sample() starting in version 3.6. This question was created before that version. The command RNGversion("3.5.0") tells R to revert its random number generator to the version 3.5.0 behavior so that your results will match the computer's answers.
a. The final scores and exam averages in the data frame score_subset simulate the result of a survey. What is the correlation between final score and exam average in score_subset? Enter your answer to 4 decimal places.
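To see why the RNGversion() call matters, one can draw with the same seed under the old and the current sampling algorithm. This is a minimal sketch assuming R 3.6.0 or later (the sample.kind argument of RNGkind() does not exist in older versions). The seed 2024 and the sizes are arbitrary illustration values.

```r
# Compare the pre-3.6 and current sample() algorithms under one seed.
# The seed 2024 and the sizes are arbitrary illustration values.
suppressWarnings(RNGversion("3.5.0"))  # old "Rounding" sampler (R warns about it)
set.seed(2024)
draw_old <- sample.int(1000, 5)

RNGkind(sample.kind = "Rejection")     # sampler used by default since R 3.6.0
set.seed(2024)
draw_new <- sample.int(1000, 5)

print(draw_old)
print(draw_new)
identical(draw_old, draw_new)          # the two draws usually differ
```

The exercise code calls RNGversion("3.5.0") itself, so it switches back to the old sampler even after this comparison has been run.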
r =
b. Use the result in part (a) to compute the slope for predicting final score from the exam average from the subset data. Give your answer to 4 decimal places.
Slope =
c. The person who conducts the survey doesn't know the population r and slope, but you do, since you have loaded the data for the whole population into the data frame score. Use it to compute the population r and slope. Give your answers to 4 decimal places.
rpop =
Slopepop =
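Parts (a) to (c) rely on the link between the correlation and the regression slope, slope = r × SD_y / SD_x. The sketch below checks that relation against lm() on made-up data. The vectors x and y and every number in it are hypothetical and unrelated to the Stat 100 file.

```r
# Made-up data; x, y and all numbers here are hypothetical.
set.seed(1)
x <- rnorm(200, mean = 70, sd = 10)
y <- 5 + 0.9 * x + rnorm(200, sd = 6)

r_xy <- cor(x, y)
slope_from_r <- r_xy * sd(y) / sd(x)         # slope = r * SD_y / SD_x
slope_from_lm <- unname(coef(lm(y ~ x))[2])  # least-squares slope from lm()

all.equal(slope_from_r, slope_from_lm)       # the two agree up to rounding error
```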
The sample slope approximately follows a normal distribution if the sample size n is large enough. We can see the distribution of the slope by doing a simulation in R. It is also interesting to look at the distribution of the sample r, which does not follow a normal curve, as you will see.
Imagine that there are 10,000 researchers conducting the survey. Each of them randomly chooses 100 students from the Stat 100 class and asks them their final scores and exam averages. Each researcher will get a slightly different sample correlation r and corresponding sample slope. If we collect all values of the sample r and slope, we can plot histograms for r and slope and see how closely they resemble normal curves. We can simulate these hypothetical 10,000 survey results with the following code:
Nsim <- 1e4 # repeat the experiment 10,000 times
r <- rep(NA, Nsim) # initialize the r vector to 10,000 NAs
slope <- rep(NA, Nsim) # initialize the slope vector to 10,000 NAs
RNGversion("3.5.0")
set.seed(3485595)
for (i in 1:Nsim) {
  subset <- sample.int(N,100)
  score_subset <- score[subset,]
  r[i] <- cor(score_subset$Final, score_subset$ExamAve)
  slope[i] <- MISSING CODE
}
The lines r <- rep(NA,Nsim) and slope <- rep(NA,Nsim) create two vectors of length 10,000. Inside the loop, the lines r[i] <- cor(score_subset$Final, score_subset$ExamAve) and slope[i] <- MISSING CODE replace the ith element of the vectors r and slope with the correlation and slope of the subset data. After the code finishes, r and slope store the values of the sample r and sample slope from 10,000 surveys.
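The preallocate-then-fill pattern described here can be tried on a simpler statistic first. The following sketch uses a made-up population and the sample mean. The names population, n_rep and sample_mean are hypothetical and not part of the exercise.

```r
# Hypothetical population and statistic, used only to show the loop pattern.
set.seed(42)
population <- rexp(5000, rate = 1 / 50)  # made-up population values
n_rep <- 1000                            # number of repeated surveys
sample_mean <- rep(NA, n_rep)            # preallocate the result vector with NAs

for (i in 1:n_rep) {
  picked <- sample.int(length(population), 100)  # 100 distinct indices
  sample_mean[i] <- mean(population[picked])     # fill the i-th slot
}

anyNA(sample_mean)    # FALSE once every slot has been filled
summary(sample_mean)
```

Filling a vector of NAs makes it easy to spot a loop that stopped early, because any slot that was never assigned is still NA.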
d. What should MISSING CODE be?
r*sd(score_subset$Final)/sd(score_subset$ExamAve)
r[i]*sd(score_subset$Final)/sd(score_subset$ExamAve)
r*sd(score_subset$ExamAve[i])/sd(score_subset$Final[i])
r[i]*sd(score_subset$Final[i])/sd(score_subset$ExamAve[i])
r[i]*sd(score_subset$ExamAve)/sd(score_subset$Final)
None of the above
Copy and paste the code above to your R console to execute it. It may take fractions of a second to several seconds for R to execute the code, depending on your computer speed. Make sure to check that the first element of r, r[1], is exactly the same value as that obtained in part (a), and slope[1] is the same value as in part (b).
e. What are the minimum, maximum and mean of the sample slopes in the vector slope? Give your answers to 3 decimal places.
Minimum slope =
Maximum slope =
Mean slope =
The point of doing this simulation is not just to get the summary statistics, but to look at the distributions. Plot a histogram of 'slope' and superpose a normal curve with the same mean and sd as follows.
hist(slope, breaks=100, freq=FALSE)
abline(v=mean(slope), col="blue", lwd=2)
curve(dnorm(x,mean(slope),sd(slope)), col="red", add=TRUE)
The command abline(v=mean(slope), col="blue", lwd=2) draws a thick (lwd=2), blue vertical line at the mean of 'slope'.
You should see that the histogram follows the normal curve approximately.
f. For a large enough sample size, the standard error of the sample slope is given by . The actual standard error can be estimated from the simulation result using sd(slope). What is the estimated SEslope? Give your answer to 4 decimal places.
SEslope ≈ sd(slope) =
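A commonly used large-sample approximation is SE_slope ≈ sqrt((1 − r²)/(n − 2)) × SD_y/SD_x. Whether this is the exact form intended in part (f) is an assumption. The sketch below compares that approximation with the standard deviation of simulated slopes on a made-up population. All names and numbers in it are hypothetical.

```r
# Made-up population; every name and number here is hypothetical.
set.seed(7)
N_pop <- 20000
x_pop <- rnorm(N_pop, mean = 70, sd = 12)
y_pop <- 10 + 0.8 * x_pop + rnorm(N_pop, sd = 8)

n <- 100        # sample size of each simulated survey
n_rep <- 2000   # number of simulated surveys
slope_sim <- rep(NA, n_rep)
for (i in 1:n_rep) {
  idx <- sample.int(N_pop, n)
  # least-squares slope of y on x in this sample
  slope_sim[i] <- cov(x_pop[idx], y_pop[idx]) / var(x_pop[idx])
}

r_pop <- cor(x_pop, y_pop)
se_formula <- sqrt((1 - r_pop^2) / (n - 2)) * sd(y_pop) / sd(x_pop)
se_simulated <- sd(slope_sim)

c(formula = se_formula, simulated = se_simulated)
```

On data like these, where the scatter around the line is roughly constant, the two numbers should be close. Real exam data need not satisfy that assumption as well.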
The distribution of the sample r, on the other hand, is not normal. We expect the values of the sample r to cluster around rpop, which is about 0.8. But we also know that r must be between -1 and 1. Thus, there is not much room to the right of rpop, but plenty of room to the left. Hence, we expect the distribution to be left-skewed, meaning that it has a longer tail on the left.
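The left skew can also be quantified numerically. The sketch below uses a made-up population whose correlation is close to 0.8, not the Stat 100 data, and computes a moment-based skewness of the simulated sample correlations. All names in it are hypothetical.

```r
# Made-up population with correlation near 0.8; all names are hypothetical.
set.seed(11)
N_pop <- 20000
x_pop <- rnorm(N_pop)
y_pop <- 0.8 * x_pop + sqrt(1 - 0.8^2) * rnorm(N_pop)

n_rep <- 2000
r_sim <- rep(NA, n_rep)
for (i in 1:n_rep) {
  idx <- sample.int(N_pop, 100)            # one simulated survey of 100
  r_sim[i] <- cor(x_pop[idx], y_pop[idx])
}

# Moment-based skewness: negative values indicate a longer left tail.
skewness <- mean((r_sim - mean(r_sim))^3) / sd(r_sim)^3
skewness
```

A symmetric distribution such as the normal curve has skewness near zero, so a clearly negative value supports the left-skew argument.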
g. What are the minimum, maximum and mean of the sample correlations in the vector r? Give your answers to 4 decimal places.
Minimum r =
Maximum r =
Mean r =
h. Plot a histogram of r and superpose a normal curve with the same mean and sd on top of it. Which of the following plots is closest to the distribution of your sample r?
i. As you have seen, the sample correlations vary between min(r) and max(r). Estimate the probability that the sample correlation r obtained from a survey satisfies r>0.85. (Hint: use the 10,000 values of correlations stored in the vector r and recall the definition of probability.) Sanity check: your answer should be a number between 0 and 1 and have no more than 4 decimal places (why?).
P(r > 0.85) ≈
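The idea in the hint, estimating a probability by a proportion of simulated values, can be tried on a case where the exact answer is known. In this sketch the vector z and the cutoff 1 are hypothetical and unrelated to the exercise data.

```r
# Generic illustration with made-up draws; z and the cutoff 1 are hypothetical.
set.seed(99)
z <- rnorm(10000)

# A logical TRUE counts as 1, so the mean is (number of values above 1) / 10000.
p_hat <- mean(z > 1)
p_hat

pnorm(1, lower.tail = FALSE)   # exact probability for comparison, about 0.1587
```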