In a multivariable regression, the F statistic and its associated p-value tell us whether at least one of the variables used in the regression is significant. However, the calculation of that p-value relies on assumptions that may or may not hold. The randomization test provides an independent way to estimate the p-value. In this exercise, you will use the randomization test described in Chapter 22 of the Stat 200 notes to estimate the p-value and compare it with the p-value obtained from the F statistic.
You will use the Stat 100 Survey 2 data from Fall 2011. The csv file can be downloaded here. You can load it into R using the command
survey <- read.csv("Stat100_2011fall_survey02x.csv")
Information about the column variables can be found on this webpage.
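If you want to confirm that the file loaded correctly, you can inspect the data frame directly in R. A minimal sketch, assuming the csv file is in your working directory:

# Quick sanity checks on the loaded data frame
dim(survey)                                   # number of rows and columns
str(survey[, c("GPA", "partyHr", "fbHr")])    # the three variables used in this exercise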
a. Fit a linear model predicting a student's GPA from party hours per week (variable 'partyHr') and the number of hours spent on Facebook per day (variable 'fbHr'). What are the values of R2 and the p-value associated with the F statistic? Enter your answers to 4 significant figures. (A sketch of how to extract these quantities in R follows the choices below.)
R2 =
p-value associated with the F statistic =
This means that (select one):
all slopes and the intercept are significant.
none of the slopes nor the intercept is significant.
none of the slopes is significant.
at least one slope is significant.
all slopes are significant.
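As a sketch of how these quantities can be extracted in R (the object names fit, s and f below are illustrative): summary() of the fitted model gives R2 directly, and the p-value of the overall F test can be computed from the stored F statistic with pf().

fit <- lm(GPA ~ partyHr+fbHr, data=survey)
s <- summary(fit)
s$r.squared                              # R2 of the model
f <- s$fstatistic                        # F value and its two degrees of freedom
pf(f[1], f[2], f[3], lower.tail=FALSE)   # p-value associated with the F statistic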
Unlike the example given in Chapter 22 of the Stat 200 notes, the relatively large p-value calculated above means that you won't need a large number of randomization experiments to obtain an estimated p-value with decent accuracy. You will perform 5000 randomization experiments in which you scramble the students' GPA, fit a linear model and calculate R2 for each experiment. Then you count the fraction of experiments with R2 greater than the value computed in part (a). This fraction is the estimated p-value.
We use the sample() function to scramble the GPA. Recall that sample(x,k) randomly chooses k items from vector x without replacement. When the second argument is omitted, sample(x) is the same as sample(x,length(x)), which is a random permutation of the items in x. The following function takes a data frame as input (with survey as the default). It randomly scrambles the GPA values, fits a linear model predicting the scrambled GPA from partyHr and fbHr, then returns the R2 of that model. This can all be done in one line.
rand <- function(data_frame=survey) { summary(lm(sample(GPA) ~ partyHr+fbHr, data=data_frame))$r.squared }
A randomization experiment can be carried out by calling the rand() function: rand(survey), or simply rand() since the default argument is survey. The function returns the R2 of one randomization experiment. To obtain a decent estimate of the p-value, we want to repeat the experiment many times. Consider the following code.
R2model <- summary(lm(GPA ~ partyHr+fbHr, data=survey))$r.squared
RNGversion("3.5.0")
set.seed(5363364)
R2 <- MISSING CODE
The variable R2model stores the value of R2 of the original data, the one you computed in part (a). The vector R2 stores the values of R2 from the 5000 randomization experiments.
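Once R2 holds the 5000 values, the estimated p-value described above, the fraction of experiments whose R2 exceeds R2model, can be computed with a single comparison. A minimal sketch (the name p_est is illustrative):

p_est <- mean(R2 > R2model)   # fraction of randomized R2 values exceeding the original R2
p_est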
b. What could be the missing code that replaces MISSING CODE above? (Select all that apply)
rep(rand(), 5000)
replicate(rand(), 5000)
lapply(survey, rand, 1:5000)
mapply(rand, survey, 1:5000)
sapply(1:5000, rand)
replicate(5000, rand())
rep(5000, rand())
The rest of the questions will show up after you complete part (b).