Confidence intervals in a simple linear model are constructed by assuming that y = β0 + β1 x + ε, where the noise term ε is independent of x and has the same standard deviation at every value of x (homoscedasticity). What happens if the homoscedasticity condition is violated? In this problem, you are going to perform a simulation to explore this issue.
First we consider the case in which the homoscedasticity condition holds and verify that the confidence interval is accurate.
Copy and paste the R code and then execute it.
set.seed(61494383)
x <- runif(100,-1,1)
y <- 2 + 3*x + rnorm(100,sd=0.3)
By construction, x is a vector of length 100 whose values are random numbers uniformly distributed between -1 and 1. y is a vector of length 100, assigned according to 2+3*x plus a noise term following a normal distribution with zero mean and standard deviation 0.3. This simulates random sampling. If you fit a linear model predicting y from x with a very large sample size, you will get an intercept very close to the population intercept 2 and a slope very close to the population slope 3. Since the sample size is only 100 in this case, the sample intercept and slope can deviate quite a bit from the population intercept and slope.
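The large-sample claim can be checked directly. A minimal sketch (the sample size 1e5 and the seed 123 below are arbitrary illustrative choices, not part of the problem):

```r
set.seed(123)                               # arbitrary seed for this check
x_big <- runif(1e5, -1, 1)                  # much larger sample than 100
y_big <- 2 + 3*x_big + rnorm(1e5, sd = 0.3)
fit_big <- lm(y_big ~ x_big)
coef(fit_big)                               # intercept near 2, slope near 3
```

With 100,000 points the standard errors shrink to a few thousandths, so the fitted coefficients land very close to the population values 2 and 3.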
a. Fit a linear model predicting y from x, make a residual plot, and then construct an 80% confidence interval for the sample slope. Round your answer to 2 decimal places.
80% CI for the sample slope: from to
Is the population slope (= 3) inside this CI? yes no
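A sketch of the part (a) steps, regenerating the same data as above (the residual plot is drawn with base graphics; any plotting style works):

```r
set.seed(61494383)
x <- runif(100, -1, 1)
y <- 2 + 3*x + rnorm(100, sd = 0.3)

fit <- lm(y ~ x)                           # fit the linear model
plot(x, resid(fit))                        # residual plot
abline(h = 0, lty = 2)
round(confint(fit, "x", level = 0.8), 2)   # 80% CI for the sample slope
```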
If the residuals follow a normal distribution and the homoscedasticity assumption holds for the data, the probability that the population slope falls within the 80% CI of the slope estimated from a random sample is 0.8. To test this hypothesis, you will perform a simulation. Just like before, you generate 100 values of y according to 2+3*x+rnorm(100,sd=0.3), construct an 80% CI, and then check to see if slope_pop = 3 falls within the CI. You will do this 10,000 times and use the result to estimate the probability. The code for the simulation is as follows:
slope_pop <- 3  # population slope
# Initialize slope_in_CI to a vector containing 10,000 NA's
slope_in_CI <- rep(NA,10000)
set.seed(61494383)
x <- runif(100,-1,1)
for (i in 1:10000) {
  y <- 2 + 3*x + rnorm(100,sd=0.3)
  CI <- confint(lm(y~x), "x", level=0.8)
  # Check to see if slope_pop is inside CI
  s_in_CI <- (slope_pop > CI[1] & slope_pop < CI[2])
  slope_in_CI[i] <- s_in_CI
}
In the code above, CI[1] is the lower limit of the estimated 80% CI of the slope and CI[2] is the upper limit. If slope_pop is inside the CI, we have slope_pop > CI[1] and slope_pop < CI[2]. As a result, the logical variable s_in_CI is TRUE if slope_pop falls within the CI, and FALSE if slope_pop is outside the CI. The variable slope_in_CI is a logical vector of length 10,000, storing whether or not slope_pop falls within the CI in each of the 10,000 sampling experiments.
Copy and paste the code to your R's console to execute it. It may take several seconds to tens of seconds to run the code, depending on your computer's speed.
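Once the loop finishes, the coverage estimate in part (b) is simply the mean of the logical vector slope_in_CI, since mean() counts TRUE as 1 and FALSE as 0. A toy illustration (the length-5 vector below is a hypothetical stand-in for the real 10,000-entry result):

```r
# Toy stand-in for slope_in_CI (length 5 instead of 10,000)
toy <- c(TRUE, TRUE, FALSE, TRUE, TRUE)
mean(toy) * 100      # 80: the percentage of TRUEs
# With 10,000 entries, each trial contributes 1/10000 = 0.01 percentage
# points, which is why the answer has at most 2 decimal places.
```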
b. What is the percentage of times slope_pop falls within the CI? [Sanity check: it should be a number with no more than 2 decimal places (why?)]
Let's now perform a significance test.
Null Hypothesis H0: The 80% CI's computed by the linear models from the sample data are accurate. That is to say, the probability that slope_pop falls within a CI is 0.8.
Alternative Hypothesis HA: The probability that slope_pop falls within the CI calculated by the linear model from a sample ≠ 0.8.
c. Set up a box model assuming H0: consider a box with 100 tickets inside. 80 of them have '1' written on them and 20 have '0'. A ticket with '1' represents slope_pop falling within a CI (s_in_CI is TRUE) and a ticket with '0' represents slope_pop falling outside the CI (s_in_CI is FALSE).
The result stored in the logical vector slope_in_CI is equivalent to drawing tickets with / without replacement, with '0' representing FALSE and '1' representing TRUE.
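The box model itself can be simulated. Since the 10,000 sampling experiments are independent, the sketch below draws 10,000 tickets with replacement from the 80/20 box (the seed 42 is an arbitrary choice):

```r
set.seed(42)                                 # arbitrary seed
box <- c(rep(1, 80), rep(0, 20))             # 80 tickets '1', 20 tickets '0'
draws <- sample(box, 10000, replace = TRUE)  # 10,000 independent draws
mean(draws) * 100                            # close to the expected 80%
```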
d. Assume H0 is true. The expected percentage of slope_pop falling within the 80% CI's is %.
e. Calculate the standard error of the percentage, SE%. Give your answer to 2 decimal places.
(If you forget how to calculate SE%, review Chapter 14 and the short-cut SD formula of Stat 100 notes.)
SE% = %
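Under H0 the box is a 0-1 box with 80% ones, so the short-cut SD formula gives its SD as sqrt(0.8 × 0.2), and for 10,000 draws:

```r
sd_box <- sqrt(0.8 * 0.2)             # short-cut SD of a 0-1 box, 80% ones
se_pct <- sd_box / sqrt(10000) * 100  # SE of the percentage of 10,000 draws
se_pct                                # 0.4
```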
f. Calculate the Z score corresponding to the observed percentage in part (b). Give your answer to 2 decimal places.
Z score =
Adopt a null cutoff α = 5%. Reject the null? yes no
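The Z score in part (f) measures how far the observed percentage sits from the expected 80% in units of SE%. A sketch (obs_pct below is a hypothetical placeholder, not the actual simulation result; substitute your part (b) answer):

```r
se_pct  <- sqrt(0.8 * 0.2 / 10000) * 100  # SE% from part (e)
obs_pct <- 80.6                           # hypothetical placeholder value
z <- (obs_pct - 80) / se_pct
round(z, 2)
qnorm(0.975)   # two-sided 5% cutoff, about 1.96: reject if |z| exceeds it
```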
Now consider the case in which the homoscedasticity condition is violated. We want to know what happens if we still use the confidence interval based on the homoscedasticity assumption.
The following is a simulation similar to the previous one. The only difference is that y is set to 2+3*x+rnorm(100,sd=0.3*abs(x)). The added noise is still normally distributed with zero mean, but its standard deviation now grows from 0 at |x| = 0 to 0.3 at |x| = 1. The homoscedasticity condition is thus violated.
slope_pop <- 3  # population slope
# Initialize slope_in_CI to a vector containing 10,000 NA's
slope_in_CI <- rep(NA,10000)
set.seed(61494383)
x <- runif(100,-1,1)
for (i in 1:10000) {
  y <- 2 + 3*x + rnorm(100,sd=0.3*abs(x))
  CI <- confint(lm(y~x), "x", level=0.8)
  # Check to see if slope_pop is inside CI
  s_in_CI <- (slope_pop > CI[1] & slope_pop < CI[2])
  slope_in_CI[i] <- s_in_CI
}
Typical regression and residual plots (not reproduced here) clearly show that the homoscedasticity condition is violated.
g. What is the percentage of times slope_pop falls within the CI?
h. Repeat the hypothesis test above. Calculate the Z score corresponding to the observed percentage in part (g). Give your answer to 2 decimal places.
Adopt a null cutoff α = 0.1%. Reject the null? yes no