Browsing resource, all submissions are temporary.

Confidence Interval on Trial

Confidence intervals in a simple linear model are constrcuted by assuming that y=β₀ + β₁ x + ε, with the residuals ε independent of x (homoscedasticity). What happens if the homoscedasticity condition is violated? In this problem, you are going to perform a simulation to explore this issue.

First we consider the case in which the homoscedasticity condition holds and verify that the confidence interval is accurate.

Copy and paste the R code and then execute it.

set.seed(61494383)
x <- runif(100,-1,1)
y <- 2 + 3*x + rnorm(100,sd=0.3)

By construction, x is a vector of length 100 and its values are assigned by random numbers uniformly distributed between -1 and 1. y is a vector of length 100, which is assigned according to 2+3*x plus a noise term following a normal distribution with zero mean and standard deviation = 0.3. This is a simulation of random sampling. If you fit a linear model predicting y from x from a very large sample size, you will get an intercept very close to the population intercept 3 and a slope very close to the population slope 2. Since the sample size is 100 in this case, the sample intercept and slope should deviate quite a bit from the population intercept and slope.

a. Fit a linear model predicting y from x, make a residual plot, and then construct an 80% confidence interval for the sample slope. Round your answer to 2 decimal places.

80% CI for the sample slope: from to

Tries 0/3

Is the population slope (= 3) inside this CI?
yes
no

Tries 0/1

If the residuals follow a normal distribution and the homoscedasticity assumption holds for the data, the probability that the population slope falls within the 80% CI of the slope estimated from a random sample is 0.8. To test this hypothesis, you will perform a simulation. Just like before, you randomly choose 100 points in y according to 3+2*x+rnorm(100,sd=0.3) and then construct a 80% CI and then check to see if slope_pop=3 falls within the CI. You will do this 10,000 times and use the result to estimate the probability. The code for the simulation is as follows:

slope_pop <- 3 # population slope
# Initialize slope_in_CI to a vector containing 10,000 NA's
slope_in_CI <- rep(NA,10000) 
set.seed(61494383)
x <- runif(100,0,1)
for (i in 1:10000) {
   y <- 2 + 3*x + rnorm(100,sd=0.3)
   CI <- confint(lm(y~x), "x", level=0.8)
   # Check to see if slope_pop is inside CI
   s_in_CI <-  (slope_pop > CI[1] & slope_pop < CI[2])
   slope_in_CI[i] <- s_in_CI
}

In the code above, CI[1] is the lower limit of the estimated 80% CI of slope and CI[2] is the upper limit of the 80% CI. If slope_pop is inside the CI, we have slope_pop > CI[1] and slope_pop < CI[2]. As a result, the logical variable s_in_CI in the code above is TRUE if slope_pop falls within the CI, and FALSE if slope_pop is outside of the CI. The variable slope_in_CI is a logical vector of length 10000, storing the results of whether or not slope_pop falls within the CI in each of the 10,000 sampling experiments.

Copy and paste the code to your R's console to execute it. It may take several seconds to tens of seconds to run the code, depending on your computer's speed.

b. What is the percentage of times slope_pop falls within the CI? [Sanity check: it should be a number with no more than 2 decimal places (why?)]

%

Tries 0/3

Let's now perform a significance test.

Null Hypothesis H₀: The 80% CI's computed by the linear models from the sample data are accurate. That is to say that the probability that slope_pop falls within a CI is 0.8.

Alternative Hypothesis H_A: The probability of slope_pop falls within the CI calculated by the linear model from a sample data ≠ 0.8.

c. Set up a box model assuming H₀: consider a box with 100 tickets inside. 80 of them are written '1' and 20 of them are written '0'. A ticket with '1' represents slope_pop falls within a CI (s_in_CI is TRUE) and that with '0' represents slope_pop is outside the CI (s_in_CI is FALSE).

The result stored in the logical vector slope_in_CI is equivalent to drawing tickets replacement with '0' representing FALSE and '1' representing TRUE.

Tries 0/2

d. Assume H₀ is true. The expected percentage of slope_pop falling within the 80% CI's is %.

Tries 0/3

e. Calculate the standard error of the percentage, SE_%. Give your answer to 2 decimal places.

(If you forget how to calculate SE_%, review Chapter 14 and the shout-cut SD formula of Stat 100 notes.)

SE_% = %

Tries 0/5

f. Calculate the Z score corresponding to the observed percentage in part (b). Give your answer to 2 decimal places.

Z score =

Tries 0/3

Adopt a null cutoff α = 5%. Reject the null?
yes
no

Tries 0/1

Now consider the case in which the homoscedasticity condition is violated. We want to know what happens if we still use the confidence interval based on the homoscedasticity assumption.

The following is a simulation similar to the previous one. The only difference is that the value of y is to set 3+2*x+rnorm(100,sd=0.3*abs(x)). The added noise is normally distributed with zero mean but the standard deviation changes from 0 at |x|=0 to 0.3 at |x|=1. The homoscedasticity condition is thus violated.

slope_pop <- 3 # population slope
# Initialize slope_in_CI to a vector containing 10,000 NA's
slope_in_CI <- rep(NA,10000) 
set.seed(61494383)
x <- runif(100,-1,1)
for (i in 1:10000) {
   y <- 2 + 3*x + rnorm(100,sd=0.3*abs(x))
   CI <- confint(lm(y~x), "x", level=0.8)
   # Check to see if slope_pop is inside CI
   s_in_CI <-  (slope_pop > CI[1] & slope_pop < CI[2])
   slope_in_CI[i] <- s_in_CI
}

Copy and paste the code to your R's console to execute it. It may take several seconds to tens of seconds to run the code, depending on your computer's speed.

Typical regression and residual plots look like the following. You can clearly see that the homoscedasticity condition is violated.

regression plot residual plot

g. What is the percentage of times slope_pop falls within the CI?

%

Tries 0/3

h. Repeat the hypothesis test above. Calculate the Z score corresponding to the observed percentage in part (g). Give your answer to 2 decimal places.

Z score =

Tries 0/3

Adopt a null cutoff α = 0.1%. Reject the null?
yes
no

Tries 0/1