LON-CAPA Confidence Interval for the Mean

Suppose x is a numeric vector of length n representing a random sample drawn from a population. We are interested in estimating the population mean μ from the sample x. You have learned in other statistics courses that an unbiased estimator for μ is the sample mean x̄, where

$\overline{x} = \frac{1}{n}\sum\limits_{i=1}^n x_i = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$

According to the central limit theorem, if n is large, the distribution of x̄ approaches a normal distribution with mean μ and standard error SE = σ/√n, where σ² is the population variance. If we don't know the population variance, we can use the sample variance s² as a substitute, where

$s^2 = \frac{1}{n-1}\sum\limits_{i=1}^n (x_i-\overline{x})^2$
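As a quick check of the two formulas above, the following sketch computes the sample mean and sample variance directly from their definitions and compares them with R's built-in mean() and var(). The vector x here is a hypothetical toy sample chosen only to make the arithmetic easy to follow.

```r
# Hypothetical toy sample, chosen only for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample mean from the definition: sum of the values divided by n
xbar <- sum(x) / n

# Sample variance from the definition: note the n - 1 denominator
s2 <- sum((x - xbar)^2) / (n - 1)

# Both should agree with R's built-in functions
all.equal(xbar, mean(x))
all.equal(s2, var(x))
```

Note that var() in R already uses the n - 1 denominator, so no correction is needed when you call it.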

This allows us to construct a confidence interval (CI) for μ when n is large, as you have learned in Stat 100. In particular, if α is a number between 0 and 1, the 100(1-α)% confidence interval for μ is given by the formula

$\overline{x} \pm Z_{\alpha/2} \frac{s}{\sqrt{n}}$

where Z_α/2 is the Z-score such that P(Z > Z_α/2) = α/2 or equivalently P(Z < -Z_α/2) = α/2. This can be calculated using the qnorm() function: Z_α/2 = -qnorm(α/2).
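The relationship between the confidence level, α, and the Z-score can be checked numerically. The sketch below uses an illustrative level of 0.99; the variable names (level, alpha, z, z_alt, upper_tail) are our own choices and not part of the assignment.

```r
# Illustrative confidence level (an assumption for this sketch)
level <- 0.99
alpha <- 1 - level

# Z-score with upper-tail probability alpha/2
z <- -qnorm(alpha / 2)

# By symmetry of the normal distribution, this is the same quantile
z_alt <- qnorm(1 - alpha / 2)

# Verify the defining property P(Z > z) = alpha/2
upper_tail <- pnorm(z, lower.tail = FALSE)

all.equal(z, z_alt)
all.equal(upper_tail, alpha / 2)
```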

For example, suppose we are interested in the 95% CI. We have α=0.05 and Z_α/2 = Z_0.025 = -qnorm(0.025) = 1.95996398454005. So

95% CI = x̄ ± Z_0.025 s/√n.

In Stat 100, we usually round Z_0.025 to 2 and so the 95% CI is approximately the sample mean plus and minus 2 standard errors.
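To see how much the rounding matters, the sketch below computes the 95% interval once with the exact Z-score and once with the rounded value 2. The vector x is a hypothetical toy sample; it is far too small for the central limit theorem to apply and is used only to show the arithmetic.

```r
# Hypothetical toy sample (too small for a real large-sample CI)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Standard error of the mean, using the sample standard deviation
se <- sd(x) / sqrt(length(x))

# Exact Z-score for a 95% interval
z_exact <- -qnorm(0.025)

# Interval with the exact Z-score: mean minus and plus z * SE
ci_exact <- mean(x) + c(-1, 1) * z_exact * se

# Interval with the Stat 100 rounding of the Z-score to 2
ci_rounded <- mean(x) + c(-1, 1) * 2 * se

ci_exact
ci_rounded
```

Because 2 is slightly larger than the exact Z-score, the rounded interval is slightly wider than the exact one.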

Write a function named CImean that calculates the 100*level% CI for the mean for a given numeric vector x, where level is a number between 0 and 1 (e.g. level=0.95 for 95% CI). Your function should have 2 arguments: the vector x and the CI parameter level. Set the default value of level to 0.95 (95% CI). The function should return a vector of length 2 corresponding to the lower and upper values of the CI. A prototype of the function is as follows.

CImean <- function(x, level=0.95) {
# fill in the calculation of the 100*level % CI

# output the CI as a vector of length 2
}

Here is some example output from this function:

set.seed(23452)
CImean(runif(100))

[1] 0.4575869 0.5698480

set.seed(123456)
CImean(rchisq(120,5), 0.8)

[1] 4.808639 5.506952

The function that you write should be able to match these outputs.

a. (5 points) Code validation. Ideally, you would submit your function and a TA would look at it and test it to see if it's correct. To save time, however, you will do the following instead. After you complete your function CImean() and test it to your satisfaction, run the following code:

validate <- function() {
  x <- runif(sample(20:50,1))^(1/runif(1,0.1,2))
  level <- runif(1, 0.6,0.99)
  CImean(x,level)
}

RNGversion("3.5.0") 
set.seed(3575224)
y <- replicate(1000, validate())
u <- sum(8*y[1,] - 6*y[2,])

The code doesn't do any meaningful calculation. It is designed just for code validation. It calls your function CImean() 1000 times on randomly generated x and level parameters. The result is stored in the variable y, which is a 2×1000 matrix: row 1 holds the lower limits and row 2 holds the upper limits. The 2000 numbers are then combined to produce a single number u. Run the code with your function CImean() and look at the value of u. Your value will be compared with our value, which has been calculated using our CImean() function. If your function is correct, you should get the same value of u as we do. The set.seed() command makes sure that your calculation uses the same randomly generated x and level values as ours. Enter the value of u to 3 decimal places.
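If the shape of y is unclear, the following sketch shows how replicate() arranges the output of a function that returns a vector of length 2. The function two_numbers() is a hypothetical stand-in and has nothing to do with CImean(); only the layout of the result is of interest.

```r
# Hypothetical stand-in: returns two random numbers in increasing order
two_numbers <- function() sort(runif(2))

# Each call contributes one column, so 5 calls give a 2 x 5 matrix
y_demo <- replicate(5, two_numbers())
dim(y_demo)

# Row 1 holds the first element of every call, row 2 the second
y_demo[1, ]
y_demo[2, ]

# The same kind of row-wise combination used in the validation code
u_demo <- sum(8 * y_demo[1, ] - 6 * y_demo[2, ])
u_demo
```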

Note: R changed its random number generation for sample() starting in version 3.6. This question was created before that version. The command RNGversion("3.5.0") tells R to revert its random number generator to the version 3.5.0 behavior so that the result will match the computer's answer.
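RNGversion() changes the generator settings for the rest of your R session. The sketch below, which assumes R version 3.6.0 or later, shows one way to inspect the settings and restore them once you are done with this question. The variable names old_kind and kind_legacy are our own.

```r
# Record the current generator settings before changing anything
old_kind <- RNGkind()

# Switch to the pre-3.6 behavior; R >= 3.6.0 issues a warning here,
# which we suppress because the change is intentional
suppressWarnings(RNGversion("3.5.0"))

# The third element is the sample() method, now the legacy "Rounding"
kind_legacy <- RNGkind()
kind_legacy

# Restore the sample() method that was in use before
RNGkind(sample.kind = old_kind[3])
RNGkind()
```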

u =

A tumor is an abnormal growth of cells that serves no purpose. A tumor can be benign, which is harmless, or malignant, which is cancer. The following is a simulated data set containing 1000 observations of renal cortical tumors, a type of kidney tumor. The data can be downloaded here and then loaded into R using the command

tumor <- read.csv("tumor264.csv")

The first column, named 'size', is the tumor size in cm. The second column, named 'y', is an integer indicating whether the tumor is benign (y=0) or malignant (y=1).
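With this layout, the 0/1 coding of y makes group summaries straightforward: the mean of a 0/1 vector is the proportion of ones, and a logical comparison on y selects the sizes for one group. The sketch below illustrates this on a hypothetical toy data frame with the same column names; the values are made up and are not taken from tumor264.csv.

```r
# Hypothetical toy data with the same column names as the tumor data;
# these values are invented for illustration only
toy <- data.frame(
  size = c(1.2, 3.5, 2.1, 4.8, 2.9, 5.6),
  y    = c(0L, 1L, 0L, 1L, 0L, 1L)
)

# The mean of a 0/1 indicator is the proportion of ones
prop_malignant <- mean(toy$y)

# Logical indexing picks out the sizes for each group
size_benign    <- toy$size[toy$y == 0]
size_malignant <- toy$size[toy$y == 1]

prop_malignant
size_benign
size_malignant
```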

Enter all of your answers in the following questions to 3 decimal places.

b. Suppose that the data were gathered from a random sample of patients having renal cortical tumors. Estimate the proportion of malignant tumors by calculating the mean of y.

Proportion of malignant tumors =

c. Calculate the 99% confidence intervals for the average sizes of benign and malignant tumors.

99% CI for the average size of benign tumors = ( , ) cm

99% CI for the average size of malignant tumors = ( , ) cm
