There is no built-in function in R to perform a one-sample Z test, but the calculation is straightforward. In this problem, you will perform a one-sample Z test similar to Example 1 of these Stat 100 notes.
A large class has 1000 students and many discussion sections. The midterm exam scores are stored in this csv file. After downloading it to your R working directory, you can load it into R using the following command.
exam <- read.csv("exam0629.csv")
The file has two columns. The first column is the student's midterm exam score. The second column is the discussion section the student is in.
a. What are the mean and the population standard deviation of the exam scores? Give your answers to 3 decimal places.
Mean =
Population standard deviation =
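One detail matters for the population standard deviation: R's built-in sd() divides by n - 1, so it returns the sample SD rather than the population SD. The following is a minimal sketch of one way to get the population version. The helper name pop_sd and the vector x are made up for illustration and are not part of the exam data.

```r
# Hypothetical helper: population SD (divides by n, not n - 1).
pop_sd <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

# Made-up scores for illustration only; not the exam data.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)     # 5
pop_sd(x)   # 2
sd(x)       # sample SD, slightly larger than pop_sd(x)

# Equivalent: rescale the built-in sample SD.
n <- length(x)
sd(x) * sqrt((n - 1) / n)
```

For the exam data, the same helper could be applied to exam$score and the result rounded with round(..., 3).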
b. Plot a histogram of the exam scores. Which of the following best describes the distribution of the scores?
The distribution is right-skewed.
The distribution is left-skewed.
The distribution is approximately symmetrical about the mean and resembles a normal curve.
The distribution is bimodal.
c. How many discussion sections are there in this class?
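Counting distinct values of a column is a one-liner in R. The sketch below uses a made-up data frame named toy in place of the exam data. Its column names are assumptions (the text only says the section is the second column), so the column is accessed by position.

```r
# Made-up data frame standing in for the exam data; the column names
# here are assumptions, so the second column is accessed by position.
toy <- data.frame(
  score   = c(71, 85, 64, 90, 78, 88),
  section = c("AD1", "AD2", "AD1", "AD3", "AD2", "AD1")
)

sections <- toy[[2]]          # second column: discussion section
length(unique(sections))      # number of distinct sections
table(sections)               # number of students in each section
```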
d. It is suggested that the early morning section, section AD1, has more serious students and would therefore have higher scores. How many students are there in section AD1? What are the mean and population standard deviation of the exam scores for these students? Give your answers to 3 decimal places.
Number of students in section AD1 =
Average exam score in section AD1 =
Population sd of the exam scores in section AD1 =
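The three quantities above come from a logical subset of the score column. Here is a minimal sketch on made-up data. The data frame toy and its column names are hypothetical, and columns are accessed by position as described in the text (scores first, sections second).

```r
# Made-up data standing in for the exam file; column names are assumptions.
toy <- data.frame(
  score   = c(71, 85, 64, 90, 78, 88),
  section = c("AD1", "AD2", "AD1", "AD3", "AD2", "AD1")
)

# Scores of the students whose section is "AD1".
ad1 <- toy[[1]][toy[[2]] == "AD1"]

n_AD1    <- length(ad1)                       # number of students
mean_AD1 <- mean(ad1)                         # section average
sd_AD1   <- sqrt(mean((ad1 - mean_AD1)^2))    # population SD (divides by n)

round(c(n = n_AD1, mean = mean_AD1, sd = sd_AD1), 3)
```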
Consider the following null and alternative hypotheses:
H0: The average exam score for students in section AD1 is not significantly different from the overall mean. The observed difference between the two averages is only due to chance variation.
HA: The average exam score for the AD1 students is greater than the overall mean. The observed difference is too large to be explained by chance.
Note that this is a one-sided test.
e. Compute the Z statistic according to the formula
Z = (sAD1 - s)/SEdiff ,
where sAD1 is the average score for the AD1 students computed in part (d), s is the overall mean computed in part (a). The standard error of the difference is given by
SEdiff = SD/√nAD1 ,
where SD is the population standard deviation of the exam score computed in part (a) and nAD1 is the number of students in section AD1. Give your answer to 2 decimal places.
Z =
Do you reject H0 at the significance level α = 5%? (yes / no)
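The calculation in part (e) and the one-sided decision rule can be sketched as follows. All four input numbers are made up for illustration and should be replaced by the values from parts (a) and (d). The variable names are assumptions chosen to mirror the symbols in the formula.

```r
# Made-up summary numbers for illustration; substitute the values
# computed in parts (a) and (d).
s     <- 70    # overall mean
SD    <- 12    # population SD of all scores
s_AD1 <- 74    # average score in the section
n_AD1 <- 36    # number of students in the section

SE_diff <- SD / sqrt(n_AD1)      # standard error: 12 / 6 = 2
Z       <- (s_AD1 - s) / SE_diff # Z statistic: 4 / 2 = 2

# One-sided (upper-tail) p-value under the normal approximation.
p_value <- pnorm(Z, lower.tail = FALSE)

# Two equivalent ways to state the decision at the 5% level.
Z > qnorm(0.95)    # compare Z with the one-sided critical value (about 1.645)
p_value < 0.05     # compare the p-value with alpha
```

Because the alternative is one-sided, the critical value is qnorm(0.95) rather than the two-sided qnorm(0.975).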
The Z test uses the normal approximation. One way to do the hypothesis test without the normal approximation is by simulation. We can randomly draw nAD1 scores from the 1000 scores, calculate their mean, and repeat this many times to see how often the sample mean is greater than or equal to the average score calculated in (d). This gives an alternative estimate of the p-value.
First, we write a function that draws nAD1 scores from the 1000 scores without replacement and then calculates the sample mean. Here is a one-line function:
sample_mean <- function(scores, m) { mean(scores[sample.int(1000,m)]) }
This function draws m values from the vector scores without replacement and then returns their mean. The vector scores is assumed to have a length of 1000. Copy and paste the function, then run the following code, which repeats the draw of nAD1 scores 100,000 times.
RNGversion("3.5.0")
set.seed(8787549)
sampleMeans <- replicate(1e5, sample_mean(exam$score, n_AD1))
Here the variable n_AD1 is nAD1, the number of students in the AD1 section. After running the code, the variable sampleMeans is a numeric vector of length 100,000 storing the sample means of the 100,000 sampling experiments.
f. According to the central limit theorem, the sample mean approximately follows a normal curve with mean = s and sd = SD/√nAD1. Plot a density curve of the sample means in sampleMeans as a black line, and then superimpose the normal curve with this mean and sd as a red line. Which one of the following plots is closest to your plot?
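One way to overlay the two curves in base R is plot(density(...)) followed by curve(..., add = TRUE). The sketch below runs on a made-up population of scores with a smaller number of replications, so the numbers (seed, population, sample size m) are assumptions and not the exam data.

```r
set.seed(1)                                        # arbitrary seed for this toy example
scores <- round(rnorm(1000, mean = 70, sd = 12))   # made-up population of 1000 scores
m <- 30                                            # made-up section size

# Means of 2000 samples of size m, drawn without replacement.
means <- replicate(2000, mean(sample(scores, m)))

# Parameters of the normal curve suggested by the central limit theorem.
mu    <- mean(scores)
sigma <- sqrt(mean((scores - mu)^2)) / sqrt(m)     # population SD / sqrt(m)

# Black line: kernel density estimate of the simulated sample means.
plot(density(means), col = "black", main = "Sample means vs. normal curve")

# Red line: normal density with mean mu and sd sigma, drawn on the same axes.
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, col = "red")
```

With the exam data, means would be replaced by sampleMeans, and mu and sigma by the values from parts (a) and (d).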
g. Count the fraction of times the sample means in sampleMeans are greater than or equal to the mean score calculated in (d). This is the p-value estimated from the simulation.
Sanity check: your answer should be a positive number between 0 and 1 and have no more than 5 decimal places.
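In R, the fraction of elements satisfying a condition is the mean of the corresponding logical vector. Here is a minimal sketch with ten made-up simulated means; the names sim_means and observed are hypothetical stand-ins for sampleMeans and the section average from part (d).

```r
# Ten made-up simulated sample means, for illustration only.
sim_means <- c(68.2, 71.5, 70.1, 73.0, 69.4, 72.3, 70.8, 74.1, 69.9, 71.0)
observed  <- 72.3   # made-up observed section average

# TRUE counts as 1 and FALSE as 0, so the mean is the fraction of TRUEs.
p_sim <- mean(sim_means >= observed)
p_sim   # 0.3, since 3 of the 10 values are >= 72.3
```

With 100,000 simulated means, the fraction is a whole number of hundred-thousandths, which is why the answer has at most 5 decimal places.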