Stat 100 Survey Data

In this problem, you are going to look at the Stat 100 survey data from Fall 2008. You will be asked to type many R commands you haven't learned yet; you will learn them later in the course. The goal here is to give you a preview of using R to explore and analyze data. As always, if you are curious about any particular command, type ? followed by the command name (for example, ?read.csv) to pull up its help page.

The survey data can be downloaded here. The file is in comma-separated-value (csv) format, which many programs (e.g. Excel) can open. To load the data into R, you first need to copy the data file "Stat100_2008fall_survey02.csv" you downloaded to R's working directory. You can find R's current working directory using the getwd() command. If you have forgotten how to set your working directory, watch one of these videos again: for Windows, for Mac.

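After copying/saving the data file to R's working directory, load the data into R with the command
survey <- read.csv("Stat100_2008fall_survey02.csv", stringsAsFactors = TRUE)

The stringsAsFactors = TRUE argument matters in R 4.0.0 and later, where read.csv() no longer converts text columns to factors by default; the steps below assume that columns such as 'gender' are factors.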
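
As a small self-contained sketch of the same loading step, the code below writes a tiny made-up CSV file to a temporary folder and reads it back. The file name, column names, and values are invented for illustration and are not the real survey data.

```r
# Hypothetical stand-in for the survey file: three made-up rows.
toy_path <- file.path(tempdir(), "toy_survey.csv")
writeLines(c("gender,ACT,GPA",
             "Female,28,3.4",
             "Male,31,3.1",
             "Female,25,3.8"),
           toy_path)

# getwd() reports the working directory; file.exists() checks that the
# file really is at the path read.csv() is about to open.
getwd()
file.exists(toy_path)   # TRUE

# stringsAsFactors = TRUE stores text columns as factors.
toy <- read.csv(toy_path, stringsAsFactors = TRUE)
class(toy)          # "data.frame"
class(toy$gender)   # "factor"
```

If file.exists() returns FALSE for your own file, the file is most likely not in the directory that getwd() reports.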

Looking at the data

Type class(survey) to confirm that survey is a data frame. You can take a look at the data using View(survey). The meaning of each column variable is explained here. There are a number of things you can do to explore the data:

Type dim(survey) to confirm that the numbers agree with the information given on this webpage.
Type sum(is.na(survey)) to confirm that there are no missing values in the data frame.
Type names(survey) to see the column names and confirm that they match the information given on this webpage.
In addition to View(), you can also use survey$ followed by a column name to see the values in that column. For example, type survey$gender[1:20] and you will see the first 20 values in the 'gender' column. Type class(survey$gender) to confirm that the 'gender' column is a factor variable.
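
The checks in the list above can be tried on a small made-up data frame first. This is only a sketch: the data frame toy and its values are hypothetical, and one value is deliberately missing so that the is.na() count is not zero.

```r
# Made-up data frame standing in for 'survey'.
toy <- data.frame(
  gender = factor(c("Female", "Male", "Female", "Male")),
  ACT    = c(28L, 31L, NA, 25L),
  GPA    = c(3.4, 3.1, 3.8, 2.9)
)

dim(toy)              # 4 3  (rows, then columns)
names(toy)            # "gender" "ACT" "GPA"
sum(is.na(toy))       # 1, because one ACT value is missing
colSums(is.na(toy))   # missing values counted column by column
toy$gender[1:2]       # first two values of the 'gender' column
class(toy$gender)     # "factor"
```

When sum(is.na()) is not zero, colSums(is.na()) shows which columns contain the missing values.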

Rounding Instruction: For all LON-CAPA problems, when you are asked to round a number to, e.g., 4 decimal places, it means you should round it to at least 4 decimal places. For example, if the correct answer is 31.4159265358979 and you are asked to round it to 4 decimal places, numbers such as 31.4159, 31.41593, 31.415927, and 31.4159265358979 will be marked correct. Even 31.41594 will be accepted, since it is the same as 31.4159 when rounded to 4 decimal places. If you are asked to round the number to 4 significant figures, 31.42, 31.416, 31.4159265, etc. will be accepted, but 31.4 will be marked wrong.
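
The difference between decimal places and significant figures can be checked in R with round() and signif(). The sketch below uses the number from the rounding instruction; the variable name x is just for illustration.

```r
x <- 31.4159265358979

round(x, 4)    # 31.4159  (4 decimal places)
signif(x, 4)   # 31.42    (4 significant figures)

# 31.41594 agrees with x once both are rounded to 4 decimal places.
isTRUE(all.equal(round(31.41594, 4), round(x, 4)))   # TRUE

# 31.4 does not agree with x at 4 significant figures.
isTRUE(all.equal(signif(31.4, 4), signif(x, 4)))     # FALSE
```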

a. Type summary(survey$gender) to get a summary of the data in the 'gender' column. How many female students are there?

Tries 0/2

b. How many students said that they would vote for Barack Obama/Joe Biden? (Hint: apply the summary() function to the appropriate column)

Tries 0/2
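
For a factor column, summary() returns the number of observations at each level, which is what parts a and b rely on. Here is a minimal sketch with a made-up factor; the level names are hypothetical and do not come from the survey.

```r
# Hypothetical answers from five respondents.
vote <- factor(c("Ticket A", "Ticket B", "Ticket A", "Undecided", "Ticket A"))

summary(vote)                 # one count per level
table(vote)                   # the same counts, as a table
summary(vote)[["Ticket A"]]   # 3

# On a character vector, summary() reports only length, class and mode,
# so the counts above depend on the column being a factor.
summary(as.character(vote))
```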

c. Type class(survey$ACT) to confirm that the 'ACT' column is an integer vector. What is the median ACT score of the students taking the survey?

Median =

Tries 0/2
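
As a quick illustration of median() on an integer vector, the sketch below uses four made-up scores. With an even number of values, the median is the average of the two middle values, so it need not be a whole number.

```r
# Hypothetical scores, stored as integers (note the L suffix).
act <- c(24L, 30L, 27L, 32L)

class(act)     # "integer"
sort(act)      # 24 27 30 32
median(act)    # 28.5, the average of 27 and 30
summary(act)   # min, quartiles, median, mean and max together
```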

d. Type plot(GPA ~ party, data=survey) to make a plot of students' GPA versus average party hours/week. Try to guess the correlation coefficient between the party hours/week and GPA based on the plot. (No submission required)

e. Now use cor(survey$party,survey$GPA) to calculate the correlation. Give your answer to 3 decimal places.

Tries 0/2
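
To see what cor() computes, the sketch below compares it with the sample correlation formula written out by hand. The two vectors are made up for illustration and are not taken from the survey.

```r
# Hypothetical party hours/week and GPA for five students.
party <- c(0, 2, 4, 6, 8)
gpa   <- c(3.9, 3.6, 3.5, 3.1, 3.0)

r <- cor(party, gpa)

# Sum of products of deviations from the means, divided by
# (n - 1) times the two sample standard deviations.
n <- length(party)
r_manual <- sum((party - mean(party)) * (gpa - mean(gpa))) /
  ((n - 1) * sd(party) * sd(gpa))

all.equal(r, r_manual)   # TRUE
round(r, 3)              # negative: GPA falls as party hours rise here
```

The order of the two arguments does not matter: cor(party, gpa) and cor(gpa, party) give the same value.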

Type plot(drink ~ party, data=survey) to plot the students' average number of drinks/week versus their average party hours/week. You can also create split plots of the data. Type the following commands:

library(lattice)
xyplot(drink ~ party | gender, data=survey)

You should see two plots of drink ~ party: one for male students and the other for female students. Now try

xyplot(drink ~ party | ethnicity, data=survey)

to see split plots with ethnicity. Then type

xyplot(drink ~ party | ethnicity*gender, data=survey)

for split plots with ethnicity and gender.
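
In these commands, the formula y ~ x | g reads as "plot y against x, with one panel for each level of g". The sketch below builds such a plot from a made-up data frame (toy and its values are hypothetical) and stores it in a variable instead of drawing it right away.

```r
library(lattice)

# Hypothetical data: three students of each gender.
toy <- data.frame(
  party  = c(1, 3, 5, 2, 4, 6),
  drink  = c(0, 2, 5, 1, 3, 8),
  gender = factor(c("Female", "Female", "Female", "Male", "Male", "Male"))
)

# One panel per level of 'gender'.
p <- xyplot(drink ~ party | gender, data = toy)

class(p)              # "trellis"
nlevels(toy$gender)   # 2, so the plot has two panels

# print(p) draws the panels; typing p at the console does the same.
```

Conditioning on two factors at once, as in ethnicity*gender, gives one panel for each combination of their levels.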

f. You can see from the plots that there are students who claimed to party 50 hours/week and/or have 50 drinks/week. What are the ethnicity and gender of these students?
Males and females in all ethnic groups
White males
Females in "Other" ethnicity.
Hispanic/Latino males and females
Black males and females
Asian males

Tries 0/2

g. Type histogram(~drink |ethnicity, breaks=-0.5:50.5, data=survey) to create split plots of histograms of the average number of drinks/week with ethnicity. (If you get the error message 'Error: could not find function "histogram"', type library(lattice) and then retype the histogram command.) From these histograms, which ethnic group has the highest percentage of students who don't drink at all?
Hispanic/Latino
White
Black/African American
Other
Asian

Tries 0/2
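
In the histogram command, breaks=-0.5:50.5 places the bin edges at -0.5, 0.5, 1.5, and so on, so each whole number of drinks gets its own bar and the first bar counts exactly the students who reported 0 drinks. The same proportion can also be computed numerically. The sketch below does this with tapply() on made-up data; the group labels A and B are hypothetical.

```r
# Hypothetical drinks/week for two groups of four students each.
toy <- data.frame(
  drink = c(0, 0, 3, 5, 0, 2, 7, 1),
  group = factor(c("A", "A", "A", "A", "B", "B", "B", "B"))
)

# drink == 0 is TRUE/FALSE; the mean of a logical vector is the
# proportion of TRUE values, here computed separately per group.
prop_zero <- tapply(toy$drink == 0, toy$group, mean)
prop_zero   # A: 0.50, B: 0.25
```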

h. Type fit <- lm(drink ~ party, data=survey) to fit a linear model predicting average drinks/week from average party hours/week. The intercept and slope (the number under 'party') are given in the last line of the output when you type fit. What are the intercept and slope? Give your answer to 3 decimal places.

Intercept =

Tries 0/2

Slope =

Tries 0/2

This means that for every increase of 1 party hour/week, the predicted increase in the number of drinks/week is

Tries 0/2

You can plot the regression line and data together using the commands

plot(drink ~ party, data=survey)
abline(fit)
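
To see how the intercept and slope relate to predictions, here is a minimal sketch on made-up data that lies exactly on the line drink = 1 + 2 * party. The data frame toy is hypothetical; coef() is used to extract the two numbers that typing fit would print.

```r
# Hypothetical data lying exactly on the line drink = 1 + 2 * party.
toy <- data.frame(party = c(0, 1, 2, 3, 4),
                  drink = c(1, 3, 5, 7, 9))

fit <- lm(drink ~ party, data = toy)
coef(fit)   # (Intercept) 1, party 2, up to floating-point error

# Predictions at party = 2 and party = 3 differ by the slope.
pred <- predict(fit, newdata = data.frame(party = c(2, 3)))
diff(pred)  # 2
```

The intercept is the predicted number of drinks at party = 0, and the slope is the change in the prediction for each additional party hour.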