In this problem, you are going to look at Stat 100's survey data from Fall 2008. You will be asked to type many R commands you haven't learned yet; you will learn them later in the course. The goal here is to give you a preview of using R to explore and analyze data. As always, if you are curious about any particular command, type ? followed by the command name to pull up its help page.
The survey data can be downloaded here. The file is in comma-separated-value (csv) format. Many programs (e.g. Excel) can open this file. To load the data into R, you first need to copy the data file "Stat100_2008fall_survey02.csv" you downloaded to R's working directory. You can find R's current working directory using the getwd() command. If you forget how to set your working directory, watch one of these videos again: for Windows, for Mac.
getwd()
After copying/saving the data file to R's working directory, load the data into R with the command survey <- read.csv("Stat100_2008fall_survey02.csv")
survey <- read.csv("Stat100_2008fall_survey02.csv")
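If read.csv() complains that it cannot open the file, the file is most likely not in the working directory. The sketch below wraps the load in a simple existence check. Note that load_survey is a hypothetical helper name introduced here for illustration, and the demonstration rows are made up rather than taken from the real survey.

```r
# Hypothetical helper: read the CSV only if it exists where R is looking.
load_survey <- function(path = "Stat100_2008fall_survey02.csv") {
  if (!file.exists(path)) {
    stop("Cannot find '", path, "' in ", getwd(),
         "; copy the file there or change the working directory with setwd().")
  }
  read.csv(path)
}

# Demonstration with a throwaway file (made-up rows, not the real survey).
tmp <- tempfile(fileext = ".csv")
writeLines(c("gender,ACT", "Female,28", "Male,30"), tmp)
demo <- load_survey(tmp)
class(demo)  # "data.frame"
dim(demo)    # 2 rows, 2 columns
```

With the real file in place, survey <- load_survey() would behave the same as the read.csv() command above.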
Type class(survey) to confirm that survey is a data frame. You can take a look at the data using View(survey). The meaning of each column variable is explained here. There are a number of things you can do to explore the data:
class(survey)
survey
View(survey)
dim(survey)
sum(is.na(survey))
names(survey)
View()
survey$
survey$gender[1:20]
class(survey$gender)
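To see what these exploration commands report without scrolling through the full survey, here is a minimal sketch on a tiny made-up data frame. The name toy and all of its values are assumptions for illustration; only the column names gender, ACT and GPA come from the survey.

```r
# Made-up data frame standing in for 'survey'; values are illustrative only.
toy <- data.frame(
  gender = c("Female", "Male", "Female", "Male"),
  ACT    = c(28L, 30L, NA, 25L),
  GPA    = c(3.5, 3.1, 3.8, NA)
)

dim(toy)             # number of rows, then number of columns
names(toy)           # column names
sum(is.na(toy))      # total number of missing values in the whole data frame
colSums(is.na(toy))  # missing values counted column by column
toy$gender[1:2]      # first two entries of the 'gender' column
class(toy$ACT)       # "integer"
```

The $ operator followed by a column name pulls out one column as a vector, which is why survey$gender[1:20] shows the first 20 responses.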
Rounding Instruction: For all Lon Capa problems, when you are asked to round a number to, e.g., 4 decimal places, it means you should round it to at least 4 decimal places. For example, if the correct answer is 31.4159265358979 and you are asked to round it to 4 decimal places, numbers such as 31.4159, 31.41593, 31.415927 and 31.4159265358979 will be marked correct. Even 31.41594 will be accepted, since it is the same as 31.4159 when rounded to 4 decimal places. If you are asked to round the number to 4 significant figures, 31.42, 31.416, 31.4159265, etc. will be accepted, but 31.4 will be marked wrong.
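R can do both kinds of rounding for you. As a short illustration using the number from the rounding instruction, round() works in decimal places and signif() works in significant figures:

```r
x <- 31.4159265358979

round(x, 4)         # 4 decimal places: 31.4159
signif(x, 4)        # 4 significant figures: 31.42
round(31.41594, 4)  # also 31.4159, so it counts as the same answer
```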
a. Type summary(survey$gender) to get a summary of the data in the 'gender' column. How many female students are there?
summary(survey$gender)
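One caveat for newer R installations: since R 4.0.0, read.csv() keeps text columns as character vectors by default instead of converting them to factors. If summary(survey$gender) prints only Length, Class and Mode, the column was read as character. In that case table() gives the counts, or the file can be re-read with read.csv(..., stringsAsFactors = TRUE). A minimal sketch with made-up values (not the real survey responses):

```r
# Made-up values, not the real survey responses.
gender <- c("Female", "Male", "Female", "Female", "Male")

summary(gender)          # character vector: only Length / Class / Mode
table(gender)            # counts per category
summary(factor(gender))  # same counts, via a factor
```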
b. How many students said that they would vote for Barack Obama/Joe Biden? (Hint: apply the summary() function to the appropriate column)
summary()
c. Type class(survey$ACT) to confirm that the 'ACT' column is an integer vector. What is the median ACT score of the students who took the survey?
class(survey$ACT)
Median =
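The median can be computed with median(). Whether the ACT column contains missing values is not stated here, so the sketch below assumes it might and shows the na.rm argument on a made-up vector (the name act and its values are illustrative only):

```r
# Made-up ACT scores with one missing value.
act <- c(28L, 30L, NA, 25L, 32L)

median(act)                # NA, because one value is missing
median(act, na.rm = TRUE)  # 29, the median of 25, 28, 30, 32
```

If median(survey$ACT) returns NA, adding na.rm = TRUE drops the missing entries before the median is taken.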
d. Type plot(GPA ~ party, data=survey) to make a plot of students' GPA versus average party hours/week. Try to guess the correlation coefficient between the party hours/week and GPA based on the plot. (No submission required)
plot(GPA ~ party, data=survey)
e. Now use cor(survey$party,survey$GPA) to calculate the correlation. Give your answer to 3 decimal places.
cor(survey$party,survey$GPA)
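As a rough illustration of how cor() behaves, the sketch below uses made-up numbers (not survey data) in which GPA falls as party hours rise. It also shows the use argument, which is only needed if some pairs contain missing values; whether that applies to the survey columns is an assumption to check with sum(is.na(...)).

```r
# Made-up numbers chosen so that GPA falls as party hours rise.
party <- c(0, 2, 4, 6, 8, NA)
gpa   <- c(3.9, 3.6, 3.4, 3.0, 2.8, 3.2)

cor(party, gpa)                             # NA: one pair is incomplete
r <- cor(party, gpa, use = "complete.obs")  # drop incomplete pairs first
round(r, 3)                                 # negative, close to -1
```

A correlation near -1 or 1 means the points lie close to a straight line; values near 0 mean little linear association.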
Type plot(drink ~ party, data=survey) to plot the students' average number of drinks/week versus their average party hours/week. You can also create split plots of the data. Type the following commands:
plot(drink ~ party, data=survey)
library(lattice)
xyplot(drink ~ party | gender, data=survey)
You should see two plots of drink ~ party: one for male students and the other for female students. Now try
xyplot(drink ~ party | ethnicity, data=survey)
to see split plots with ethnicity. Then type
xyplot(drink ~ party | ethnicity*gender, data=survey)
for split plots with ethnicity and gender.
f. You can see from the plots that there are students who claimed to party 50 hours/week and/or have 50 drinks/week. What are the ethnicity and gender of these students?
Males and females in all ethnic groups
White males
Females in "Other" ethnicity
Hispanic/Latino males and females
Black males and females
Asian males
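Reading group membership off a plot can be cross-checked with a table. The sketch below is one possible way to do that, shown on a made-up data frame: the group labels "A" and "B" and all values are hypothetical, and only the column names party, drink, gender and ethnicity come from the commands above.

```r
# Made-up data; group labels "A" and "B" are placeholders, not real categories.
toy <- data.frame(
  party     = c(5, 50, 10, 3, 50),
  drink     = c(4, 20, 50, 0, 50),
  gender    = c("Female", "Male", "Male", "Female", "Male"),
  ethnicity = c("A", "B", "B", "A", "B")
)

# Keep rows where either variable reaches 50, then cross-tabulate.
extreme <- subset(toy, party >= 50 | drink >= 50)
table(extreme$ethnicity, extreme$gender)
```

Replacing toy with survey would apply the same filter to the real data.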
g. Type histogram(~drink |ethnicity, breaks=-0.5:50.5, data=survey) to create split plots of histograms of the average number of drinks/week with ethnicity. (If you get the error message 'Error: could not find function "histogram"', type library(lattice) and then retype the histogram command.) From these histograms, which ethnic group has the highest percentage of students who don't drink at all?
Hispanic/Latino
White
Black/African American
Other
Asian
histogram(~drink |ethnicity, breaks=-0.5:50.5, data=survey)
library(lattice)
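The height of the first bar in each histogram panel is the percentage of students reporting 0 drinks/week. As a numeric cross-check, the same percentages can be computed directly. This is a minimal sketch on made-up data; the group labels "A" and "B" and the name share_zero are hypothetical.

```r
# Made-up data; group labels are placeholders, not real categories.
toy <- data.frame(
  drink     = c(0, 3, 0, 0, 10, 2),
  ethnicity = c("A", "A", "B", "B", "B", "A")
)

# mean() of a logical vector is the fraction of TRUE values,
# so this gives the share of non-drinkers within each group.
share_zero <- tapply(toy$drink == 0, toy$ethnicity, mean)
round(100 * share_zero, 1)  # percentages by group
```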
h. Type fit <- lm(drink ~ party, data=survey) to fit a linear model predicting average drinks/week from average party hours/week. The intercept and slope (the number under 'party') are given in the last line of the output when you type fit. What are the intercept and slope? Give your answer to 3 decimal places.
fit <- lm(drink ~ party, data=survey)
fit
Intercept =
Slope =
This means that for every 1-hour increase in party hours/week, the predicted increase in the number of drinks/week is
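The coefficients can also be pulled out of a fitted model with coef(), which makes rounding easier. The sketch below uses made-up data that lie exactly on a known line, so the result is easy to check; toy and toy_fit are hypothetical names, and the numbers say nothing about the real survey.

```r
# Made-up data lying exactly on the line drink = 2 + 3 * party.
toy <- data.frame(party = c(0, 1, 2, 3, 4))
toy$drink <- 2 + 3 * toy$party

toy_fit <- lm(drink ~ party, data = toy)
coef(toy_fit)            # named vector: (Intercept) and party
round(coef(toy_fit), 3)  # rounded to 3 decimal places

# Raising party by 1 raises the prediction by exactly the slope.
p <- predict(toy_fit, newdata = data.frame(party = c(1, 2)))
diff(p)                  # equals the slope
```

The last two lines illustrate the interpretation of the slope: the difference between predictions one party hour apart is the slope itself.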
You can plot the regression line and data together using the commands
plot(drink ~ party, data=survey)
abline(fit)