
Shoe Size and Number of Pairs of Shoes Owned

In this problem, you will analyze the data set shown on p. 21 of your Stat 200 notes (Fall 2017 edition). Download the Stat 100 Spring 2013 Survey 1 data here (right-click and choose "Save link as...") and save it in your R working directory. Type

survey <- read.csv("Stat100_2013spring_survey01.csv")

to load the data. The meanings of the column variables are given on this webpage. Explore the data set.

Type cor(survey$Shoe_number,survey$Shoe_size) and confirm that you get the same correlation (-0.3289) as stated in the notes. Type

plot(Shoe_number ~ Shoe_size, data=survey, xlim=c(0,16))

to make the plot shown on p.21 of your notes. Next, fit a regression line using the command

fit <- lm(Shoe_number ~ Shoe_size, data=survey)

Type abline(fit) to superpose the regression line on the plot (see footnote 1).
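As a minimal sketch of how the coefficients of a fitted line can be read off in code, the snippet below uses a small made-up data frame. The names toy, toy_fit, toy_coef and slope_from_r and all the numbers are hypothetical and are not the survey data. It also checks the standard relation for simple regression, slope = r * sd(y) / sd(x), which ties the slope to the correlation computed with cor().

```r
# Toy data, invented for illustration only (NOT the Stat 100 survey).
toy <- data.frame(
  Shoe_size   = c(6, 7, 8, 9, 10, 11, 12),
  Shoe_number = c(20, 18, 15, 12, 8, 6, 5)
)

# Fit the regression line Shoe_number = intercept + slope * Shoe_size.
toy_fit <- lm(Shoe_number ~ Shoe_size, data = toy)

# coef() returns a named vector: "(Intercept)" and "Shoe_size" (the slope).
toy_coef <- coef(toy_fit)
print(round(toy_coef, 2))

# For simple regression, slope = r * sd(y) / sd(x).
r <- cor(toy$Shoe_number, toy$Shoe_size)
slope_from_r <- r * sd(toy$Shoe_number) / sd(toy$Shoe_size)
print(all.equal(unname(toy_coef["Shoe_size"]), slope_from_r))
```

With the real data, the same calls would be applied to fit and survey in place of toy_fit and toy.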

a. Type fit to get the intercept and slope (number below Shoe_size) of the regression line. Enter the coefficients (to 2 decimal places) below.

Intercept =             Slope =


It is suggested in the notes that gender is a confounder for the apparent negative correlation between shoe number (the number of pairs of shoes owned) and shoe size. To test this idea, let's first make a split plot of shoe number against shoe size, with one panel per gender. Type the following commands (see footnote 2):

library(lattice)
xyplot(Shoe_number ~ Shoe_size | Gender, data=survey, layout=c(1,2))

You should see that there does not seem to be any correlation between shoe number and shoe size within either panel of the split plot. Let's now analyze the problem quantitatively by breaking the data set into male and female subgroups. This can be done using the subsetting technique you've just learned:

male <- survey[survey$Gender=="male", ]
female <- survey[survey$Gender=="female", ]

Here male and female are new data frames. Try

plot(Shoe_number ~ Shoe_size, data=male)

and plot(Shoe_number ~ Shoe_size, data=female) to confirm that you get the same plots as the two panels of the split plot.
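One detail of logical subsetting is worth a quick sketch: if the condition is NA for some row, bracket indexing returns a row filled with NA rather than dropping it. Whether the survey file has missing Gender entries is not stated here, so the example below uses a made-up data frame (toy and the by_* names are hypothetical) with one missing value put in on purpose. It compares bracket indexing with which() and subset(), both of which skip the NA.

```r
# Toy data, invented for illustration only (NOT the Stat 100 survey).
# One Gender value is deliberately missing (NA).
toy <- data.frame(
  Gender      = c("male", "female", NA, "male", "female"),
  Shoe_size   = c(10, 7, 9, 11, 8),
  Shoe_number = c(4, 15, 7, 6, 20)
)

# Logical indexing keeps rows where the condition is TRUE.
# A row whose condition is NA comes back as a row filled with NA.
by_index <- toy[toy$Gender == "male", ]

# which() drops the NA, so only the genuine matches are kept.
by_which <- toy[which(toy$Gender == "male"), ]

# subset() also treats an NA condition as "do not keep".
by_subset <- subset(toy, Gender == "male")

print(nrow(by_index))
print(nrow(by_which))
print(nrow(by_subset))
```

If the row counts of the subgroups do not add up to nrow(survey), missing values in the grouping column are one possible reason to check.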

b. Compute the correlation between shoe numbers and shoe size for each subgroup. Give your answer to 4 decimal places.

Correlation of shoe numbers and shoe size for males =


Correlation of shoe numbers and shoe size for females =


So you see that gender is indeed the confounder.
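To see how a confounder can create a correlation in pooled data that is absent within each group, here is a small sketch with made-up numbers. The data frame toy is hypothetical and was built so that each gender has zero correlation on its own, while the two groups sit at different average shoe sizes and shoe numbers.

```r
# Toy data, invented for illustration only (NOT the Stat 100 survey).
# Males: larger shoe sizes, fewer pairs. Females: smaller sizes, more pairs.
toy <- data.frame(
  Gender      = rep(c("male", "female"), each = 4),
  Shoe_size   = c(9, 10, 11, 12, 6, 7, 8, 9),
  Shoe_number = c(5, 3, 3, 5, 15, 10, 10, 15)
)

# Correlation in the pooled data.
r_all <- cor(toy$Shoe_number, toy$Shoe_size)

# Correlation within each gender: split the rows by Gender,
# then apply cor() to each piece.
r_by_gender <- sapply(split(toy, toy$Gender),
                      function(d) cor(d$Shoe_number, d$Shoe_size))

print(round(r_all, 4))
print(round(r_by_gender, 4))
```

The pooled correlation is negative only because the two group centers differ. The same split() and sapply() pattern applied to survey gives both subgroup correlations in one call.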

c. Fit a linear model for each of the male and female subgroups using the commands

fit_male <- lm(Shoe_number ~ Shoe_size, data=male)
fit_female <- lm(Shoe_number ~ Shoe_size, data=female)

Type fit_male and fit_female to see the coefficients. Enter the answers to two decimal places.

Male:     intercept =         slope =


Female:     intercept =         slope =


You should see that the slope for males is close to 0, indicating that there is little linear relationship between shoe size and shoe number. The slope for females is larger, but it is not statistically significant. This information is reported by summary(fit_male) and summary(fit_female), as you will learn later.
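As a preview, and only as a sketch on made-up numbers, the snippet below shows where the significance information sits in the output of summary(). The names toy, toy_fit, coef_table and slope_p are hypothetical, and the 0.05 cutoff is just the conventional choice, assumed here for illustration.

```r
# Toy data, invented for illustration only (NOT the Stat 100 survey).
toy <- data.frame(
  Shoe_size   = c(6, 7, 8, 9, 10, 11, 12, 13),
  Shoe_number = c(12, 9, 14, 10, 13, 9, 12, 11)
)

toy_fit <- lm(Shoe_number ~ Shoe_size, data = toy)

# summary() of an lm fit contains a coefficient table with the columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)".
coef_table <- summary(toy_fit)$coefficients
print(coef_table)

# The p-value for the slope is in the row named after the predictor.
slope_p <- coef_table["Shoe_size", "Pr(>|t|)"]

# TRUE would mean the slope differs significantly from 0 at the 5% level.
print(slope_p < 0.05)
```

The same indexing applied to summary(fit_male) and summary(fit_female) pulls out the slope p-values for the two subgroups.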

Finally, just for fun, here is some code that superposes the regression lines on top of the split plots for males and females (see footnote 3):

xyplot(Shoe_number ~ Shoe_size | Gender, data=survey, layout=c(1,2),
  panel = function(x, y, ...) {
       panel.xyplot(x, y, ...)
       panel.lmline(x, y, col = "red")
  })

Footnotes

1. You can also make a similar plot using the following ggplot2 commands.

library(ggplot2)
ggplot(survey, aes(Shoe_size,Shoe_number)) + geom_point() + geom_smooth(method="lm", se=FALSE)

2. You can also make a similar plot using the ggplot2 command

ggplot(survey, aes(Shoe_size, Shoe_number)) + geom_point() + facet_wrap(~Gender, nrow=2)

3. You can also make a similar plot using the ggplot2 command

ggplot(survey, aes(Shoe_size, Shoe_number)) + geom_point() + facet_wrap(~Gender, nrow=2) + geom_smooth(method="lm", se=FALSE)