In this problem you will analyze the Stat 100 survey 3 data from Fall 2016. We are interested in seeing which variables are good predictors of whether a student supported Donald Trump. The data can be downloaded here and then loaded into R using the following command.
survey <- read.csv("stat100_2016fall_survey03.csv")
A description of the column variables can be found on this website.
The column named 'candidate' records students' answers to the question "If you had to vote (because there was a penalty if you didn't) which of the 2016 Presidential candidates would you vote for (assume you're eligible to vote even if you're not)?"
Create a new column named 'y'. Set y to 1 for Donald Trump supporters and 0 for students who didn't support Trump.
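As a minimal sketch of one way to do this, the code below builds a 0/1 column on a small made-up data frame. It assumes that the 'candidate' column stores the candidate's name as text and that Trump supporters are recorded exactly as "Donald Trump"; both are assumptions, so check the actual coding with table(survey$candidate) before adapting it.

```r
# Toy stand-in for the survey data. The labels below are assumed, not taken
# from the real file; check the real coding with table(survey$candidate).
toy <- data.frame(candidate = c("Donald Trump", "Hillary Clinton",
                                "Unsure", "Donald Trump", "Gary Johnson"))

# The comparison gives TRUE/FALSE; as.integer() turns that into 1/0.
toy$y <- as.integer(toy$candidate == "Donald Trump")

# Quick check: counts of 0s and 1s.
table(toy$y)
```

The same pattern applied to the real data frame would be survey$y <- as.integer(survey$candidate == "Donald Trump"), provided the label matches.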
Columns 21–31 correspond to students' responses on the importance of various political issues. These could be good predictors of the 'y' variable. After setting the 'y' variable, you can fit a logistic regression predicting the probability that a student is a Trump supporter P(y=1) from their ratings on these issues. You can use the command
fit1 <- glm(y ~ environment + terrorism + economy + racism + policeBrutality + lawAndOrder + genderEquality + borderSecurity + familyValues + bigMoney + gunRights + devil, data=survey, family=binomial)
to fit a logistic regression. In R, regression can also be carried out by putting the predictors in a matrix. For example, the following command does the same logistic regression as the command above.
fit1 <- glm(survey$y ~ as.matrix(survey[,c(21:31,3)]), family=binomial)
Note that we put 'devil' (column 3) in the model above because we find that Trump supporters tend to believe in the devil.
a. Type summary(fit1) to see the coefficients. They are the coefficients of the ln(odds). Fill in the missing items in the following table. Enter your answers to 3 decimal places.
summary(fit1)
b. We see from the p-values that all slopes, except 'policeBrutality', are significant. The reason why 'policeBrutality' is not significant is that it is highly correlated with the variable 'racism'. Calculate the correlation between 'racism' and 'policeBrutality'. Enter your answer to 3 decimal places.
Correlation =
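A correlation like this can be computed with R's cor() function. The sketch below uses simulated ratings, not the survey data: the two toy vectors are hypothetical and are built so that one tracks the other closely.

```r
# Simulated 0-10 ratings (hypothetical data, for illustration only).
set.seed(1)
racism <- sample(0:10, 200, replace = TRUE)

# Build a second rating that follows the first, plus a little noise,
# clipped so it stays within 0-10.
policeBrutality <- pmin(10, pmax(0, racism + sample(-2:2, 200, replace = TRUE)))

# Pearson correlation between the two ratings, rounded to 3 decimal places.
r <- cor(racism, policeBrutality)
round(r, 3)
```

On the real data the call would be cor(survey$racism, survey$policeBrutality). If either column contains missing values, cor() returns NA unless you pass use = "complete.obs".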
c. Knowing that 'policeBrutality' is highly correlated with 'racism', we can remove 'racism' from the model and see if 'policeBrutality' becomes significant. Type the following command:
fit2 <- glm(survey$y ~ as.matrix(survey[,c(21:23,25:31,3)]), family=binomial)
Type summary(fit2) and look at the p-value associated with 'policeBrutality'. Is the slope for 'policeBrutality' significant now? (no / yes)
summary(fit2)
Go back to the model in (a). It has 12 variables, and we know we can drop at least one of them without significantly affecting the accuracy. The question is whether we can do better. Can we combine these variables and construct a simpler model?
We see that some slopes are negative and some are positive. A negative slope means that the probability P(y=1) decreases when the value of the variable increases, keeping the other variables fixed; a positive slope means that P(y=1) increases when the value of the variable increases, keeping the other variables fixed. All the variables represent ratings ranging from 0 to 10. We can construct a new set of variables (x1, x2, ..., x12) as follows.
x1 <- 10 - survey$environment
x2 <- survey$terrorism
x3 <- 10 - survey$economy
x4 <- 10 - survey$racism
x5 <- 10 - survey$policeBrutality
x6 <- survey$lawAndOrder
x7 <- 10 - survey$genderEquality
x8 <- survey$borderSecurity
x9 <- survey$familyValues
x10 <- 10 - survey$bigMoney
x11 <- survey$gunRights
x12 <- survey$devil
Copy and paste the code into your R console. These new variables are constructed to reverse the direction of the ratings for the variables with negative slopes. For example, x1 is set to 10 - environment. The rating x1=0 means Environmental Protection is extremely important to the student, and x1=10 means Environmental Protection is not at all important to the student. If we now fit a logistic regression predicting P(y=1) from x1, x2, ..., x12, all the slopes become positive. This means that the larger the value of any one of the new variables, the more likely the student is a Trump supporter. However, the new model is equivalent to the old model: it gives the same predicted probabilities, and only the signs of the reversed slopes (and the intercept) change. Now we construct a new variable x, which is the average of x1, x2, ..., x12. We create a new column in the data frame using the following command:
survey$x <- (x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12)/12
This x variable can range from 0 to 10, and it combines all the ratings relevant to whether or not a student is a Trump supporter.
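To see why reversing a rating does not change the model, here is a small sketch on simulated data. The variables a and b and the coefficients used to generate y are hypothetical. Replacing a with 10 - a flips the sign of its slope and shifts the intercept, but the fitted probabilities stay the same.

```r
# Simulated data (hypothetical): 'a' lowers P(y=1), 'b' raises it.
set.seed(2)
n <- 300
a <- sample(0:10, n, replace = TRUE)
b <- sample(0:10, n, replace = TRUE)
y <- rbinom(n, 1, plogis(-0.5 - 0.3 * a + 0.3 * b))

# Original model, then the same model with 'a' reversed.
fit_old <- glm(y ~ a + b, family = binomial)
a_rev   <- 10 - a
fit_new <- glm(y ~ a_rev + b, family = binomial)

# The slope of a_rev is the negative of the slope of a;
# the new intercept is the old intercept plus 10 times the old slope of a.
coef(fit_old)
coef(fit_new)

# The predicted probabilities agree up to numerical tolerance.
all.equal(fitted(fit_old), fitted(fit_new))
```

Averaging the reversed variables into a single x is a different step: it forces all 12 ratings to share one slope, so the one-variable model is a genuine simplification rather than a reparametrization.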
d. Fit a logistic regression predicting y from x using the command fit3 <- glm(y ~ x, data=survey, family=binomial). Type summary(fit3) to look at the coefficients. Fill in the following table. Enter your answers to 3 decimal places.
fit3 <- glm(y ~ x, data=survey, family=binomial)
summary(fit3)
The predicted probability of the model in (d) for each student is stored in the variable 'fit3$fitted.values'. You can make a plot of the predicted probability versus x by the command
plot(fit3$fitted.values ~ x, col=y+1, data=survey, pch="*")
You should see the predicted probabilities trace out a nice S-shaped curve. The command 'col=y+1' above tells R to plot the Trump supporters (y=1) in red and non-Trump supporters (y=0) in black. In R's base graphics, col=1 represents the black color and col=2 represents the red color. That's why we add 1 to the y variable. You can see that the Trump supporters tend to have larger values of x, as expected.
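The S shape comes from the logistic function: the model's ln(odds) is intercept + slope * x, and the predicted probability is 1 / (1 + exp(-(intercept + slope * x))). The sketch below checks this on simulated data; the data and the coefficients used to generate it are made up and are not the survey results.

```r
# Simulated one-predictor data (hypothetical values).
set.seed(3)
n <- 300
x <- runif(n, 0, 10)
y <- rbinom(n, 1, plogis(-6 + 1.2 * x))

fit <- glm(y ~ x, family = binomial)
b   <- coef(fit)   # b[1] is the intercept, b[2] is the slope

# plogis(z) computes 1 / (1 + exp(-z)), turning ln(odds) into a probability.
p_manual <- plogis(b[1] + b[2] * x)

# These match the probabilities stored by glm().
all.equal(unname(p_manual), unname(fit$fitted.values))
```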
e. How does the one-variable model in (d) compare to the 12-variable model in (a)? One way to measure the "goodness of fit" is to calculate McFadden's R2. Enter the values of McFadden's R2 for the models in (a) and (d) to 3 decimal places.
McFadden's R2 for model in (a) =
McFadden's R2 for model in (d) =
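McFadden's R2 is defined as 1 - ln L(model) / ln L(null), where the null model contains only an intercept. For a 0/1 response this equals 1 - (residual deviance) / (null deviance), and both deviances are stored in the glm fit. The sketch below computes it both ways on simulated data; the data are hypothetical, and you should use whichever form your course materials specify.

```r
# Simulated one-predictor data (hypothetical values).
set.seed(4)
n <- 300
x <- runif(n, 0, 10)
y <- rbinom(n, 1, plogis(-6 + 1.2 * x))

fit  <- glm(y ~ x, family = binomial)   # model of interest
null <- glm(y ~ 1, family = binomial)   # intercept-only model

# Definition based on log-likelihoods.
r2_loglik <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))

# Equivalent form for 0/1 data, using deviances stored in the fit.
r2_deviance <- 1 - fit$deviance / fit$null.deviance

round(c(r2_loglik, r2_deviance), 3)
```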
f. Now focus on the model in (d). Since the slope is positive and x is between 0 and 10, the largest probability occurs at x=10 and the smallest probability occurs at x=0. We see from the plot that the probability is almost 1 at x=10 and almost 0 at x=0. Do we have students whose answers to the survey questions correspond to these extreme values of x? What are the maximum and minimum values of x in the survey data? Enter your answers to 2 decimal places.
Maximum value of x in the data =
Minimum value of x in the data =
Find the students with the maximum and minimum values of x in the data and then answer the following questions.
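One way to locate these students is with which.max() and which.min(), which return the row positions of the largest and smallest values. The sketch below uses a small made-up data frame; the x values and candidate labels are hypothetical.

```r
# Toy data frame (hypothetical values, for illustration only).
toy <- data.frame(
  x         = c(4.25, 7.92, 1.58, 6.00),
  candidate = c("Unsure", "Donald Trump", "Hillary Clinton", "Gary Johnson")
)

# Row positions of the largest and smallest x.
i_max <- which.max(toy$x)
i_min <- which.min(toy$x)

# Largest and smallest values themselves.
range(toy$x)

# Full rows for those two students.
toy[i_max, ]
toy[i_min, ]
```

Note that which.max() and which.min() return only the first position when there are ties; which(toy$x == max(toy$x)) lists every row that attains the maximum.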
g. What are the minimum-x student's ratings on the following variables?
Which presidential candidate did this student support? (Donald Trump / Gary Johnson / Hillary Clinton / Jill Stein / Other / Unsure)
h. What are the maximum-x student's ratings on the following variables?