
Predicting Donald Trump Supporters

In this problem you will analyze the Stat 100 Survey 3 data from Fall 2016. We are interested in which variables are good predictors of whether a student supported Donald Trump. The data can be downloaded here and then loaded into R using the following command.

survey <- read.csv("stat100_2016fall_survey03.csv")

A description of the column variables can be found on this website.

The column named 'candidate' records students' answer to the question "If you had to vote (because there was a penalty if you didn't) which of the 2016 Presidential candidates would you vote for (assume you're eligible to vote even if you're not)".

Create a new column named 'y'. Set y to 1 for Donald Trump supporters and 0 for students who didn't support Trump.
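A minimal sketch of this step on a toy data frame. The candidate labels used here (such as "Donald Trump") are assumptions; check the actual labels in your data first, for example with table(survey$candidate), and adjust the comparison accordingly.

```r
# Toy stand-in for the survey; the labels below are assumed, not read from the real file.
toy <- data.frame(
  candidate = c("Donald Trump", "Hillary Clinton", "Unsure",
                "Donald Trump", "Gary Johnson")
)

# The comparison gives TRUE/FALSE; as.integer() turns that into 1/0.
toy$y <- as.integer(toy$candidate == "Donald Trump")
print(toy)
```

The same pattern applied to the real data would be survey$y <- as.integer(survey$candidate == "..."), with the label replaced by whatever string the data actually uses.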

Columns 21–31 correspond to students' responses on the importance of various political issues. These could be good predictors of the 'y' variable. After setting the 'y' variable, you can fit a logistic regression predicting the probability that a student is a Trump supporter P(y=1) from their ratings on these issues. You can use the command

fit1 <- glm(y ~ environment + terrorism + economy + racism + policeBrutality + lawAndOrder + genderEquality + borderSecurity + familyValues + bigMoney + gunRights + devil, data=survey, family=binomial)

to fit a logistic regression. In R, regression can also be carried out by putting the predictors in a matrix. For example, the following command does the same logistic regression as the command above.

fit1 <- glm(survey$y ~ as.matrix(survey[,c(21:31,3)]), family=binomial)

Note that we put 'devil' (column 3) in the model above because we find that Trump supporters tend to believe in the devil.
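To see that the formula form and the matrix form give the same fit, here is a small sketch on simulated data. The data frame toy and its columns a and b are made up for illustration and have nothing to do with the survey.

```r
set.seed(1)
n <- 200
# Two made-up 0-10 ratings and a simulated 0/1 response.
toy <- data.frame(a = sample(0:10, n, replace = TRUE),
                  b = sample(0:10, n, replace = TRUE))
toy$y <- rbinom(n, 1, plogis(-2 + 0.3 * toy$a - 0.1 * toy$b))

# Formula interface: predictors named explicitly.
fit_formula <- glm(y ~ a + b, data = toy, family = binomial)

# Matrix interface: predictors selected by column and passed as a matrix.
fit_matrix <- glm(toy$y ~ as.matrix(toy[, c("a", "b")]), family = binomial)

# The estimates agree; only the coefficient names differ.
same_fit <- all.equal(unname(coef(fit_formula)), unname(coef(fit_matrix)))
print(same_fit)
```

The matrix form is convenient when the predictors are a block of adjacent columns, but the coefficient names in its summary are longer and harder to read.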

a. Type summary(fit1) to see the coefficients. These are coefficients on the log-odds scale, ln(odds). Fill in the missing items in the following table. Enter your answers to 3 decimal places.

ln (odds)	Slopes	SE	Z	p-value
Intercept	-1.893	0.45078	-4.1999	2.6699e-05
environment	-0.131	0.040737	-3.2162	0.001299
terrorism	0.203	0.053094	3.8252	0.00013069
economy		0.047816	-4.8325	1.3483e-06
racism		0.053088	-2.5765	0.0099796
policeBrutality		0.046973	-1.3838	0.16641
lawAndOrder		0.048958	2.6924	0.0070941
genderEquality		0.041582	-2.6415	0.0082542
borderSecurity		0.041995	4.7461	2.0741e-06
familyValues		0.038003	2.0671	0.038727
bigMoney		0.042703	-1.979	0.04782
gunRights		0.034641	4.4333	9.2783e-06
devil		0.027126	3.1591	0.0015824

Tries 0/5
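The columns of the coefficient table are related: Z is the slope divided by its SE, and the p-value is the two-sided normal tail area beyond |Z|. The sketch below checks this on simulated data (toy and the predictor a are made up), using coef(summary(...)) to pull the table out as a matrix.

```r
set.seed(6)
n <- 200
toy <- data.frame(a = sample(0:10, n, replace = TRUE))
toy$y <- rbinom(n, 1, plogis(-2 + 0.3 * toy$a))

fit <- glm(y ~ a, data = toy, family = binomial)

# Coefficient table as a numeric matrix: Estimate, Std. Error, z value, Pr(>|z|).
tab <- coef(summary(fit))
print(round(tab, 3))

# Recompute Z and the p-value by hand from the first two columns.
z_manual <- tab[, "Estimate"] / tab[, "Std. Error"]
p_manual <- 2 * pnorm(-abs(z_manual))

print(all.equal(z_manual, tab[, "z value"]))
print(all.equal(p_manual, tab[, "Pr(>|z|)"]))
```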

b. We see from the p-values that all slopes, except 'policeBrutality', are significant at the 0.05 level. A likely reason why 'policeBrutality' is not significant is that it is highly correlated with the variable 'racism'. Calculate the correlation between 'racism' and 'policeBrutality'. Enter your answer to 3 decimal places.

Correlation =

Tries 0/3
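As a sketch of how a correlation is computed in R, the example below builds two made-up ratings that track each other and calls cor(). The simulated values are illustrative only and will not match the survey.

```r
set.seed(2)
n <- 300
racism <- sample(0:10, n, replace = TRUE)
# A second rating equal to the first plus a small shift, clipped to the 0-10 scale.
policeBrutality <- pmin(10, pmax(0, racism + sample(-2:2, n, replace = TRUE)))
toy <- data.frame(racism, policeBrutality)

# Pearson correlation between the two columns.
r <- cor(toy$racism, toy$policeBrutality)
print(round(r, 3))
```

If the real columns contain missing values, cor() returns NA by default; passing use = "complete.obs" restricts the calculation to rows where both ratings are present.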

c. Knowing that 'policeBrutality' is highly correlated with 'racism', we can remove 'racism' from the model and see if 'policeBrutality' becomes significant. Type the following command:

fit2 <- glm(survey$y ~ as.matrix(survey[,c(21:23,25:31,3)]), family=binomial)

Type summary(fit2) and look at the p-value associated with 'policeBrutality'. Is the slope for 'policeBrutality' significant now?
no
yes

Tries 0/1

Go back to the model in (a). It has 12 variables, and we know we can get rid of at least one variable without affecting the accuracy significantly. The question is whether we can do better. Can we combine these variables and construct a simpler model?

We see that some slopes are negative and some are positive. A negative slope means that the probability P(y=1) decreases when the value of the variable increases, keeping other variables fixed; a positive slope means that P(y=1) increases when the value of the variable increases, keeping other variables fixed. All the variables represent ratings ranging from 0 to 10. We can construct a new set of variables (x1, x2, ..., x12) as follows.

x1 <- 10 - survey$environment
x2 <- survey$terrorism
x3 <- 10 - survey$economy
x4 <- 10 - survey$racism
x5 <- 10 - survey$policeBrutality
x6 <- survey$lawAndOrder
x7 <- 10 - survey$genderEquality
x8 <- survey$borderSecurity
x9 <- survey$familyValues
x10 <- 10 - survey$bigMoney
x11 <- survey$gunRights
x12 <- survey$devil

Copy and paste the code into your R console. These new variables reverse the direction of the ratings for the variables with negative slopes. For example, x1 is set to 10 - environment, so x1=0 means Environmental Protection is extremely important to the student and x1=10 means it is not at all important. If we now fit a logistic regression predicting P(y=1) from x1, x2, ..., x12, all the slopes are positive: the larger the value of any one of the new variables, the more likely the student is a Trump supporter. The new model is nevertheless equivalent to the old one; it gives the same predicted probabilities. Now we construct a new variable x, the average of x1, x2, ..., x12. We create a new column in the data frame using the following command:

survey$x <- (x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12)/12

This x variable can range from 0 to 10, and it combines all the ratings relevant to whether or not a student is a Trump supporter.
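The claim that reversing a rating flips the sign of its slope without changing the model can be checked on simulated data. In this sketch the data frame toy and the ratings a and b are made up; a plays the role of a negative-slope variable.

```r
set.seed(3)
n <- 200
toy <- data.frame(a = sample(0:10, n, replace = TRUE),
                  b = sample(0:10, n, replace = TRUE))
toy$y <- rbinom(n, 1, plogis(-1 - 0.3 * toy$a + 0.3 * toy$b))

fit_old <- glm(y ~ a + b, data = toy, family = binomial)

# Reverse the 0-10 scale of a, as done for x1, x3, x4, ... above.
toy$a_rev <- 10 - toy$a
fit_new <- glm(y ~ a_rev + b, data = toy, family = binomial)

print(coef(fit_old))
print(coef(fit_new))

# The slope on the reversed rating is the old slope with its sign flipped ...
slope_flipped <- all.equal(unname(coef(fit_old)["a"]),
                           -unname(coef(fit_new)["a_rev"]),
                           tolerance = 1e-6)
# ... and the predicted probabilities are the same for every row.
same_probs <- all.equal(fitted(fit_old), fitted(fit_new), tolerance = 1e-6)
print(c(slope_flipped, same_probs))
```

The intercept does change (it absorbs 10 times the old slope), which is why the two coefficient tables look different even though the fitted probabilities agree. Averaging the reversed ratings into a single x, by contrast, does change the model, because it forces all twelve ratings to share one slope.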

d. Fit a logistic regression predicting y from x using the command fit3 <- glm(y ~ x, data=survey, family=binomial). Type summary(fit3) to look at the coefficients. Fill in the following table. Enter your answers to 3 decimal places.

ln (odds)	Slopes	SE	Z	p-value
Intercept		0.57	-15.57	<2e-16
x		0.1119	13.85	<2e-16

Tries 0/5

The predicted probability of the model in (d) for each student is stored in the variable 'fit3$fitted.values'. You can make a plot of the predicted probability versus x by the command

plot(fit3$fitted.values ~ x, col=y+1, data=survey, pch="*")

You should see the predicted probabilities trace out a nice S-shaped curve. The command 'col=y+1' above tells R to plot the Trump supporters (y=1) in red and non-Trump supporters (y=0) in black. In R's base graphics, col=1 represents the black color and col=2 represents the red color. That's why we add 1 to the y variable. You can see that the Trump supporters tend to have larger values of x, as expected.
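The S-shape comes from the logistic function: the fitted probability is 1 / (1 + exp(-(b0 + b1*x))), where b0 and b1 are the intercept and slope of the ln(odds). The sketch below, on simulated data with made-up coefficients, recomputes the fitted values by hand and compares them with what glm stores.

```r
set.seed(4)
n <- 200
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- rbinom(n, 1, plogis(-6 + 1.2 * toy$x))

fit <- glm(y ~ x, data = toy, family = binomial)
b <- unname(coef(fit))   # b[1] = intercept, b[2] = slope on the ln(odds) scale

# Convert ln(odds) to probability with the logistic function.
log_odds <- b[1] + b[2] * toy$x
p_manual <- 1 / (1 + exp(-log_odds))

matches <- all.equal(p_manual, unname(fit$fitted.values))
print(matches)
```

R's built-in plogis(log_odds) computes the same transformation.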

e. How does the one-variable model in (d) compare to the 12-variable model in (a)? One way to measure the "goodness of fit" is to calculate McFadden's R². Enter the values of McFadden's R² for models in (a) and (d) to three decimal places.

McFadden's R² for model in (a) =

McFadden's R² for model in (d) =

Tries 0/5
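McFadden's R² is commonly defined as 1 minus the ratio of the model's log-likelihood to the log-likelihood of the intercept-only model. For a 0/1 response fitted with glm, this equals 1 - deviance / null.deviance. The sketch below computes it both ways on simulated data; toy and its coefficients are made up.

```r
set.seed(5)
n <- 300
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- rbinom(n, 1, plogis(-5 + 1.0 * toy$x))

fit  <- glm(y ~ x, data = toy, family = binomial)
null <- glm(y ~ 1, data = toy, family = binomial)   # intercept-only model

# Definition via log-likelihoods.
r2_loglik <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))

# Shortcut via the deviances stored in the fitted object (0/1 responses only).
r2_deviance <- 1 - fit$deviance / fit$null.deviance

print(round(c(r2_loglik, r2_deviance), 3))
```

The deviance shortcut is handy because both numbers are already printed at the bottom of summary(fit) as "Null deviance" and "Residual deviance".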

f. Now focus on the model in (d). Since the slope is positive and x is between 0 and 10, the largest probability occurs at x=10 and the smallest at x=0. We also see from the plot that the probability is almost 1 at x=10 and almost 0 at x=0. Do we have students whose answers to the survey questions correspond to these extreme values of x? What are the maximum and minimum values of x in the survey data? Enter your answers to 2 decimal places.

Maximum value of x in the data =

Minimum value of x in the data =

Tries 0/3

Find the students with the maximum and minimum values of x in the data and then answer the following questions.
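One way to locate these students is with which.min() and which.max(), which return the row index of the smallest and largest value. The sketch below uses a tiny made-up data frame; the candidate labels "A" to "D" are placeholders.

```r
# Toy data: four students with placeholder candidate labels and made-up x values.
toy <- data.frame(
  candidate = c("A", "B", "C", "D"),
  x = c(4.25, 1.50, 7.75, 3.00)
)

i_min <- which.min(toy$x)   # row index of the smallest x
i_max <- which.max(toy$x)   # row index of the largest x

print(range(toy$x))         # minimum and maximum together
print(toy[i_min, ])         # full row for the minimum-x student
print(toy[i_max, ])         # full row for the maximum-x student
```

which.min() and which.max() return only the first match. If several students might tie, which(toy$x == min(toy$x)) lists all of them.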

g. What are the minimum-x student's ratings on the following variables?

variable	rating
environment
terrorism
economy
racism
policeBrutality
lawAndOrder
genderEquality
borderSecurity
familyValues
bigMoney
gunRights
devil

Tries 0/5

Which presidential candidate did this student support?
Donald Trump
Gary Johnson
Hillary Clinton
Jill Stein
Other
Unsure

Tries 0/2

h. What are the maximum-x student's ratings on the following variables?

variable	rating
environment
terrorism
economy
racism
policeBrutality
lawAndOrder
genderEquality
borderSecurity
familyValues
bigMoney
gunRights
devil

Tries 0/5