Homework 2: Exploratory Data Analysis

Exploratory Data Analysis on Voter Turnout and Ethnic Data and Gelman Chapter 10

Isha Akshita Mahajan
2022-04-03

Load Required Packages

library(tidyverse)

Question 1

Fraga analyzes turnout data for four different racial and ethnic groups, but for this analysis we will focus on the data for black voters. Load blackturnout.csv. Which years are included in the dataset? How many different states are included in the dataset?

voter_data <- read_csv("blackturnout.csv")
head(voter_data)
# A tibble: 6 × 7
   ...1  year state district turnout   CVAP candidate
  <dbl> <dbl> <chr>    <dbl>   <dbl>  <dbl>     <dbl>
1     1  2008 AK           0   0.710 0.0350         0
2     2  2010 AK           0   0.448 0.0323         0
3     3  2010 AK           1   0.448 0.0323         0
4     4  2008 AK           1   0.710 0.0350         0
5     5  2006 AK           1   0.439 0.0318         0
6     6  2010 AL           0   0.397 0.256          0
years <- voter_data %>% 
  count(year)

states <- voter_data %>% 
  count(state)

The dataset covers the 2006, 2008, and 2010 elections and includes voter turnout information for 42 states.

Question 2:

Create a boxplot that compares turnout in elections with and without a co-ethnic candidate. Be sure to use informative labels. Interpret the resulting graph.

library(ggplot2)
library(ggthemes)

boxplot <- voter_data %>% 
  select(candidate, turnout) %>% 
  mutate(candidate = recode(candidate, `1` = "Black Candidate", `0` = "No Black Candidate")) %>% 
  ggplot(aes(x = candidate, y = turnout, group = candidate, fill = candidate)) +
  geom_boxplot(show.legend = TRUE) +
  labs(x = "Candidate Co-ethnicity", y = "Black Voter Turnout",
       title = "Elections With & Without Black Candidates",
       subtitle = "Boxplot for elections with and without the presence of black candidates",
       caption = "Graphic: Isha Akshita Mahajan / Student, UMass Amherst\nSource: Fraga") +
  theme_minimal() +
  theme(legend.position = "top", legend.box = "horizontal") +
  theme(text = element_text(size = 12),
        plot.title = element_text(size = rel(1.5)))
boxplot

This boxplot compares turnout in elections with and without co-ethnic candidates. Median black voter turnout is lower in elections without a co-ethnic candidate and higher in elections where one is present. This suggests a correlation, but the boxplot cannot establish causality on its own.

Question 3:

Run a linear regression with black turnout as your outcome variable and candidate co-ethnicity as your predictor. Report the coefficient on your predictor and the intercept. Interpret these coefficients. Do not merely comment on the direction of the association (i.e., whether the slope is positive or negative). Explain what the value of the coefficients mean in terms of the units in which each variable is measured. Based on these coefficients, what would you conclude about black voter turnout and co-ethnic candidates?

regression_3 <- lm(turnout ~ candidate, data = voter_data)
summary(regression_3)

Call:
lm(formula = turnout ~ candidate, data = voter_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3164 -0.1282 -0.0436  0.1191  0.5832 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.393857   0.005183  75.993  < 2e-16 ***
candidate   0.061640   0.014984   4.114 4.15e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.171 on 1235 degrees of freedom
Multiple R-squared:  0.01352,   Adjusted R-squared:  0.01272 
F-statistic: 16.92 on 1 and 1235 DF,  p-value: 4.149e-05
#plot(turnout~candidate, data = voter_data)
#abline(regression_3)

This model summarizes the difference in average black voter turnout between elections that have black candidates (candidate = 1) and elections that don’t (candidate = 0). The intercept, 0.39, is the predicted proportion of a district’s black voting-age population that votes in a general election when the election does not include a co-ethnic/black candidate. For elections that do include a co-ethnic/black candidate, we add the candidate coefficient: 0.39 + 0.06 = 0.45, the predicted turnout in elections with co-ethnic candidates. In other words, turnout in elections with a black candidate is, on average, 0.06 units (6 percentage points of the black voting-age population) higher than in elections without one.
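As a sanity check on this interpretation, here is a minimal sketch with simulated data (not the blackturnout file): with a single binary predictor, the intercept from lm() equals the mean outcome in the 0 group, and the slope equals the difference between the two group means.

```r
# Simulated data mimicking the structure of the model above
set.seed(1)
candidate <- rep(c(0, 1), each = 50)                    # binary predictor
turnout <- 0.39 + 0.06 * candidate + rnorm(100, 0, 0.1) # outcome

fit <- lm(turnout ~ candidate)
group_means <- tapply(turnout, candidate, mean)

coef(fit)[["(Intercept)"]]  # matches the mean turnout when candidate == 0
coef(fit)[["candidate"]]    # matches the difference in group means
```

This is why the intercept plus the slope reproduces the group-1 mean, exactly as in the interpretation above.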

Question 4:

You decide to investigate the results of the previous question a bit more carefully because the elections with co-ethnic candidates may differ from the elections without co-ethnic candidates in other ways. Create a scatter plot where the x-axis is the proportion of co-ethnic voting-age population and the y-axis is black voter turnout. Color your points according to candidate co-ethnicity. That is, make the points for elections featuring co-ethnic candidates one color, and make the points for elections featuring no co-ethnic candidates a different color. Interpret the graph.

scatterplot <- voter_data %>% 
  select(candidate, turnout, CVAP) %>%
  mutate(candidate = recode(candidate, `1` = "Black Candidate", `0` = "No Black Candidate")) %>%  
  ggplot(aes(x = CVAP, y = turnout, colour = candidate, fill = candidate)) +
  geom_point() +
  labs(x = "Proportion of Eligible Voters Who Are Black", y = "Black Voter Turnout",
       title = "Elections With and Without Black Candidates",
       subtitle = "When there are no black candidates running, the proportion of eligible voters who show up\nto vote is lower than when the election includes a co-ethnic candidate",
       caption = "Graphic: Isha Akshita Mahajan / Student, UMass Amherst\nSource: Fraga") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  theme(text = element_text(size = 12),
        plot.subtitle = element_text(size = 10))
scatterplot

This scatterplot shows that when no black candidates are running, the proportion of eligible voters who show up to vote is lower than when the election includes a co-ethnic candidate. This is an additional step toward analyzing black voter turnout, but we still cannot make strong causal claims.

One pattern worth noting is the clustering at low CVAP and low turnout: most districts have a small black share of the voting-age population, and elections featuring black candidates appear mostly in districts with a larger black CVAP. This suggests candidate co-ethnicity and CVAP are entangled, which motivates adjusting for CVAP in the next question.

Question 5:

Run a linear regression with black turnout as your outcome variable and with candidate co-ethnicity and co-ethnic voting-age population as your predictors. Report the coefficients, including the intercept. Interpret the coefficients on the two predictors, ignoring the intercept for now (you will interpret the intercept in the next question). Explain what each coefficient represents in terms of the units of the relevant variables.

regression_5 <- lm(turnout ~ candidate + CVAP, data = voter_data)
summary(regression_5)

Call:
lm(formula = turnout ~ candidate + CVAP, data = voter_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.30534 -0.12775 -0.04529  0.11750  0.59576 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.375275   0.006677  56.203  < 2e-16 ***
candidate   -0.007364   0.021703  -0.339    0.734    
CVAP         0.207392   0.047497   4.366 1.37e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1698 on 1234 degrees of freedom
Multiple R-squared:  0.02853,   Adjusted R-squared:  0.02695 
F-statistic: 18.12 on 2 and 1234 DF,  p-value: 1.756e-08
#plot(voter_data$candidate, voter_data$CVAP)

For every one-unit increase in the proportion of a district’s voting-age population that is black (CVAP), black voter turnout increases by 0.21 units on average, holding candidate co-ethnicity constant. The coefficient on candidate is now negative and statistically insignificant (no stars), so it carries little weight in this model. In the earlier regression, with candidate as the only predictor, co-ethnic candidates were associated with a 0.06-unit increase in turnout; once CVAP is added, that association essentially disappears.
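To see what these estimates imply in practice, a small sketch that plugs the reported (rounded) coefficients into the fitted equation turnout = 0.375 − 0.007·candidate + 0.207·CVAP; the variable names here are illustrative:

```r
# Rounded coefficients from the regression_5 summary above
b0 <- 0.375; b_candidate <- -0.007; b_cvap <- 0.207

predict_turnout <- function(candidate, cvap) {
  b0 + b_candidate * candidate + b_cvap * cvap
}

predict_turnout(candidate = 0, cvap = 0.10)  # no co-ethnic candidate, 10% black CVAP
predict_turnout(candidate = 1, cvap = 0.10)  # co-ethnic candidate, same CVAP
```

At the same CVAP, the two predictions differ only by the tiny candidate coefficient, which is the point of the interpretation above.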

Question 6:

Now interpret the intercept from the regression model with two predictors. Is this intercept a substantively important or interesting quantity? Why or why not?

The intercept, 0.37, is the predicted turnout in an election with no co-ethnic candidate in a district where the black share of the voting-age population is zero. It is an interesting quantity for two reasons. First, the regression results show it is statistically significant, so it remains an integral part of the model. Second, it is only 0.02 units lower than the intercept in the one-predictor model, which suggests the baseline turnout estimate is stable regardless of which predictors are included.

Question 7:

Based on the regression model with two predictors, what do you conclude about the relationship between co-ethnic candidates and black voter turnout? Ignore issues of statistical significance.

Setting significance aside, the coefficient on candidate is -0.007, which is minuscule: once we adjust for the co-ethnic share of the voting-age population, the presence of a co-ethnic candidate is associated with essentially no change in black voter turnout. This is consistent with a null hypothesis of no relationship between candidate co-ethnicity and turnout.

Questions From RaOS

10.2 Regression with Interactions

  1. Write the equation of the estimated regression line of y on x for the treatment group, and the equation of the estimated regression line of y on x for the control group.

  2. Graph with pen on paper the two regression lines, assuming the values of x fall in the range (0, 10). On this graph also include a scatterplot of data (using open circles for treated units and dots for controls) that are consistent with the fitted model.

Part A

The fitted model is y = 1.2 + 1.6x + 2.7z + 0.7(x:z) with residual standard deviation sigma = 0.5, where x is the pre-treatment predictor and z is the treatment indicator: the coefficient on x is 1.6, on z is 2.7, and on the interaction x:z is 0.7.

When z=0 (control): y = a+bx+cz+d(x:z) = 1.2+1.6x+2.7(0)+0.7(x)(0) = 1.2+1.6x

When z=1 (treatment): y = 1.2+1.6x+2.7(1)+0.7(x)(1) = 3.9+2.3x

The interaction coefficient means the slope of y on x is 0.7 units larger in the treatment group than in the control group: 2.3 versus 1.6.
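The two slopes can be checked directly by writing the fitted surface as a function and taking a unit step in x within each group (`y_hat` is an illustrative name, not from the text):

```r
# Fitted model from Part A: y = 1.2 + 1.6x + 2.7z + 0.7xz
y_hat <- function(x, z) 1.2 + 1.6 * x + 2.7 * z + 0.7 * x * z

y_hat(1, 0) - y_hat(0, 0)  # slope in the control group: 1.6
y_hat(1, 1) - y_hat(0, 1)  # slope in the treatment group: 1.6 + 0.7 = 2.3
```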

PART B

x <- runif(100, 0, 10)
z <- sample(c(0, 1), 100, replace = TRUE)   # treatment indicator
error <- rnorm(100, 0, 0.5)                 # residual sd = 0.5
y <- 1.2 + 1.6*x + 2.7*z + 0.7*x*z + error  # note x*z, not x:z (`:` builds a sequence in R)

fakedata <- data.frame(x, y, z)

plot_data <- fakedata %>% 
  mutate(z = recode(z, `1` = "Treatment", `0` = "Control")) %>% 
  ggplot(aes(x, y)) +
  geom_point(aes(colour = z, fill = z)) +
  geom_abline(slope = 2.3, intercept = 3.9) +
  geom_abline(slope = 1.6, intercept = 1.2) +
  xlim(0, 10) +
  ylim(0, 30) +
  labs(x = "X", y = "Y", title = "Regression with Interactions",
       subtitle = "The slope of y on x is 0.7 units larger in the treatment group (2.3) than in the control group (1.6)",
       caption = "Graphic: Isha Akshita Mahajan / Student, UMass Amherst\nSource: Fake Data") +
  theme_minimal() +
  theme(legend.position = "bottom") +
  theme(text = element_text(size = 10),
        plot.title = element_text(size = rel(1.5)))

plot_data

Question 10.3

var1 <- rnorm(1000, 0, 1)  
var2 <- rnorm(1000, 0, 1)  
fake <- data.frame(var1, var2)  
fit_lm <- lm(var2 ~ var1, data=fake)
summary(fit_lm)

Call:
lm(formula = var2 ~ var1, data = fake)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1708 -0.6723  0.0279  0.5928  3.4590 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.01832    0.03222   0.569    0.570
var1         0.02115    0.03245   0.652    0.515

Residual standard error: 1.019 on 998 degrees of freedom
Multiple R-squared:  0.0004253, Adjusted R-squared:  -0.0005763 
F-statistic: 0.4247 on 1 and 998 DF,  p-value: 0.5148
var3 <- rnorm(1000, 0, 1)  
var4 <- rnorm(1000, 0, 1)  
fake_1 <- data.frame(var3, var4)  
fit_lm_1 <- lm(var4 ~ var3, data=fake_1)
summary(fit_lm_1)

Call:
lm(formula = var4 ~ var3, data = fake_1)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2702 -0.6763 -0.0161  0.6761  2.8278 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.01769    0.03142  -0.563    0.574
var3         0.03460    0.03185   1.086    0.278

Residual standard error: 0.9934 on 998 degrees of freedom
Multiple R-squared:  0.001181,  Adjusted R-squared:  0.0001804 
F-statistic:  1.18 on 1 and 998 DF,  p-value: 0.2776
var5 <- rnorm(1000, 0, 1)  
var6 <- rnorm(1000, 0, 1)  
fake_2 <- data.frame(var5, var6)  
fit_lm_2 <- lm(var6 ~ var5, data = fake_2)
summary(fit_lm_2)
summary(fit_lm_2)

Call:
lm(formula = var6 ~ var5, data = fake_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5125 -0.7104  0.0315  0.7026  3.7598 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.034537   0.031892   1.083    0.279
var5        -0.005155   0.031275  -0.165    0.869

Residual standard error: 1.008 on 998 degrees of freedom
Multiple R-squared:  2.722e-05, Adjusted R-squared:  -0.0009748 
F-statistic: 0.02717 on 1 and 998 DF,  p-value: 0.8691
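None of the three regressions above finds a significant slope, which is what we expect when both variables are pure noise. A sketch of the general point: repeating the simulation many times, the slope comes out "significant" at the 5% level about 5% of the time by construction.

```r
set.seed(42)
# For each replicate, regress one standard-normal noise variable on another
# and record the p-value on the slope coefficient.
p_values <- replicate(2000, {
  x <- rnorm(100)
  y <- rnorm(100)
  summary(lm(y ~ x))$coefficients[2, 4]
})

mean(p_values < 0.05)  # close to 0.05: the nominal false-positive rate
```

So an occasional starred coefficient in simulations like these is exactly the false-positive rate the significance threshold promises.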

Question 10.6

Regression models with interactions: The folder Beauty contains data (use file beauty.csv) on beauty and teaching evaluations from Hamermesh and Parker (2005): student evaluations of instructors’ beauty and teaching quality for several courses at the University of Texas. The teaching evaluations were conducted at the end of the semester, and the beauty judgments were made later, by six students who had not attended the classes and were not aware of the course evaluations.

PART A

  1. Run a regression using beauty (the variable beauty) to predict course evaluations (eval), adjusting for various other predictors. Graph the data and fitted model, and explain the meaning of each of the coefficients along with the residual standard deviation. Plot the residuals versus fitted values.
beauty <- read.csv("beauty.csv")
fit_beauty <- lm(eval ~ beauty, data = beauty)
summary(fit_beauty)

Call:
lm(formula = eval ~ beauty, data = beauty)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.80015 -0.36304  0.07254  0.40207  1.10373 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.01002    0.02551 157.205  < 2e-16 ***
beauty       0.13300    0.03218   4.133 4.25e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5455 on 461 degrees of freedom
Multiple R-squared:  0.03574,   Adjusted R-squared:  0.03364 
F-statistic: 17.08 on 1 and 461 DF,  p-value: 4.247e-05
plot(eval ~ beauty, data = beauty)            # data
abline(fit_beauty)                            # fitted line
plot(fitted(fit_beauty), resid(fit_beauty),   # residuals vs fitted values
     xlab = "Fitted values", ylab = "Residuals")

The regression above shows that for every one-unit increase in beauty, evaluations increase by 0.13 units on average. Both the intercept and the beauty coefficient are statistically significant. The residual standard deviation of 0.55 means the model's predictions typically miss the observed evaluation by about half a point, so beauty explains only a small share of the variation in evaluations (R² ≈ 0.04).

PART B

  1. Fit some other models, including beauty and also other predictors. Consider at least one model with interactions. For each model, explain the meaning of each of its estimated coefficients.
fit_1 <- lm(eval ~ beauty + female, data = beauty)
summary(fit_1)

Call:
lm(formula = eval ~ beauty + female, data = beauty)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.87196 -0.36913  0.03493  0.39919  1.03237 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.09471    0.03328  123.03  < 2e-16 ***
beauty       0.14859    0.03195    4.65 4.34e-06 ***
female      -0.19781    0.05098   -3.88  0.00012 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5373 on 460 degrees of freedom
Multiple R-squared:  0.0663,    Adjusted R-squared:  0.06224 
F-statistic: 16.33 on 2 and 460 DF,  p-value: 1.407e-07

Holding beauty constant, the coefficient on female (-0.20) means that female instructors receive evaluations about 0.20 points lower than male instructors, on average. The beauty coefficient, 0.15, is the average increase in evaluation per one-unit increase in beauty, holding gender constant. The intercept, 4.09, is the predicted evaluation for a male instructor with beauty = 0.
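A quick sketch of these predictions, using the rounded coefficients from the summary above (`predict_eval` is an illustrative name):

```r
# Rounded coefficients from fit_1: eval ~ beauty + female
b0 <- 4.095; b_beauty <- 0.149; b_female <- -0.198

predict_eval <- function(beauty, female) {
  b0 + b_beauty * beauty + b_female * female
}

predict_eval(beauty = 0, female = 0)  # male instructor at beauty = 0
predict_eval(beauty = 0, female = 1)  # female instructor at beauty = 0
```

At the same beauty score, the two predictions differ by exactly the female coefficient, about a fifth of a point on the evaluation scale.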

fit_2 <- lm(beauty ~ eval + female + age + female:age, data = beauty)
summary(fit_2)

Call:
lm(formula = beauty ~ eval + female + age + female:age, data = beauty)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.74095 -0.53825 -0.09866  0.47084  1.89433 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.044655   0.353136  -0.126    0.899    
eval         0.272099   0.063346   4.295 2.13e-05 ***
female      -0.279176   0.370222  -0.754    0.451    
age         -0.024344   0.004535  -5.368 1.26e-07 ***
female:age   0.008601   0.007735   1.112    0.267    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7402 on 458 degrees of freedom
Multiple R-squared:  0.1267,    Adjusted R-squared:  0.1191 
F-statistic: 16.62 on 4 and 458 DF,  p-value: 1e-12

In this model beauty is the outcome. Holding the other predictors constant, a one-point higher evaluation is associated with a 0.27-unit higher beauty score, and each additional year of age is associated with a 0.02-unit lower beauty score. The female and female:age coefficients are not statistically significant, so there is no clear evidence that the age-beauty relationship differs by gender.