Homework 1: Function Writing, Variation

Gelman Chapter 6

Isha Akshita Mahajan (UMass Amherst)
2022-04-03

Setup Code:

#load packages: tidyverse, rstanarm, and kableExtra
library(tidyverse)
library(rstanarm)
library(kableExtra)
#load the earnings dataset from the desired file path
earnings <- read.csv("earnings.csv")
kable(head(earnings))
height  weight  male  earn   earnk  ethnicity  education  mother_education  father_education  walk  exercise  smokenow  tense  angry  age
74      210     1     50000  50     White      16         16                16                3     3         2         0      0      45
66      125     0     60000  60     White      16         16                16                6     5         1         0      0      58
64      126     0     30000  30     White      16         16                16                8     1         2         1      1      29
65      200     0     25000  25     White      17         17                NA                8     1         2         0      0      57
63      110     0     50000  50     Other      16         16                16                5     6         2         0      0      91
68      165     0     62000  62     Black      18         18                18                1     1         2         2      2      54

6.2 Programming Fake-Data Simulation: Write an R function to:

  1. simulate n data points from the model y = a + bx + error, with the x values uniformly sampled from the range (0, 100) and the errors drawn independently from the normal distribution with mean 0 and standard deviation σ;

  2. fit a linear regression to the simulated data;

  3. make a scatterplot of the data and fitted regression line.

set.seed(3)
#create a function that simulates fake data, fits a regression, and plots the result
fake_data <- function(a, b, sigma, n) {
  #sample n predictor values uniformly from (0, 100)
  x <- runif(n, 0, 100)
  #simulate the outcome from the simple regression model
  y <- a + b*x + sigma*rnorm(n)
  random <- data.frame(x, y)
  #use stan_glm() to fit the model
  fitted_random <- stan_glm(y ~ x, data = random)
  #use print() to display the results concisely
  print(fitted_random, digits = 3)
  plot(random$x, random$y, main = "Data generated and fitted regression line")
  a_hat <- coef(fitted_random)[1]
  b_hat <- coef(fitted_random)[2]
  abline(a_hat, b_hat)
  x_bar <- mean(random$x)
  text(x_bar, a_hat + b_hat*x_bar,
       paste("y =", round(a_hat, 2), "+", round(b_hat, 2), "* x"), adj = 0)
}
set.seed(3)
fake_data(0.4,0.2,0.5,100)

6.3 Variation, uncertainty, and sample size: Repeat the example in Section 6.2, varying the number of data points, n. What happens to the parameter estimates and uncertainties when you increase the number of observations?

set.seed(3)
fake_data(0.4,0.2,0.5,175)

set.seed(3)
fake_data(0.4,0.2,0.5,215)

set.seed(3)
fake_data(0.4,0.2,0.5,275)

As I ran each code chunk with a larger value of n, the MAD_SD values (the reported uncertainty in each parameter estimate) decreased, while the estimates themselves stayed close to the true values. A larger sample size therefore yields more precise parameter estimates.
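The shrinking uncertainty can also be seen directly. Here is a quick sketch of the same simulation, using lm() instead of stan_glm() purely for speed (the pattern is the same; the helper se_slope() is mine, not from the text):

```r
#sketch: how the slope's standard error shrinks as n grows
set.seed(3)
se_slope <- function(n, a = 0.4, b = 0.2, sigma = 0.5) {
  x <- runif(n, 0, 100)
  y <- a + b*x + sigma*rnorm(n)
  #extract the standard error of the slope estimate from the lm summary
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}
sapply(c(100, 400, 1600), se_slope)
#each fourfold increase in n roughly halves the standard error
```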

6.5 Regression prediction and averages: The heights and earnings data in Section 6.3 are in the folder Earnings. Download the data and compute the average height for men and women in the sample.

  (b) Assuming 52% of adults are women, estimate the average earnings of adults in the population.
  (c) Directly from the sample data, compute the average earnings of men, women, and everyone. Compare these to the values calculated in parts (a) and (b).
#subset with base R: dataframe$column[condition]
#compute the average height for men and for women
men_mean <- mean(earnings$height[earnings$male == 1])
women_mean <- mean(earnings$height[earnings$male == 0])

The average height is 70.089 inches for men and 64.490 inches for women.

  (a) Use these averages and the fitted regression model displayed on page 84 to get a model-based estimate of the average earnings of men and of women in the population.

set.seed(3)
#use stan_glm() to fit earnings on height and sex
fitted_regression <- stan_glm(earn ~ height + male, data = earnings)

#use print to display results concisely 
print(fitted_regression, digits=2)
stan_glm
 family:       gaussian [identity]
 formula:      earn ~ height + male
 observations: 1816
 predictors:   3
------
            Median    MAD_SD   
(Intercept) -25748.75  11944.66
height         646.07    186.17
male         10606.51   1471.54

Auxiliary parameter(s):
      Median   MAD_SD  
sigma 21406.93   358.07

------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
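Plugging the sex-specific average heights into the fitted model answers parts (a) and (b). A minimal sketch, using the posterior medians printed above (MCMC estimates vary slightly from run to run, so these numbers are approximate):

```r
#posterior medians from the printed output: intercept, height, male
b <- c(-25748.75, 646.07, 10606.51)
men_height   <- 70.089   #average height of men in the sample
women_height <- 64.490   #average height of women in the sample
#(a) model-based average earnings for men and for women
men_pred   <- b[1] + b[2]*men_height + b[3]
women_pred <- b[1] + b[2]*women_height
#(b) population average, assuming 52% of adults are women
adult_pred <- 0.52*women_pred + 0.48*men_pred
round(c(men = men_pred, women = women_pred, adults = adult_pred))
```

Part (c) can then be checked directly against the sample with mean(earnings$earn[earnings$male == 1]), mean(earnings$earn[earnings$male == 0]), and mean(earnings$earn).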