Gelman Chapter 6
#6.2 Programming fake-data simulation: Simulate n data points from the model y = a + b*x + error, with the x values uniformly sampled from the range (0, 100) and the errors drawn independently from a normal distribution with mean 0 and standard deviation sigma. Fit a linear regression to the simulated data, and make a scatterplot of the data and fitted regression line.
library(rstanarm)  # provides stan_glm()

set.seed(3)
# Simulate n data points from y = a + b*x + error, fit a regression,
# and plot the data with the fitted line
fake_data <- function(a, b, sigma, n) {
  # Sample the predictor uniformly from the range (0, 100)
  x <- runif(n, 0, 100)
  # Generate the outcome from the linear model with normal errors
  y <- a + b*x + sigma*rnorm(n)
  random <- data.frame(x, y)
  # Fit the regression with stan_glm()
  fitted_random <- stan_glm(y ~ x, data = random)
  # Display the results concisely
  print(fitted_random, digits = 3)
  # Scatterplot of the data with the fitted line overlaid
  plot(random$x, random$y, main = "Data generated and fitted regression line")
  a_hat <- coef(fitted_random)[1]
  b_hat <- coef(fitted_random)[2]
  abline(a_hat, b_hat)
  # Label the fitted line with its equation at the mean of x
  x_bar <- mean(random$x)
  text(x_bar, a_hat + b_hat*x_bar,
       paste("y =", round(a_hat, 2), "+", round(b_hat, 2), "* x"), adj = 0)
}
set.seed(3)
fake_data(0.4,0.2,0.5,100)
#6.3 Variation, uncertainty, and sample size: Repeat the example in Section 6.2, varying the number of data points, n. What happens to the parameter estimates and uncertainties when you increase the number of observations?
set.seed(3)
fake_data(0.4,0.2,0.5,175)
set.seed(3)
fake_data(0.4,0.2,0.5,215)
set.seed(3)
fake_data(0.4,0.2,0.5,275)
As I ran each code chunk with a larger n, the MAD_SD values (the uncertainty in the coefficient estimates) decreased, while the point estimates stayed close to the true values a = 0.4 and b = 0.2. This suggests that a larger sample size yields more precise estimates.
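The shrinking MAD_SD follows the familiar 1/sqrt(n) pattern: the standard error of the slope roughly halves each time n is quadrupled. A quick sketch of this (using lm() instead of stan_glm() so it runs instantly; not part of the original exercise):

```r
# Standard error of the slope for a given sample size n, using the
# same data-generating process as fake_data(0.4, 0.2, 0.5, n)
set.seed(3)
se_for_n <- function(n) {
  x <- runif(n, 0, 100)
  y <- 0.4 + 0.2*x + 0.5*rnorm(n)
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}
sapply(c(100, 400, 1600), se_for_n)
```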
#6.5 Regression prediction and averages: The heights and earnings data in Section 6.3 are in the folder Earnings. Download the data and compute the average height for men and women in the sample.
height | weight | male | earn | earnk | ethnicity | education | mother_education | father_education | walk | exercise | smokenow | tense | angry | age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
74 | 210 | 1 | 50000 | 50 | White | 16 | 16 | 16 | 3 | 3 | 2 | 0 | 0 | 45 |
66 | 125 | 0 | 60000 | 60 | White | 16 | 16 | 16 | 6 | 5 | 1 | 0 | 0 | 58 |
64 | 126 | 0 | 30000 | 30 | White | 16 | 16 | 16 | 8 | 1 | 2 | 1 | 1 | 29 |
65 | 200 | 0 | 25000 | 25 | White | 17 | 17 | NA | 8 | 1 | 2 | 0 | 0 | 57 |
63 | 110 | 0 | 50000 | 50 | Other | 16 | 16 | 16 | 5 | 6 | 2 | 0 | 0 | 91 |
68 | 165 | 0 | 62000 | 62 | Black | 18 | 18 | 18 | 1 | 1 | 2 | 2 | 2 | 54 |
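One way to compute these averages (a sketch; it assumes the data file is earnings.csv from the book's Earnings folder, with height in inches and male coded 1 for men and 0 for women):

```r
earnings <- read.csv("earnings.csv")
# Average height by sex
mean(earnings$height[earnings$male == 1])  # men
mean(earnings$height[earnings$male == 0])  # women
```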
The average height in the sample is 70.089 inches for men and 64.490 inches for women.
Use these averages and the fitted regression model displayed on page 84 to get a model-based estimate of the average earnings of men and of women in the population.
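The model on page 84 regresses earnings on height and a male indicator. Refitting it looks like this (a sketch; it assumes rstanarm is attached and the earnings data frame has already been read in):

```r
library(rstanarm)
# Regression of earnings on height and sex, as on page 84
fitted_regression <- stan_glm(earn ~ height + male, data = earnings)
```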
Stan sampled 4 chains of 2000 iterations each (1000 warmup, 1000 sampling); each chain finished in under 1.3 seconds. (Verbose per-chain progress output omitted.)
#use print to display results concisely
print(fitted_regression, digits=2)
stan_glm
family: gaussian [identity]
formula: earn ~ height + male
observations: 1816
predictors: 3
------
Median MAD_SD
(Intercept) -25748.75 11944.66
height 646.07 186.17
male 10606.51 1471.54
Auxiliary parameter(s):
Median MAD_SD
sigma 21406.93 358.07
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
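Plugging the sample-average heights into the fitted coefficients gives the model-based estimates (a sketch using the point estimates printed above; with the fitted object in hand, predict() on new data would give the same numbers along with their uncertainty):

```r
# Point estimates from the printed output: intercept, height, male
b <- c(-25748.75, 646.07, 10606.51)
# Model-based average earnings for men (height 70.089, male = 1)
b[1] + b[2]*70.089 + b[3]*1   # about $30,140
# Model-based average earnings for women (height 64.490, male = 0)
b[1] + b[2]*64.490            # about $15,916
```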