Gelman Chapter 6
#6.2 Programming fake-data simulation: Simulate n data points from the model y = a + b*x + error, with the x values uniformly sampled from the range (0, 100) and the errors drawn independently from a normal distribution with mean 0 and standard deviation sigma. Fit a linear regression to the simulated data, and make a scatterplot of the data and fitted regression line.
library(rstanarm)  # provides stan_glm()

set.seed(3)
# Simulate n data points from y = a + b*x + error, fit a regression,
# and plot the data with the fitted line
fake_data <- function(a, b, sigma, n) {
  # Sample the predictor uniformly from the range (0, 100)
  x <- runif(n, 0, 100)
  # Generate the outcome from the linear model with normal errors
  y <- a + b*x + sigma*rnorm(n)
  random <- data.frame(x, y)
  # Fit the regression with stan_glm()
  fitted_random <- stan_glm(y ~ x, data = random)
  # Display the results concisely
  print(fitted_random, digits = 3)
  # Scatterplot of the data with the fitted line overlaid
  plot(random$x, random$y, main = "Data generated and fitted regression line")
  a_hat <- coef(fitted_random)[1]
  b_hat <- coef(fitted_random)[2]
  abline(a_hat, b_hat)
  # Label the fitted line with its equation at the mean of x
  x_bar <- mean(random$x)
  text(x_bar, a_hat + b_hat*x_bar,
       paste("y =", round(a_hat, 2), "+", round(b_hat, 2), "* x"), adj = 0)
}
set.seed(3)
fake_data(0.4,0.2,0.5,100)
#6.3 Variation, uncertainty, and sample size: Repeat the example in Section 6.2, varying the number of data points, n. What happens to the parameter estimates and uncertainties when you increase the number of observations?
set.seed(3)
fake_data(0.4,0.2,0.5,175)
set.seed(3)
fake_data(0.4,0.2,0.5,215)
set.seed(3)
fake_data(0.4,0.2,0.5,275)
As I ran each code chunk with a larger n, the MAD_SD values (the uncertainty in the coefficient estimates) decreased, while the point estimates stayed close to the true values a = 0.4 and b = 0.2. This suggests that a larger sample size yields more precise estimates.
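The shrinking MAD_SD follows the familiar 1/sqrt(n) pattern: the standard error of the slope roughly halves each time n is quadrupled. A quick sketch of this (using lm() instead of stan_glm() so it runs instantly; not part of the original exercise):

```r
# Standard error of the slope for a given sample size n, using the
# same data-generating process as fake_data(0.4, 0.2, 0.5, n)
set.seed(3)
se_for_n <- function(n) {
  x <- runif(n, 0, 100)
  y <- 0.4 + 0.2*x + 0.5*rnorm(n)
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}
sapply(c(100, 400, 1600), se_for_n)
```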
#6.5 Regression prediction and averages: The heights and earnings data in Section 6.3 are in the folder Earnings. Download the data and compute the average height for men and women in the sample.
height | weight | male | earn | earnk | ethnicity | education | mother_education | father_education | walk | exercise | smokenow | tense | angry | age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
74 | 210 | 1 | 50000 | 50 | White | 16 | 16 | 16 | 3 | 3 | 2 | 0 | 0 | 45 |
66 | 125 | 0 | 60000 | 60 | White | 16 | 16 | 16 | 6 | 5 | 1 | 0 | 0 | 58 |
64 | 126 | 0 | 30000 | 30 | White | 16 | 16 | 16 | 8 | 1 | 2 | 1 | 1 | 29 |
65 | 200 | 0 | 25000 | 25 | White | 17 | 17 | NA | 8 | 1 | 2 | 0 | 0 | 57 |
63 | 110 | 0 | 50000 | 50 | Other | 16 | 16 | 16 | 5 | 6 | 2 | 0 | 0 | 91 |
68 | 165 | 0 | 62000 | 62 | Black | 18 | 18 | 18 | 1 | 1 | 2 | 2 | 2 | 54 |
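One way to compute these averages (a sketch; it assumes the data file is earnings.csv from the book's Earnings folder, with height in inches and male coded 1 for men and 0 for women):

```r
earnings <- read.csv("earnings.csv")
# Average height by sex
mean(earnings$height[earnings$male == 1])  # men
mean(earnings$height[earnings$male == 0])  # women
```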
The average height in the sample is 70.089 inches for men and 64.490 inches for women.
Use these averages and the fitted regression model displayed on page 84 to get a model-based estimate of the average earnings of men and of women in the population.
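The model on page 84 regresses earnings on height and a male indicator. Refitting it looks like this (a sketch; it assumes rstanarm is attached and the earnings data frame has already been read in):

```r
library(rstanarm)
# Regression of earnings on height and sex, as on page 84
fitted_regression <- stan_glm(earn ~ height + male, data = earnings)
```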
Stan sampled 4 chains of 2000 iterations each (1000 warmup, 1000 sampling); each chain finished in under 1.3 seconds. (Verbose per-chain progress output omitted.)
#use print to display results concisely
print(fitted_regression, digits=2)
stan_glm
family: gaussian [identity]
formula: earn ~ height + male
observations: 1816
predictors: 3
------
Median MAD_SD
(Intercept) -25748.75 11944.66
height 646.07 186.17
male 10606.51 1471.54
Auxiliary parameter(s):
Median MAD_SD
sigma 21406.93 358.07
------
* For help interpreting the printed output see ?print.stanreg
* For info on the priors used see ?prior_summary.stanreg
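Plugging the sample-average heights into the fitted coefficients gives the model-based estimates (a sketch using the point estimates printed above; with the fitted object in hand, predict() on new data would give the same numbers along with their uncertainty):

```r
# Point estimates from the printed output: intercept, height, male
b <- c(-25748.75, 646.07, 10606.51)
# Model-based average earnings for men (height 70.089, male = 1)
b[1] + b[2]*70.089 + b[3]*1   # about $30,140
# Model-based average earnings for women (height 64.490, male = 0)
b[1] + b[2]*64.490            # about $15,916
```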