---
title: "Adjusting regression models for overfitting in applied linguistics research"
author: "Phillip Hamrick, Ph.D., Associate Professor, PI, Language and Cognition Research Laboratory, Kent State University"
date: "November 19, 2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r load-libraries}
library(rms)
library(ggplot2)
```

```{r simulation}
# This simulation models accuracy as a function of length of residence (lor).
# Its purpose is to show how a model fit to a single sample can overfit that sample.

set.seed(717) # fix the random seed so the bootstrap results below are reproducible

d <- read.csv("simulation.csv") # load the data frame

print(d) #view the data frame

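# Fit the regression model; x = TRUE and y = TRUE store the design matrix
# and the response in the fit object, which validate() needs for resampling
mod <- ols(accuracy ~ lor, data = d, x = TRUE, y = TRUE)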

mod # apparent (original-sample) fit: R2 = 0.40, adjusted R2 = 0.378, g = 10.51

# Scatterplot with the fitted line: no extreme outliers, but a few possibly influential cases
ggplot(d, aes(x = lor, y = accuracy)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE)

validate(mod, method = "boot", B = 5000) # bootstrap validation with 5000 resamples
# optimism = 0.0540: about 0.05 of the apparent R2 is attributable to overfitting
# optimism-corrected R2 = 0.3457 (apparent R2 minus optimism)

```
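
The optimism reported by `validate()` can be easier to interpret once the resampling logic is written out by hand. The chunk below is a minimal sketch of an optimism bootstrap for R2 using base R only. It is not the `rms` implementation, and the helper names `r2()` and `boot_optimism()` are hypothetical. Because `simulation.csv` is not reproduced here, the sketch runs on simulated stand-in data (`demo`), so its numbers will not match the values reported above.

```{r optimism-by-hand}
# Hypothetical helper: R2 of a fitted model when it predicts `response` in `data`
r2 <- function(fit, data, response) {
  y <- data[[response]]
  pred <- predict(fit, newdata = data)
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# Hypothetical helper: optimism bootstrap for R2
boot_optimism <- function(formula, data, B = 200) {
  response <- all.vars(formula)[1]
  # Apparent R2: the model is fit and evaluated on the same sample
  apparent <- r2(lm(formula, data = data), data, response)
  optimism <- replicate(B, {
    # Resample rows with replacement and refit the model
    boot <- data[sample(nrow(data), replace = TRUE), , drop = FALSE]
    fit_b <- lm(formula, data = boot)
    # Optimism for this resample: R2 on the bootstrap sample minus
    # R2 of the same model applied to the original sample
    r2(fit_b, boot, response) - r2(fit_b, data, response)
  })
  c(apparent  = apparent,
    optimism  = mean(optimism),
    corrected = apparent - mean(optimism))
}

# Simulated stand-in data (an assumption for illustration, not the study data)
set.seed(717)
demo <- data.frame(lor = runif(30, min = 0, max = 20))
demo$accuracy <- 50 + 1.5 * demo$lor + rnorm(30, sd = 12)

res <- boot_optimism(accuracy ~ lor, demo, B = 200)
res
```

The corrected value is the apparent R2 minus the average optimism across resamples, which is the same arithmetic that links the apparent R2, the optimism, and the corrected R2 in the `validate()` output.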