---
title: "Adjusting regression models for overfitting in applied linguistics research"
author: "Phillip Hamrick, Ph.D., Associate Professor, PI, Language and Cognition Research Laboratory, Kent State University"
date: "November 19, 2018"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r load-libraries}
library(rms)      # ols() and validate()
library(ggplot2)  # scatterplot with fitted regression line
```

```{r simulation}
### This simulation models accuracy as a function of length of residence (lor).
### Its only aim is to show how an original model can overfit the data.

set.seed(0717)  # fix the random seed so the bootstrap results are replicable

d <- read.csv("simulation.csv")  # load the data frame
print(d)                         # view the data frame

# Construct the regression model; x = TRUE and y = TRUE store the design
# matrix and the response, which validate() needs for resampling.
mod <- ols(accuracy ~ lor, data = d, x = TRUE, y = TRUE)
mod  # original fit: R2 = 0.40, adjusted R2 = 0.378, g = 10.51

# Visual check: no extreme outliers, but a few possibly influential cases.
ggplot(d, aes(x = lor, y = accuracy)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE)

# Bootstrap validation with 5000 resamples.
# Optimism = 0.0540: the apparent R2 is overstated by about 0.05 because of
# overfitting. Optimism-corrected R2 = 0.3457.
validate(mod, method = "boot", B = 5000)
```
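To make the optimism figure less of a black box, the chunk below is a minimal sketch of the optimism bootstrap written out by hand in base R. It is not the code inside `rms::validate()`. The data frame `toy`, its coefficients, the helper `r2()`, and the choice of 500 resamples are all hypothetical stand-ins, because `simulation.csv` is not reproduced here. The way R2 is computed on the original sample (1 - SSE/SST from the bootstrap model's predictions) is likewise an assumption and may differ in detail from what `rms` does, so the numbers will not match the output above.

```{r optimism-sketch}
set.seed(717)

# Hypothetical stand-in data (NOT the contents of simulation.csv)
n <- 30
toy <- data.frame(lor = runif(n, min = 1, max = 20))
toy$accuracy <- 50 + 1.5 * toy$lor + rnorm(n, sd = 12)

# R2 of a fitted model's predictions, evaluated on any data set
r2 <- function(fit, data) {
  pred <- predict(fit, newdata = data)
  sse  <- sum((data$accuracy - pred)^2)
  sst  <- sum((data$accuracy - mean(data$accuracy))^2)
  1 - sse / sst
}

# Apparent performance: fit and evaluate on the same sample
fit_full <- lm(accuracy ~ lor, data = toy)
apparent <- r2(fit_full, toy)

# For each resample: refit, then compare R2 on the resample (training)
# with R2 on the original sample (test). The gap is that resample's optimism.
B <- 500
optimism_b <- replicate(B, {
  boot  <- toy[sample(nrow(toy), replace = TRUE), ]
  fit_b <- lm(accuracy ~ lor, data = boot)
  r2(fit_b, boot) - r2(fit_b, toy)
})

optimism  <- mean(optimism_b)     # average overstatement of R2
corrected <- apparent - optimism  # optimism-corrected R2

round(c(apparent = apparent, optimism = optimism, corrected = corrected), 4)
```

The three printed values should play the same roles as the `index.orig`, `optimism`, and `index.corrected` columns in the `validate()` output: the corrected R2 is the apparent R2 minus the average optimism across resamples.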