NCAA football model building part II

This is part two of our NCAA football model building. You can find part one here. Now is when it gets a little more exciting. The steps for today are quite simple:

read in data scraped in part 1
add a few variables
clean up the data
prepare for the dirty, dirty matching and modeling to come

Read in the old

Your author took liberties to save down .rds files of the data scraped in part one. We must import these files to begin.

ncaa_import <- readRDS("../../static/ncaa_import.rds")
fo_data <- readRDS("../../static/fo_data.rds")

(By the way, if you have questions about saving down files for import on blogdown, see here for helpful tips.)

Passer rating time

To build our models, we plan to capture offensive and defense team statistics, for the running and passing game. We might also use a pace statistic but don’t rush me here. The theory goes that we will use a combination of these statistics to predict lines, compare those lines to the sportsbooks and examine the differences where we see outliers.

To capture the passing game efficiently we use the old passer rating formula. We have taken the coefficients and weights from Wikipedia, venerable sabermetric almanac that it is.

Since this applies to both offense and defense, let’s write a function to calculate the passer rating for us.

passer_rating <- function(df, pass_com, pass_att, yds_per_att, pass_td, int) {
  
  a <- (df[[pass_com]] / df[[pass_att]] - 0.3) * 5
  b <- (df[[yds_per_att]] - 3) * 0.25
  c <- (df[[pass_td]] / df[[pass_att]]) * 20
  d <- 2.375 - (df[[int]] / df[[pass_att]] * 25)
  
  passer_rating <- ((a + b + c + d) / 6) * 100
  
  rm(a, b, c, d)
  
  return(passer_rating)
  
}

Ain’t she a beaut!? Not really. In any case, we will now take our tidy data, make it wide, calculate offensive and defensive passer rating stats, select variables of interest from the NCAA data and massage our quads in ebullient celebration.

# load, spread, calc

library(tidyverse)

ncaa_df <- ncaa_import %>% 
  mutate(cat_stat = paste0(category, "_", stat)) %>% # 'cause we can't have duplicate col names
  select(-category, -stat) %>% 
  spread(cat_stat, value) %>% 
  mutate(pass_o_pass_rt = passer_rating(.,
                                          "pass_o_pass_com", 
                                          "pass_o_pass_att", 
                                          "pass_o_yds_per_att", 
                                          "pass_o_pass_td", 
                                          "pass_o_int"),
         pass_d_pass_rt = passer_rating(.,
                                          "pass_d_yds_opp_cpl",
                                          "pass_d_yds_opp_pass",
                                          "pass_d_yds_yds_per_att",
                                          "pass_d_yds_opp_pass_tds",
                                          "pass_d_int_opp_int"))

Now we’ve got to select variables of interest: no mean feat on 42 variables. Here are the options.

names(ncaa_df)

##  [1] "team"                    "pass_d_int_div"         
##  [3] "pass_d_int_g"            "pass_d_int_int_ret_tds" 
##  [5] "pass_d_int_int_ret_yds"  "pass_d_int_opp_int"     
##  [7] "pass_d_int_opp_pass"     "pass_d_yds_div"         
##  [9] "pass_d_yds_g"            "pass_d_yds_opp_cpl"     
## [11] "pass_d_yds_opp_pass"     "pass_d_yds_opp_pass_tds"
## [13] "pass_d_yds_opp_pass_yds" "pass_d_yds_yds_per_att" 
## [15] "pass_d_yds_yds_per_comp" "pass_d_yds_ypg"         
## [17] "pass_o_div"              "pass_o_g"               
## [19] "pass_o_int"              "pass_o_pass_att"        
## [21] "pass_o_pass_com"         "pass_o_pass_td"         
## [23] "pass_o_pass_yds"         "pass_o_yds_per_att"     
## [25] "pass_o_yds_per_comp"     "pass_o_ypg"             
## [27] "run_d_div"               "run_d_g"                
## [29] "run_d_opp_rush"          "run_d_opp_rush_tds"     
## [31] "run_d_opp_rush_yds"      "run_d_yds_per_rush"     
## [33] "run_d_ypg"               "run_o_div"              
## [35] "run_o_g"                 "run_o_rush"             
## [37] "run_o_rush_td"           "run_o_rush_yds"         
## [39] "run_o_yds_per_rush"      "run_o_ypg"              
## [41] "pass_o_pass_rt"          "pass_d_pass_rt"

Alright. Time to kill the weak and irrelevant.

# lock up the good columns and re-tidy the mighty

ncaa_df <- ncaa_df %>% 
  select(team,
         pass_o_pass_rt,
         pass_o_pass_att,
         run_o_yds_per_rush,
         run_o_rush,
         pass_d_pass_rt, 
         pass_d_int_opp_pass,
         run_d_yds_per_rush,
         run_d_opp_rush,
         pass_d_int_div) %>% 
  gather(cat_stat, value, -team) %>% 
  separate(cat_stat, into = c("category", "stat"), sep = "(?<=o|d)_") %>% 
  as_tibble()

# show what we've got

ncaa_df

## # A tibble: 2,268 x 4
##    team              category stat    value
##    <chr>             <chr>    <chr>   <dbl>
##  1 abilene christian pass_o   pass_rt  78.3
##  2 air force         pass_o   pass_rt  86.6
##  3 akron             pass_o   pass_rt  82.3
##  4 alabama           pass_o   pass_rt 112. 
##  5 alabama a&m       pass_o   pass_rt  55.8
##  6 alabama st.       pass_o   pass_rt  56.5
##  7 albany (ny)       pass_o   pass_rt  70.2
##  8 alcorn            pass_o   pass_rt  88.4
##  9 appalachian st.   pass_o   pass_rt 105. 
## 10 arizona           pass_o   pass_rt  90.0
## # ... with 2,258 more rows

We are now positioned to begin the dirty dirty dirty work.

(The regex here was actually the most laborious process of the post. Your author took his seat court side at the regex games and learned a thing or two about look arounds. Regex is a power player.)

The name game

One of the endless delights in data cleaning is name matching. Pulling as we are from two sources, the NCAA and Football Outsiders, it comes as no surprise that there isn’t pure consistency in the names of universities. We must rectify this issue before moving on.

A perusal of both datasets shows that the “state” vs. “st.” spelling might cause a lot of problems. After fixing that, an anti-join will reveal the remaining labor.

# create a team name key from NCAA division 1 programs as FO only has D1 data

ncaa_names <- ncaa_df %>% 
  filter(stat %in% "int_div" & value == 1) %>% 
  select(team) %>% 
  distinct()

# find the missing pieces

missing_names <- ncaa_names %>% 
  anti_join(fo_data %>% mutate(team = gsub("state", "st.", team)))

missing_names %>% print(n = Inf)

## # A tibble: 20 x 1
##    team               
##    <chr>              
##  1 army west point    
##  2 central mich.      
##  3 eastern mich.      
##  4 fiu                
##  5 fla. atlantic      
##  6 ga. southern       
##  7 la.-monroe         
##  8 louisiana          
##  9 miami (fl)         
## 10 miami (oh)         
## 11 middle tenn.       
## 12 nc state           
## 13 northern ill.      
## 14 south fla.         
## 15 southern california
## 16 southern miss.     
## 17 ucf                
## 18 uconn              
## 19 western ky.        
## 20 western mich.

# manual correction for missing names

fo_data <- fo_data %>% 
  mutate(team = gsub("state", "st.", team),
         team = gsub("army", "army west point", team),
         team = gsub("central michigan", "central mich.", team),
         team = gsub("eastern michigan", "eastern mich.", team),
         team = gsub("florida international", "fiu", team),
         team = gsub("florida atlantic", "fla. atlantic", team),
         team = gsub("georgia southern", "ga. southern", team),
         team = gsub("ul-monroe", "la.-monroe", team),
         team = gsub("ul-lafayette", "louisiana", team),
         team = gsub("miami-fl", "miami (fl)", team),
         team = gsub("miami-oh", "miami (oh)", team),
         team = gsub("middle tennessee", "middle tenn.", team),
         team = gsub("nc st.", "nc state", team),
         team = gsub("northern illinois", "northern ill.", team),
         team = gsub("south florida", "south fla.", team),
         team = gsub("usc", "southern california", team),
         team = gsub("southern miss", "southern miss.", team),
         team = gsub("central florida", "ucf", team),
         team = gsub("connecticut", "uconn", team),
         team = gsub("western kentucky", "western ky.", team),
         team = gsub("western michigan", "western mich.", team))

# check again

missing_names_fix <- ncaa_names %>% 
  anti_join(fo_data)

dim(missing_names_fix)

## [1] 0 1

The deed is done. We can now bind to produce our final stats data frame. Gloria in excelsis deo.

# I see you binding out on the floor

fbs_data <- bind_rows(ncaa_df, fo_data) %>% 
  filter(!stat %in% "int_div") # drop division stat

# A final look at the categories

fbs_data %>% 
  group_by(category) %>% 
  tally(sort = TRUE)

## # A tibble: 6 x 2
##   category     n
##   <chr>    <int>
## 1 off        520
## 2 pass_d     504
## 3 pass_o     504
## 4 run_d      504
## 5 run_o      504
## 6 def        390

# A final look at all the stats

fbs_data %>% 
  group_by(stat) %>% 
  tally(sort = TRUE)

## # A tibble: 10 x 2
##    stat                  n
##    <chr>             <int>
##  1 pass_rt             504
##  2 yds_per_rush        504
##  3 pass_down_sp_plus   260
##  4 pass_sp_plus        260
##  5 rush_sp_plus        260
##  6 int_opp_pass        252
##  7 opp_rush            252
##  8 pass_att            252
##  9 rush                252
## 10 adj_pace            130

Thanks for sticking around. Up next, in the much-vaunted part 3 (?), we integrate schedule data and kick off the modeling process. Until then…