5 min read

NCAA football model building part III

Alright y’all. Welcome to part 3. You can find parts one and two at the links. Let us stay excited for the fitness you’ve come to witness.

Today’s plan proceeds thus:

  • import last year’s NCAA FBS schedule
  • bring in FBS data from part two and normalize data
  • determine match ups and add in team data

Please note that this isn’t final as we still lack market information. However, this is a terrific beginning. In case any readers have a better implementation of the final look up, please hit me on Twitter.

Importing the NCAA schedule

library(tidyverse)

# set links to download and create tibbles

divs <- c("FBS", "FCS")

links <- paste0("http://fs.ncaa.org/Docs/stats/football_records/sched/2017/", divs, ".csv")

# import schedule for all divisions

schedule <- map_df(seq_along(links), ~ 
                     read_csv(links[.x]) %>% 
                     mutate(div = divs[.x]) %>% 
                     select_all(tolower) %>% 
                     mutate(game_date = lubridate::mdy(game_date)) %>% 
                     mutate_if(is.character, tolower))

# create matchup df

schedule_matchup <- schedule %>% 
  mutate(matchup = ifelse(loc == "h", 
                          paste0(org_name, "_", opponent_name), 
                          paste0(opponent_name, "_", org_name))) %>% 
  separate(matchup, into = c("home", "away"), sep = "_") %>% 
  select(date = game_date, home, away) %>% 
  distinct()

schedule_matchup
## # A tibble: 1,510 x 3
##    date       home         away           
##    <date>     <chr>        <chr>          
##  1 2017-09-02 air force    vmi            
##  2 2017-09-16 michigan     air force      
##  3 2017-09-23 air force    san diego st.  
##  4 2017-09-30 new mexico   air force      
##  5 2017-10-07 navy         air force      
##  6 2017-10-14 air force    unlv           
##  7 2017-10-20 nevada       air force      
##  8 2017-10-28 colorado st. air force      
##  9 2017-11-04 air force    army west point
## 10 2017-11-11 air force    wyoming        
## # ... with 1,500 more rows

Bring in FBS data

We’ve got the schedule pulled and cleaned. On to the next one, which, in this case, means importing the FBS statistics we pulled in part two. Your author took the liberty to save down the needed data from part two, so a simple readRDS should do the trick.

This gets a little nasty as we apply some functionals to create new columns. Once again, astute readers please let me know if you see opportunities for improving this code.

# get opponent stats, first nest

sched_nest <- schedule %>% nest(-org_name)

# read in fbs_data

fbs_data <- readRDS("../../static/fbs_data.rds")

# make wide

fbs_wide <- fbs_data %>% 
  mutate(cat_stat = paste0(category, "_", stat)) %>% 
  select(team, cat_stat, value) %>% 
  spread(cat_stat, value)

# add in opponent stats

fbs_wide$opp_pass_d_pass_rt <- map_dbl(seq_along(fbs_wide$team), ~ mean(fbs_wide[fbs_wide$team %in% sched_nest[, "data"][[1]][[.x]]$opponent_name, "pass_d_pass_rt"][[1]], na.rm = TRUE))

fbs_wide$opp_run_d_yds_per_rush <-  map_dbl(seq_along(fbs_wide$team), ~ mean(fbs_wide[fbs_wide$team %in% sched_nest[, "data"][[1]][[.x]]$opponent_name, "run_d_yds_per_rush"][[1]], na.rm = TRUE))

fbs_wide$opp_pass_o_pass_rt <- map_dbl(seq_along(fbs_wide$team), ~ mean(fbs_wide[fbs_wide$team %in% sched_nest[, "data"][[1]][[.x]]$opponent_name, "pass_o_pass_rt"][[1]], na.rm = TRUE))

fbs_wide$opp_run_o_yds_per_rush <- map_dbl(seq_along(fbs_wide$team), ~ mean(fbs_wide[fbs_wide$team %in% sched_nest[, "data"][[1]][[.x]]$opponent_name, "run_o_yds_per_rush"][[1]], na.rm = TRUE))

# calc normalized stats

fbs_wide_ncaa <- fbs_wide %>% 
  mutate(norm_pass_o_pass_rt = pass_o_pass_rt - opp_pass_d_pass_rt,
         norm_run_o_yds_per_rush = run_o_yds_per_rush - opp_run_d_yds_per_rush,
         norm_pass_d_pass_rt = pass_d_pass_rt - opp_pass_o_pass_rt,
         norm_run_d_yds_per_rush = run_d_yds_per_rush - run_o_yds_per_rush) %>% 
  select(team, norm_pass_o_pass_rt, norm_run_o_yds_per_rush, 
         norm_pass_d_pass_rt, norm_run_d_yds_per_rush) %>% 
  gather(stat, value, -team) %>% 
  group_by(stat) %>% 
  mutate(value = case_when(
    is.na(value) ~ quantile(value, 0.05, na.rm = TRUE),
    TRUE ~ value)) %>% 
  spread(stat, value)

fbs_wide_ncaa
## # A tibble: 253 x 5
##    team  norm_pass_d_pas~ norm_pass_o_pas~ norm_run_d_yds_~
##    <chr>            <dbl>            <dbl>            <dbl>
##  1 abil~             8.29            -8.72            1.73 
##  2 air ~            10.6             -1.61            1.11 
##  3 akron            -6.14            -2.22            1.82 
##  4 alab~           -31.6             22.3            -3.01 
##  5 alab~            -6.30           -28.5             1.28 
##  6 alab~           -27.5            -27.8             0.210
##  7 alba~            -9.88            -9.35            0.440
##  8 alco~            -7.83            -4.39           -2.4  
##  9 appa~            -9.97            14.6            -1.59 
## 10 ariz~            -6.75             5.56           -1.95 
## # ... with 243 more rows, and 1 more variable:
## #   norm_run_o_yds_per_rush <dbl>

Weekly Matchups with data

Now we will compute our game statistics for each match up and add to the data frame. This function isn’t the cleanest, and any readers with ideas for improvement please let me know on Twitter.

# write function for normalizing

team_stat <- function(df, team_one = NULL, team_two = NULL, 
                      stat_one = NULL, stat_two = NULL) {
  team_one <- enquo(team_one)
  team_two <- enquo(team_two)
  stat_one <- enquo(stat_one)
  stat_two <- enquo(stat_two)

  # offensive stats
  team_one_stat_sample <- df %>% 
    filter(team %in% !!team_one) %>% 
    pull(!!stat_one)
  
  home_stats_pop <- df %>% 
    select(!!stat_one) %>% 
    summarize(mean = mean(.[[1]], na.rm = TRUE),
              sd  = sd(.[[1]], na.rm = TRUE))
  # away stats
  team_two_stat_sample <- df %>% 
    filter(team %in% !!team_two) %>% 
    pull(!!stat_two)
  
  away_stats_pop <- df %>% 
    select(!!stat_two) %>% 
    summarize(mean = mean(.[[1]], na.rm = TRUE),
              sd   = sd(.[[1]], na.rm = TRUE))
  # calcuate statistic
  calc_stat <- ((team_one_stat_sample + team_two_stat_sample) -
               (home_stats_pop$mean + away_stats_pop$mean)) /
               sqrt(home_stats_pop$sd ^ 2 + away_stats_pop$sd ^ 2)
  
  rm(team_one_stat_sample, home_stats_pop, team_two_stat_sample, away_stats_pop)
  
  return(calc_stat)
}

Not the prettiest, but it gets the job done. Speaking of not pretty, wait until you feast your eyes on the code below. The difficulty encountered here is applying the function to each team, home and away, for each week. It takes an eternity to run but seems to work.

# create df with data by matchup for each game in 2017

fbs_full <- schedule_matchup %>% 
  mutate(home_pass_norm = map2(home, away, ~ 
           team_stat(fbs_wide_ncaa, .x, .y, "norm_pass_o_pass_rt", "norm_pass_d_pass_rt")) %>% 
           lapply(., `length<-`, max(lengths(.))) %>% 
           unlist(),
         away_pass_norm = map2(away, home, ~ 
           team_stat(fbs_wide_ncaa, .x, .y, "norm_pass_o_pass_rt", "norm_pass_d_pass_rt")) %>% 
           lapply(., `length<-`, max(lengths(.))) %>% 
           unlist(),
         home_run_norm = map2(home, away, ~ 
           team_stat(fbs_wide_ncaa, .x, .y, "norm_run_o_yds_per_rush", "norm_run_d_yds_per_rush")) %>% 
           lapply(., `length<-`, max(lengths(.))) %>% 
           unlist(),
         away_run_norm = map2(away, home, ~ 
           team_stat(fbs_wide_ncaa, .x, .y, "norm_run_o_yds_per_rush", "norm_run_d_yds_per_rush")) %>% 
           lapply(., `length<-`, max(lengths(.))) %>% 
           unlist())

fbs_full
## # A tibble: 1,510 x 7
##    date       home  away  home_pass_norm away_pass_norm home_run_norm
##    <date>     <chr> <chr>          <dbl>          <dbl>         <dbl>
##  1 2017-09-02 air ~ vmi           1.06           -0.959        1.31  
##  2 2017-09-16 mich~ air ~        -0.357          -0.899        0.564 
##  3 2017-09-23 air ~ san ~        -0.0181          1.10        -0.847 
##  4 2017-09-30 new ~ air ~        -0.693           0.977        1.34  
##  5 2017-10-07 navy  air ~        -0.0341          1.18         1.54  
##  6 2017-10-14 air ~ unlv          0.556           0.161        0.312 
##  7 2017-10-20 neva~ air ~         0.965           1.11         0.907 
##  8 2017-10-28 colo~ air ~         0.991          -0.559        1.27  
##  9 2017-11-04 air ~ army~         0.127          -2.11        -0.0802
## 10 2017-11-11 air ~ wyom~        -0.833           1.01         0.523 
## # ... with 1,500 more rows, and 1 more variable: away_run_norm <dbl>

For now, this is all. The next steps are, 1) pulling in market information and 2) translating our normalized scores into expected lines to judge against the market information. With that done we can look for the outliers and plan accordingly.