5 min read

A Few Bars Blues: Part I

Rejoice readers! The dearly beloved GLORIA St. Louis Blues made the Stanley Cups finals after a convincing drubbing of the San Jose Sharts. You will no doubt recall the hockey team abbreviated STL is the oldest active NHL team to never have won a Stanley Cup.

The purpose of our next few posts is to examine the common narrative about the Blues that they went from being the NHL’s worst team in the 1st half of the year to being the NHL’s best team in the 2nd half of the year. Here are the steps we will take to do so, with the help of our indomitable friend R.

Today, on the day of the first Stanley Cup game, we will do two things.

  • Pull all game data for each team for the 2018-2019 season
  • Establish basic performance facts

And, as usual, here is the moneymaker chart from the day’s proceedings.

library(tidyverse)
library(rvest)
library(lubridate)
library(zoo)
extrafont::loadfonts(device = "win")

theme_set(theme_minimal(base_family = "Gill Sans MT"))

blues_colors <- c("#002e88", "#ffc72b", "#304767")

Pulling the game data

The lovely hockey-reference.com features all our required information. Let us pull.

# retrieve three digit codes for each team

link <- read_html("https://www.hockey-reference.com/leagues/NHL_2019_standings.html")

# select the standings table and pull the underlying links

full_names <- link %>% 
  html_node(xpath = '//*[@id="standings"]') %>% 
  html_table() %>% 
  .[, 2]

three_letters <- link %>% 
  html_node(xpath = '//*[@id="standings"]') %>% 
  html_nodes("a") %>% 
  html_attr("href") %>% 
  gsub("[^A-Z]", "", .)

team_name_df <- tibble(abbr = three_letters,
                       team = full_names)

# build our link list

link_list <- paste0('https://www.hockey-reference.com/teams/', team_name_df$abbr, '/2019_gamelog.html')

# pull data

season_stats <- map_df(seq_along(link_list), ~ {
       
       stats <- read_html(link_list[.x]) %>% 
         html_nodes("table") %>% 
         html_table(header = FALSE) %>%
         bind_rows(.id = "epoch")
       
       stats <- stats[, colSums(!is.na(stats)) != 0]
       
       names(stats) <- c(
         "epoch", "gp", "date", "h_a", "opponent", "gf", "ga",
         "result", "ot", "t_s", "t_pim", "t_ppg", "t_ppo", "t_shg",
         "o_s", "o_pim", "o_ppg", "o_ppo", "o_shg", "a_cf", "a_ca",
         "a_cf_pct", "a_ff", "a_fa", "a_ff_pct", "a_fow", "a_fol", 
         "a_fo_pct", "a_ozs_pct", "a_pdo"
       )
       
       stats <- as_tibble(stats[-c(1:2), ]) %>% 
         filter(!grepl("Opponent", opponent), nchar(opponent) != 0) %>% 
         mutate(team = tolower(team_name_df$team[.x]),
                opponent = tolower(opponent),
                gp = as.numeric(gp),
                h_a = ifelse(nchar(h_a) == 0, "home", "away"),
                result = ifelse(result == "L", 0, 1),
                ot = case_when(
                  ot == "OT" ~ "ot",
                  ot == "SO" ~ "so",
                  TRUE ~ "reg"
                  ),
                epoch = case_when(
                  epoch == 1 ~ "rs",
                  epoch == 2 ~ "playoff"
                  ),
                points = case_when(
                  epoch == "rs" & result == 1 ~ 2,
                  epoch == "rs" & result == 0 & ot %in% c("ot", "so") ~ 1,
                  TRUE ~ 0),
                points_total = cumsum(points)
                ) %>%  
         select(epoch:h_a, team, opponent, result, ot, points_total, gf:a_pdo) %>% 
         mutate_at(vars(gf:a_pdo), list(as.numeric))
       }
       )

With all the data pulled let’s check to make sure it is accurate.

season_stats %>% 
  group_by(team) %>% 
  summarize(pts = max(points_total)) %>% 
  arrange(-pts)
## # A tibble: 31 x 2
##    team                  pts
##    <chr>               <dbl>
##  1 tampa bay lightning   128
##  2 boston bruins         107
##  3 calgary flames        107
##  4 washington capitals   104
##  5 new york islanders    103
##  6 san jose sharks       101
##  7 nashville predators   100
##  8 pittsburgh penguins   100
##  9 toronto maple leafs   100
## 10 carolina hurricanes    99
## # ... with 21 more rows

Establishing basic performance facts

Now for fun, let’s plot each team’s cumulative points over the season.

reg_season <- season_stats %>% filter(epoch == "rs")

ggplot() +
  geom_line(data = reg_season %>% filter(!team %in% "st. louis blues"), 
       aes(gp, points_total, group = team),
       color = "grey") +
  geom_line(data = reg_season %>% filter(team %in% "st. louis blues"), 
       aes(gp, points_total, group = team),
       color = blues_colors[1],
       size = 2) +
  labs(x = "Games Played",
       y = "Points",
       title = "Cumulative Points by NHL Team in 2018-2019",
       subtitle = "St. Louis Blues highlighted in blue",
       caption = "Sources: verbumdata.netlify.com, hockey-reference.com")

Just after the 50th game, the Blues really accelerated. But we can put a finer grain on the Blues’s acceleration. Let’s look at the rolling 5 game slope of performance. The Blues won 11 consecutive games from 2019-01-23 to 2019-02-19! That is quite a run.

reg_season %>% 
  filter(team %in% "st. louis blues") %>% 
  mutate(slope = points_total - lag(points_total, 1),
         roll_slope = rollmean(slope, 5, na.pad = TRUE, align = "right")
  ) %>% 
  ggplot(aes(gp, roll_slope)) +
  geom_line(color = "grey") +
  geom_smooth(method = "loess", se = FALSE, color = blues_colors[1]) +
  geom_curve(data = reg_season %>% filter(team %in% "st. louis blues"),
             aes(x = 40, xend = 52,
                 y = 1.85, yend = 2),
             color = blues_colors[2],
             curvature = -0.5,
             arrow = arrow(length = unit(0.25, "cm"),
                           type = "closed")) +
  annotate("text", x = 40, y = 1.75, 
           label = str_wrap("A slope of 2 on this graph means 5 consecutive wins", 20),
           size = 3,
           color = blues_colors[3],
           hjust = 0.5) +
  labs(x = "Games Played",
       y = "Slope",
       title = "St. Louis Blues's Rolling Slope of Cumulative Points",
       subtitle = "5-Game Rolling Average of Slope",
       caption = "Sources: verbumdata.netlify.com, hockey-reference.com") 
## Warning: Removed 5 rows containing non-finite values (stat_smooth).
## Warning: Removed 5 rows containing missing values (geom_path).

We’ve established basic performance facts. We know that the Blues dramatically accelerated their play over the first few months of 2019. Over the course of the Stanley Cup Finals, we will dive into the reasons behind their elevation. The next tasks are as follows:

  • Look at the Blues’s dynamics over the season
  • Compare changes witnessed in STL with other teams in the league

Thank you all for reading.