
Clean up your web-scraped HTML files

I recently embarked on a web scraping project. The results will follow on this blog in weeks to come. In the meantime, though, I want to share one little lesson I learned which may be helpful to others webscraping in R.

Imagine you have a vector of article links. You might start by scraping each with purrr and rvest. But say you only wanted to extract the body of each article (using the html tag most publishers use for body text, the p tag). You'd probably run something like this:

library(rvest)
library(purrr)
library(xml2)

# news article dummy links

articles <- c("https://www.mylocalpaper.com/scandal-pg-1",
              "https://www.yetanotherrag.com/celeb-journal")

# scraping function

get_article_text <- function(article) {
  read_html(article) %>% 
    html_nodes("p") %>% 
    html_text()
}

# purrrfect

article_text <- map(articles, ~ get_article_text(.x))

And that wouldn't be so bad. But you'd notice one particularly pesky inclusion across most newspaper articles: comments.

For my purposes, I had to strip out comments. Below is a function that allows the intrepid webscraper to do just that. For that matter, you could exclude any undesirable node the same way. In case a link can't be read at all, you can wrap the whole thing in purrr::possibly and set otherwise to NA_character_.

# exclude comments with xml_remove

text_no_comment <- possibly(
  function(article) {
    art_html <- read_html(article)
    xml2::xml_remove(art_html %>% html_nodes("#comments"))
    # with comments removed, now read the body section as character & strip html
    art_html %>% 
      html_nodes("p") %>% 
      html_text()
  }, 
  otherwise = NA_character_
)

article_text <- map(articles, ~ text_no_comment(.x))
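
And since xml_remove works on any set of nodes, the same pattern extends to whatever other clutter a site serves up. Here's a quick sketch along those lines (the ".footer" and ".related" selectors are hypothetical; substitute whatever your target pages actually use):

# strip several unwanted nodes before pulling the paragraph text

strip_nodes <- function(article, selectors = c("#comments", ".footer", ".related")) {
  art_html <- read_html(article)
  # remove each unwanted node set from the parsed document
  for (sel in selectors) {
    xml2::xml_remove(art_html %>% html_nodes(sel))
  }
  art_html %>% 
    html_nodes("p") %>% 
    html_text()
}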

I hope this little trick helps those looking to clean up their webscraping functions. Depending on your desired output, you could also swap map for purrr::map_chr.
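
One caveat: map_chr expects each call to return a single string, so you'd want to collapse the paragraphs first. A minimal sketch building on text_no_comment above:

# collapse each article's paragraphs so every element is a single string
article_text <- map_chr(articles, ~ {
  paras <- text_no_comment(.x)
  if (length(paras) == 1 && is.na(paras)) NA_character_ else paste(paras, collapse = "\n")
})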