
Scraping Data From Tables On Multiple Web Pages In R (Football Players)

I'm working on a project for school where I need to collect the career statistics for individual NCAA football players. The data for each player is in this format: http://www.sport…

Solution 1:

Here's how you can easily get all the data in all the tables on all the player pages...

First make a list of the URLs for all the players' pages...

require(RCurl); require(XML)
n <- length(letters) 
# pre-allocate list to fill
links <- vector("list", length = n)
for(i in 1:n){
  print(i) # keep track of what the function is up to
  # get all html on each page of the a-z index pages
  inx_page <- htmlParse(getURI(paste0("http://www.sports-reference.com/cfb/players/", letters[i], "-index.html")))
  # scrape URLs for each player from each index page
  lnk <- unname(xpathSApply(inx_page, "//a/@href"))
  # skip first 63 and last 10 links as they are constant on each page
  lnk <- lnk[-c(1:63, (length(lnk)-10):length(lnk))]
  # only keep links that go to players (exclude schools)
  lnk <- lnk[grep("players", lnk)]
  # now we have a list of all the URLs to all the players on that index page,
  # but the URLs are incomplete, so let's complete them so we can use them
  # from anywhere
  links[[i]] <- paste0("http://www.sports-reference.com", lnk)
}
# unlist into a single character vector
links <- unlist(links)

Now we have a vector of some 67,000 URLs (seems like a lot of players, can that be right? A quick check is sketched below).
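For instance, to eyeball the result before committing to a long scrape (nothing here is specific to this site, just base R):

length(links)  # should be on the order of 67,000
head(links)    # spot-check that the URLs are complete and well-formed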

Second, scrape all the tables at each URL to get their data, like so:

# Go to each URL in the list and scrape all the data from the tables.
# This will take some time... don't interrupt it!
# pre-allocate list
all_tables <- vector("list", length = length(links))
for(i in 1:length(links)){
  print(i)
  # error handling - skips to the next URL if it gets an error
  result <- try(
    all_tables[[i]] <- readHTMLTable(links[i], stringsAsFactors = FALSE))
  if(class(result) == "try-error") next
}
# Put player names in the list so we know who the data belong to:
# extract names from the URLs to their stats pages...
toMatch <- c("http://www.sports-reference.com/cfb/players/", "-1.html")
player_names <- unique(gsub(paste(toMatch, collapse = "|"), "", links))
# assign player names to the list of tables
names(all_tables) <- player_names
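As an aside, if you prefer tryCatch() to try(), an equivalent loop might look like this (a sketch along the same lines, not tested against the live site):

all_tables <- vector("list", length = length(links))
for(i in seq_along(links)){
  print(i)
  result <- tryCatch(
    readHTMLTable(links[i], stringsAsFactors = FALSE),
    error = function(e) NULL)  # return NULL if the page fails
  # only assign on success; a failed page leaves its slot as NULL
  if(!is.null(result)) all_tables[[i]] <- result
}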

The result looks like this (this is just a snippet of the output):

all_tables
$`neli-aasa`
$`neli-aasa`$defense
   Year School Conf Class Pos Solo Ast Tot Loss  Sk Int Yds Avg TD PD FR Yds TD FF
1 *2007   Utah  MWC    FR  DL    2   1   3  0.0 0.0   0   0      0  0  0   0  0  0
2 *2010   Utah  MWC    SR  DL    4   4   8  2.5 1.5   0   0      0  1  0   0  0  0

$`neli-aasa`$kick_ret
   Year School Conf Class Pos Ret Yds  Avg TD Ret Yds Avg TD
1 *2007   Utah  MWC    FR  DL   0   0       0   0   0      0
2 *2010   Utah  MWC    SR  DL   2  24 12.0  0   0   0      0

$`neli-aasa`$receiving
   Year School Conf Class Pos Rec Yds  Avg TD Att Yds Avg TD Plays Yds  Avg TD
1 *2007   Utah  MWC    FR  DL   1  41 41.0  0   0   0      0     1  41 41.0  0
2 *2010   Utah  MWC    SR  DL   0   0       0   0   0      0     0   0       0
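Since all_tables is a named list of lists, you can pull out any one player's data by name, e.g.:

names(all_tables[["neli-aasa"]])   # which tables this player has
all_tables[["neli-aasa"]]$defense  # just his defense table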

Finally, let's say we just want to look at the passing tables...

# just show passing tables
passing <- lapply(all_tables, function(i) i$passing)
# but lots of NULL in here, and not a convenient format, so...
passing <- do.call(rbind, passing)

And we end up with a data frame that is ready for further analyses (also just a snippet)...

             Year             School Conf Class Pos Cmp Att  Pct  Yds Y/A AY/A TD Int  Rate
james-aaron  1978          Air Force  Ind        QB  28  56 50.0  316 5.6  3.6  1   3  92.6
jeff-aaron.1 2000 Alabama-Birmingham CUSA    JR  QB 100 182 54.9 1135 6.2  6.0  5   3 113.1
jeff-aaron.2 2001 Alabama-Birmingham CUSA    SR  QB  77 148 52.0  828 5.6  4.3  4   6  99.8
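One caveat before analysis: because readHTMLTable was called with stringsAsFactors = FALSE, every column in this data frame is character. A minimal coercion sketch using base R's type.convert() (values with stray characters, like the asterisked years in some tables, will stay character):

# coerce whatever parses as a number; leave genuine text columns alone
passing[] <- lapply(passing, function(x) type.convert(x, as.is = TRUE))
str(passing)  # Cmp, Att, Yds, etc. should now be numeric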
