- Data often needs to be cleaned up and fixed – Joe Reed does not weigh 22 pounds
- Analysis of the (included) dataset will follow
After Frank Mason’s incredible performance yesterday (10 pts, 10 reb, 5 ast) in the championship game of the Orlando Classic, I got to thinking. Mason is listed at 5’11” and 185 lbs. Conventional basketball wisdom says there’s no way a kid that size could be out-rebounding teammates (and opponents) that are nearly a full foot taller. Before I could do any sort of analysis, though, I needed data. Time for some web scraping!
Finding a Data Source
Inexplicably, I don’t have a subscription to kenpom.com, so I don’t know if Ken makes this information available, but I couldn’t find any other resources online that would let me sort rebound data by player size. The closest I could find was the NCAA’s own stat page, but they limited the player pool significantly. I wanted to analyze rebounding data for all DI basketball players! This was going to require some web scraping.
I’ve gone over web scraping before, as well as how to use BeautifulSoup. No need to rehash that. Instead, I’ll explain how I was able to use BeautifulSoup to scrape information specific to Sports-Reference’s (SR) college basketball site. A real-world example, if you will.
The first task was to locate the URL that called the data I was looking for. This is exactly the purpose of their play index. I restricted the results to the current season and opted to sort them by total rebounds. (Note: SR offers a tiny URL service that allows a particular query to be easily shared, so here’s my query.)
The results of this query are exactly what I’m looking for! I could use this information immediately if it weren’t for two shortcomings:
- Height and/or weight aren’t included.
- Only 100 results are listed per page.
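The second shortcoming comes down to URL arithmetic: the query URL ends with an `offset=` parameter, so each page of 100 rows can be requested by appending a different starting row. A minimal sketch of that idea in Python 3 follows. The function name `page_urls` and the shortened `example.com` URL are placeholders for illustration, not part of SR's site.

```python
def page_urls(query_url, total_rows, page_size=100):
    """Build one URL per results page by appending a row offset."""
    return [query_url + str(offset) for offset in range(0, total_rows, page_size)]


# Hypothetical, shortened query URL; the real query string is much longer.
example = "http://example.com/psl_finder.cgi?order_by=trb&offset="

urls = page_urls(example, 4500)
print(len(urls))  # 45 pages of up to 100 rows each
print(urls[1])    # second page, starting at row 100
```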
Web Scraping Code
Cutting straight to the chase, below is the code I wrote to scrape what I was looking for. I commented the lines I thought needed explaining, but I’d recommend reading my previous post if anything is confusing.
from bs4 import BeautifulSoup
from urllib import urlopen
import re
import csv

# URL for the query results
urls = 'http://www.sports-reference.com/cbb/play-index/psl_finder.cgi?request=1&match=single&year_min=2015&year_max=2015&conf_id=&school_id=&class_is_fr=Y&class_is_so=Y&class_is_jr=Y&class_is_sr=Y&pos_is_g=Y&pos_is_gf=Y&pos_is_f=Y&pos_is_fg=Y&pos_is_fc=Y&pos_is_c=Y&pos_is_cf=Y&games_type=A&qual=&c1stat=&c1comp=gt&c1val=&c2stat=&c2comp=gt&c2val=&c3stat=&c3comp=gt&c3val=&c4stat=&c4comp=gt&c4val=&order_by=trb&order_by_asc=&offset='
# URL base for the individual player pages
url_base = 'http://www.sports-reference.com'
records = []

with open('rebounds.csv', 'wb') as csvfile:
    reboundscsv = csv.writer(csvfile, delimiter=',')
    # Offsets in increments of 100 are added to the end of the query results URL
    for i in range(0, 4500, 100):
        page = urlopen(urls + str(i))
        soup = BeautifulSoup(page)  # Create the BeautifulSoup object for the query results page
        # The SR source code identifies data rows with blank class names
        for row in soup.findAll('tr', {'class': ''}):
            # A regular expression is used to find only the rows that include a
            # rebound-rank number in the 'csk' HTML attribute
            if row.findAll(csk=re.compile(r"\d")):
                link = row.find("a")['href']  # Link to the player's individual page
                # Rebounds are highlighted with these results, so key off the
                # class attribute to identify the rebounds column
                rebounds = int(row.find('td', {'class': 'highlight_text'}).string)
                # The table row is broken up into an array, and the specific column is chosen
                school = row.contents[11].contents[0].string
                conf = row.contents[13].contents[0].string
                rank = row.contents[1].contents[0].string
                playerpage = urlopen(url_base + link)
                playersoup = BeautifulSoup(playerpage)  # Create the BeautifulSoup object for the player page
                # The player's name can be identified by being the only thing on the page within the h1 tag
                name = playersoup.find('h1').string
                # A handful of players are missing height and/or weight data, so this statement is necessary
                if playersoup.find(text='Height:'):
                    span = playersoup.find(text='Height:').parent.parent
                    # A stupid black box emoji thing is included in the height's span, so splitting is necessary
                    height = span.contents[1].split()[0]
                    weight = int(span.contents[3])
                else:
                    height = 'N/A'
                    weight = 'N/A'
                # Write to the csv as well as the tabulated list
                reboundscsv.writerow([rank, name, height, weight, school, rebounds, conf])
                records.append([rank, name, height, weight, school, rebounds, conf])
                print rank
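For any size-based comparison, height is easier to work with as a single number than as a string. Here is a minimal sketch in Python 3, assuming the scraped heights come out in a feet-inches form such as `5-11`. That format is an assumption worth checking against the CSV, and `height_to_inches` is a hypothetical helper, not part of the script above.

```python
def height_to_inches(height):
    """Convert a feet-inches string such as '5-11' to total inches.

    Returns None for missing or unparseable values such as 'N/A'.
    """
    feet, sep, inches = height.partition("-")
    if not sep or not feet.isdigit() or not inches.isdigit():
        return None
    return int(feet) * 12 + int(inches)


print(height_to_inches("5-11"))  # 71
print(height_to_inches("N/A"))   # None
```

Returning `None` instead of raising keeps the players with missing data in the dataset, so they can be filtered out explicitly later.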
Admittedly, the results still needed to be cleaned up, but welcome to the world of data analysis! Here are a few of the errors I found while looking through the script’s results.
- Not only does Jure Gunjima not weigh 0 pounds, his name is actually Jure Gunjina.
- Same with Travis Gibson–or should I say, Tavis Gibson.
- Multiple players at IPFW were missing weight data. I can only assume some trainer at the school was slow with getting this information posted and made public.
- Speaking of IPFW, Joe Reed does not weigh 22 pounds.
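Weight errors like these can be flagged automatically with a simple range check. Below is a minimal sketch in Python 3, assuming the column order written by the script above. The bounds of 120 and 400 pounds are arbitrary, the function name `suspicious_rows` is hypothetical, and the sample rows are made-up placeholders rather than real scraped data.

```python
import csv
import io

# Column order matches the rows written by the scraper.
COLUMNS = ["rank", "name", "height", "weight", "school", "rebounds", "conf"]

# Assumed plausibility bounds in pounds; arbitrary, adjust to taste.
MIN_WEIGHT, MAX_WEIGHT = 120, 400


def suspicious_rows(csv_text):
    """Return (name, weight) pairs whose weight is missing or implausible."""
    flagged = []
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS)
    for row in reader:
        weight = row["weight"]
        if not weight.isdigit() or not MIN_WEIGHT <= int(weight) <= MAX_WEIGHT:
            flagged.append((row["name"], weight))
    return flagged


# Made-up placeholder rows, not real scraped data.
sample = (
    "1,Player A,6-8,230,School A,95,Conf A\n"
    "2,Player B,6-5,22,School B,80,Conf B\n"
    "3,Player C,6-9,0,School C,70,Conf C\n"
    "4,Player D,6-1,N/A,School D,60,Conf D\n"
)

for name, weight in suspicious_rows(sample):
    print(name, weight)
```

A check like this catches the 0-pound, 22-pound, and missing-weight cases, but misspelled names still have to be found by eye or by comparing against another roster source.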
Despite these (and more!) reporting errors, SR is the best resource I have available to me for the moment. I’ll be posting an analysis of this dataset shortly, but if you’re impatient and want to dig into it yourselves, here’s a csv of 4447 DI college players and their corresponding rebound totals from the beginning of the season through November 30th, 2014.