Andrew Cross
Web scraping a college basketball rebounding dataset with BeautifulSoup
  1. Data often needs to be cleaned up and fixed – Joe Reed does not weigh 22 pounds
  2. Analysis of the (included) dataset will follow

After Frank Mason’s incredible performance yesterday (10 pts, 10 reb, 5 ast) in the championship game of the Orlando Classic, I got to thinking. Mason is listed at 5’11” and 185 lbs. Conventional basketball wisdom says there’s no way a kid that size could be out-rebounding teammates (and opponents) that are nearly a full foot taller. Before I could do any sort of analysis, though, I needed data. Time for some web scraping!

Finding a Data Source

Inexplicably, I don’t have a subscription to, so I don’t know if Ken makes this information available, but I couldn’t find any other resources online that would let me sort rebound data by player size. The closest I could find was the NCAA’s own stat page, but they limited the player pool significantly. I wanted to analyze rebounding data for all DI basketball players! This was going to require some web scraping.

I’ve gone over web scraping before, as well as how to use BeautifulSoup. No need to rehash that. Instead, I’ll explain how I was able to use BeautifulSoup to scrape information specific to Sports-Reference’s (SR) college basketball site. A real-world example, if you will.

The first task was to locate the URL that called the data I was looking for. This is exactly the purpose of their play index. I restricted the results to the current season and opted to sort them by total rebounds. (Note: SR offers a tiny URL service that allows a particular query to be easily shared, so here’s my query.)

2014-15 basketball reference query results sorted by descending total rebounds

The results of this query are exactly what I’m looking for! I could use this information immediately if it weren’t for two shortcomings:

  1. Height and/or weight aren’t included.
  2. Only 100 results are listed per page.
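
The second shortcoming is the easier one to work around: each results page can be requested by appending a row offset to the query URL. Here is a minimal sketch of that idea. The base URL and the `page_urls` helper are made up for illustration (the real query URL isn't reproduced here), and the 4,500-row ceiling is simply a value a bit above the number of players I expected.

```python
# Hypothetical base URL for illustration only; the real query URL differs.
QUERY_URL = "https://example.com/player-query?offset="

def page_urls(base, total_rows, page_size=100):
    """Return one URL per results page by appending a row offset to base."""
    return [base + str(offset) for offset in range(0, total_rows, page_size)]

urls_to_fetch = page_urls(QUERY_URL, 4500)
print(len(urls_to_fetch))   # 45 pages: offsets 0, 100, ..., 4400
print(urls_to_fetch[0])
print(urls_to_fetch[-1])
```

The first shortcoming (no height or weight) can't be fixed from the results page at all, which is why each player's individual page has to be visited as well.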

Web Scraping Code

Cutting straight to the chase, below is the code I wrote to scrape what I was looking for. I commented the lines I thought needed explaining, but I’d recommend reading my previous post if anything is confusing.

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import csv

#URL for the query results
urls = ''

#URL-base for the individual player page
url_base = ''
records = []

#newline='' keeps the csv module from writing blank lines between rows on Windows
with open('rebounds.csv', 'w', newline='', encoding='utf-8') as csvfile:

    reboundscsv = csv.writer(csvfile, delimiter=',')

    #A row offset, in increments of 100, is added to the end of the query results URL
    for i in range(0,4500,100):
        page = urlopen(urls + str(i))
        soup = BeautifulSoup(page, 'html.parser') #Create the BeautifulSoup object for the query results page
        #The SR source code identifies data rows with blank class names
        for row in soup.find_all('tr',{ 'class' : ''}):
            #A regular expression is used to find only the rows that include a
            #rebound-rank number in the 'csk' html attribute
            if row.find_all(csk=re.compile(r"\d")):
                link = row.find("a")['href'] #Link to the player's individual page
                rebounds = int(row.find('td',{'class':'highlight_text'}).string) #rebounds are highlighted with these results, so key off the class attribute to identify the rebounds column
                school = row.contents[11].contents[0].string #The table row is broken up into an array, and the specific column is chosen
                conf = row.contents[13].contents[0].string
                rank = row.contents[1].contents[0].string
                playerpage = urlopen(url_base + link)
                playersoup = BeautifulSoup(playerpage, 'html.parser') #Create the BeautifulSoup object for the player page
                name = playersoup.find('h1').string #The player's name can be identified by being the only thing on the page within the h1 tag
                #A handful of players are missing height and/or weight data, so this check is necessary
                height_label = playersoup.find(text='Height:')
                if height_label:
                    span = height_label.parent.parent
                    height = span.contents[1].split()[0] #A stupid black box emoji thing is included in the height's span, so splitting is necessary
                    weight = int(span.contents[3])
                else:
                    height = 'N/A'
                    weight = 'N/A'
                #Write to the csv as well as the tabulated list
                reboundscsv.writerow([rank, name, height, weight, school, rebounds, conf])
                records.append([rank, name, height, weight, school, rebounds, conf]) #A list keeps the column order (a set would not)
                print(rank)


Admittedly, the results still needed to be cleaned up, but welcome to the world of data analysis! Here are a few of the errors I found while looking through the script’s results.

  1. Not only does Jure Gunjima not weigh 0 pounds, his name is actually Jure Gunjina.
  2. Same with Travis Gibson–or should I say, Tavis Gibson.
  3. Multiple players at IPFW were missing weight data. I can only assume some trainer at the school was slow with getting this information posted and made public.
  4. Speaking of IPFW, Joe Reed does not weigh 22 pounds.
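
Errors like the missing and impossible weights are easy to flag automatically before any analysis. The sketch below is one way to do it, not the exact process I followed: it reads rows in the same column order the scraper writes and reports any weight that is missing or falls outside a plausible range. The 120 to 400 pound bounds, the `flag_suspect_weights` helper, and the sample rows are all assumptions made up for illustration. Misspelled names, on the other hand, still have to be caught by eye.

```python
import csv
import io

# Column order matches the rows written by the scraper.
COLUMNS = ["rank", "name", "height", "weight", "school", "rebounds", "conf"]

# Assumed plausibility bounds in pounds; adjust as needed.
MIN_WEIGHT, MAX_WEIGHT = 120, 400

def flag_suspect_weights(csv_text):
    """Return (name, weight) pairs whose weight is missing or implausible."""
    suspects = []
    for row in csv.DictReader(io.StringIO(csv_text), fieldnames=COLUMNS):
        weight = row["weight"]
        if not weight.isdigit() or not MIN_WEIGHT <= int(weight) <= MAX_WEIGHT:
            suspects.append((row["name"], weight))
    return suspects

# Made-up sample rows, not real player data.
sample = (
    "1,Player A,6-9,245,School X,60,Conf 1\n"
    "2,Player B,6-3,22,School Y,41,Conf 2\n"
    "3,Player C,6-1,N/A,School Y,38,Conf 2\n"
    "4,Player D,6-0,0,School Z,30,Conf 3\n"
)
print(flag_suspect_weights(sample))
```

With the made-up sample, Players B, C, and D are flagged (22 pounds, missing, and 0 pounds), while Player A passes.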

Despite these (and more!) reporting errors, SR is the best resource I have available to me for the moment. I’ll be posting an analysis of this dataset shortly, but if you’re impatient and want to dig into it yourselves, here’s a csv of 4447 DI college players and their corresponding rebound totals from the beginning of the season through November 30th, 2014.

2014 rebound numbers as of 11/30/14