To obtain raw data for 2 million public GitHub profiles, we used Ruby to pull from GitHub’s API at the maximum allowed rate and then used Unix tools and R to combine the data into one CSV.
In the last post we showed how we discovered consistent differences between active and inactive GitHub profiles. But how did we get the data from those profiles in the first place? In this post we take a step back and show you the big data tools and techniques we used to obtain the raw data from over 2 million GitHub profiles.
Data retrieval is often the most time-intensive part of data analysis. Gathering extremely large data sets and putting them into a format conducive to analysis requires flexibility of approach as well as good, old-fashioned patience. In this case we were ‘scraping’ the data, or harvesting it from a third-party entity. While GitHub does provide a convenient way to download GitHub events through their archive, the archive does not provide data about public profiles. We had to turn to the GitHub API and figure out which tools to use to extract the profile data.
Here at EnPlus we do not stick to one programming language, and we would never describe ourselves as primarily being an R, Python, Java, or Perl shop. We believe that different tools have different strengths, and we always keep an open mind about which ones would work best in any given situation.
For this data retrieval, we decided to use Ruby. The popularity of Ruby on Rails and the use of test-driven development by many Rails programmers mean that there are many excellent Ruby libraries available for web programming. We’ve provided the details below.
As you may have noticed, each GitHub profile has 15 different fields, including name, company, location, etc. We hit a minor obstacle right away with the usage limit of the GitHub API, which only allows for 5000 requests per hour. If you would like to try a back-of-the-envelope calculation on that one, here goes: we were extracting fifteen different variables from around 2 million profiles and we were limited to 5000 requests per hour. If you’d like to skip the math, we’ll just tell you how long it took: a month. Yes, you’ve read that correctly. It took us a month to collect all the data from 2 million GitHub public profiles. We also had to start the program several times (ten, to be exact), so we ended up with ten tab-separated .log files containing different batches of profile information. Remember what we said about patience?
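For readers who do want the math, here is a minimal sketch of the calculation in Python (our actual retrieval code was in Ruby). It assumes one API request per profile, which is our simplification for illustration; the function names are hypothetical. The result is a lower bound for a program that saturates the limit without interruption, so restarts and downtime only add to it.

```python
REQUESTS_PER_HOUR = 5000  # hourly API limit quoted in this post


def min_interval_seconds(limit_per_hour: int = REQUESTS_PER_HOUR) -> float:
    """Smallest even spacing between requests that stays within the hourly limit."""
    return 3600.0 / limit_per_hour


def estimate_hours(profiles: int, requests_per_profile: int = 1,
                   limit_per_hour: int = REQUESTS_PER_HOUR) -> float:
    """Lower bound on wall-clock hours, assuming the limit is saturated nonstop.

    requests_per_profile=1 is an assumption made for this sketch.
    """
    return profiles * requests_per_profile / limit_per_hour


hours = estimate_hours(2_000_000)
print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
print(f"one request every {min_interval_seconds():.2f} s")
```

Under these assumptions the floor is 400 hours, or roughly 17 days of uninterrupted running, with one request every 0.72 seconds.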
And we still weren’t done. Simply having the data isn’t enough; it needs to be arranged so that it can be easily parsed and analyzed. For this, we used a combination of Unix command-line tools and basic R code to gather the ten tab-separated files into a single dataset. We then added column names, removed all the duplicates, and saved that one big dataset to a .csv, or Comma Separated Values file. Again, the details of how we did this are shown below.
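To make the merge step concrete, here is a rough Python sketch of the same idea: read several headerless tab-separated files, drop exact duplicate rows, and write one CSV with a header row. This is a stand-in for illustration, not the Unix and R commands we actually ran. The column names and the function name `merge_logs` are hypothetical, and the sketch assumes the fields contain no embedded tabs or newlines.

```python
import csv

# Hypothetical column names; a real profile has 15 fields.
COLUMNS = ["login", "name", "company", "location"]


def merge_logs(log_paths, out_path, columns=COLUMNS):
    """Merge headerless tab-separated files into one deduplicated CSV.

    Writes a header row first, keeps the first occurrence of each row,
    and returns the number of unique rows written.
    """
    seen = set()
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(columns)
        for path in log_paths:
            with open(path, newline="", encoding="utf-8") as src:
                # QUOTE_NONE: treat quote characters in the logs as plain text.
                reader = csv.reader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
                for row in reader:
                    key = tuple(row)
                    if not row or key in seen:
                        continue  # skip blank lines and exact duplicates
                    seen.add(key)
                    writer.writerow(row)
    return len(seen)
```

Calling `merge_logs(["batch1.log", "batch2.log"], "profiles.csv")` (file names made up here) would produce a single CSV. Holding every row in a set is fine for a couple of million short records, but a larger dataset might call for sorting on disk instead.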
This month-long process, which was only the first step in our GitHub profile analysis, is a good reminder that they call it Big Data for a reason. When we are dealing with such a huge amount of information, the choice of tools and the resolve to see the job through are crucial. Having a familiarity with multiple programming languages and options is a valuable asset. Stay with us for the next post, Geocoding GitHub Profiles, where we explore how to visualize the location data we’ve extracted so we can learn where the greatest concentrations of GitHub users are based.