This is the second post in a series of data science tutorials involving GitHub profile data.

Summary

To obtain raw data for 2 million public GitHub profiles, we used Ruby to pull from GitHub’s API at the maximum allowed rate, then used Unix tools and R to combine the results into a single CSV file.

Gathering data is a key step

Data scraping or retrieval can be the most laborious part of the data science workflow. In this post we review some tools that make this job easier.

In the previous post, we attempted to identify GitHub spammers by comparing the character frequency distributions of their usernames. Today, we focus on an earlier step in the data science workflow: gathering the raw data. As any aspiring data scientist will quickly learn, data retrieval and management is often the most time-intensive part of an analysis.

GitHub provides a convenient way to download GitHub events through the GitHub Archive project. Unfortunately, the GitHub Archive does not include data about public profiles, so we used the GitHub API to retrieve it.
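
Before running a full scraper, it helps to see what a single profile record looks like. Here is a minimal sketch that fetches one public profile from the /users/:username endpoint ("octocat" is just a sample username, and unauthenticated requests are limited to 60 per hour, so this is only for exploration):

require 'net/http'
require 'json'
require 'uri'

# Fetch one public profile and print a few of the fields we care about.
uri = URI.parse('https://api.github.com/users/octocat')
profile = JSON.parse(Net::HTTP.get(uri))

%w(login id name company location followers created_at).each do |field|
  puts "#{field}: #{profile[field]}"
end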

We used Ruby to relentlessly pull data over a whole month

Here at Enplus, we don’t consider ourselves an R, Python, Java, or Perl shop. We use whatever tool we believe works best for the job. The popularity of Ruby on Rails and the use of test-driven development (TDD) by many Rails programmers have led to the creation of many excellent Ruby libraries for web programming. We used Ruby to extract the data from the GitHub API.

require 'uri'
require 'net/http'
require 'net/https'
require 'json'

# GitHub credentials used for authenticated requests (replace with your own)
USERNAME = 'your_github_username'
PASSWORD = 'your_password'

@since = 1
@users_json = []
@log = File.new("logs/github_profiles-#{Time.now.to_s}.log", 'w+')

# Fetch one page of user logins starting after `offset`, then request each
# user's full profile and append it as a tab-separated line to the log file.
def parse_github_profiles(offset, username, password)
  github_api = 'https://api.github.com'
  uri = URI.parse github_api
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true

  fields = %w(login id html_url name company blog location email hireable 
              bio public_repos followers following created_at public_gists)
  # list users with an id greater than `offset` (the API caps per_page at 100)
  req_users = Net::HTTP::Get.new("/users?since=#{offset}&per_page=100")
  req_users.basic_auth username, password # authenticated requests have a much higher rate limit
  users_body = http.request(req_users).body

  @users_json = JSON.parse(users_body)

  @users_json.each do |user_json|
    str = ""
    req_info = Net::HTTP::Get.new("/users/#{user_json['login'].to_s}")
    req_info.basic_auth username, password
    user_info = JSON.parse(http.request(req_info).body)
    puts "#{user_json["login"]}\t#{user_json['id']}"
    @since = user_json['id']
    fields.each do |field|
      str << "\"#{user_info[field].to_s.gsub('"','\'')}\""
      str << "\t"
    end
    @log.puts str.chomp("\t")
  end 
end

# Keep paging through users until we pass user id 2,300,000 (roughly 2 million profiles)
while @since < 2300000 do
  begin
    parse_github_profiles(@since, USERNAME, PASSWORD)
  rescue => error
    puts "Something bad happened, paused for 1 hour. Current time: #{Time.now}"
    puts error.message
    sleep(45 * 60) # sleep 45 minutes we likely hit the API usage limit rate
    next
  end
end
  
@log.close

One difficulty of extracting profiles through the GitHub API was the rate limit: authenticated requests are capped at 5,000 per hour. At that pace, the 2 million per-user requests alone amount to roughly 400 hours (about 17 days) of continuous pulling, and with pauses and restarts it took about a month for this script to finish.
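
Rather than waiting for a request to fail, you can also ask the API directly how much quota is left. Here is a hedged sketch using the /rate_limit endpoint (the response field names below follow GitHub’s documentation, and calls to this endpoint do not count against the quota; the credentials are placeholders):

require 'net/http'
require 'json'
require 'uri'

# Return the number of authenticated requests left and the time the quota resets.
def remaining_requests(username, password)
  uri = URI.parse('https://api.github.com/rate_limit')
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true

  req = Net::HTTP::Get.new(uri.path)
  req.basic_auth username, password
  core = JSON.parse(http.request(req).body)['resources']['core']

  [core['remaining'], Time.at(core['reset'])]
end

# Example: pause until the quota resets instead of sleeping a fixed 45 minutes.
remaining, resets_at = remaining_requests('your_github_username', 'your_password')
sleep([resets_at - Time.now, 0].max) if remaining < 10

In our script we kept the simpler sleep-and-retry loop, but a quota check like this takes the guesswork out of how long to pause.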

We restarted this script several times, so by the end we had around 10 tab-separated .log files, each containing a batch of profiles.

We combined the data files into one convenient CSV

With a combination of Unix command-line tools and some basic R scripting, we gathered them into a single dataset, added column names, removed rows with duplicate GitHub IDs, and saved the result to a .csv file:

# concatenate the batch logs written by the Ruby script into one file
system('cat logs/*.log > logs.txt')
logs <- read.delim("logs.txt", header=FALSE)

colnames(logs) <- c("login", "id", "html_url", "name", "company", "blog",
                    "location", "email", "hireable", "bio", "public_repos",
                    "followers", "following", "created_at", "public_gists")
logs <- logs[!(duplicated(logs$id)), ]  # drop profiles pulled twice across script restarts

write.csv(logs, "github_profiles.csv", row.names=FALSE)

Data collection is an often unavoidable and time-consuming part of any analytics project (the GitHub profiles took a month to gather!). Nonetheless, using the right tools goes a long way toward making the process quicker and more reliable.

In the next post, we’ll explore how to clean and visualize the location data we’ve extracted so we can learn where the greatest concentrations of GitHub contributors spend their days.