This is the third post in a series of data science tutorials using GitHub profile data. In the previous post, we extracted data from the Github API using Ruby and cleaned it with Unix utilities and R; you can read it here: parsing github profiles.

Summary

By running Wikipedia queries on the location names from Github profiles, we created a dataset of latitude and longitude coordinates for Github users worldwide.

Our Ruby script used Wikipedia’s search capability to resolve location names

In this post, we show how a data scientist can add value to an existing dataset by combining it with other freely available sources. In particular, we will map free-form text data to geographic locations so we can find the greatest concentrations of Github users. Let the Github/Wikipedia Geocoding Mashup begin!

The data we previously extracted from the Github API contains a location field. It is free-form text – users can enter anything they want. Consequently, people refer to the same place by multiple names. For example, New York, New York City, and The Big Apple could all refer to the same entity. This disambiguation problem is sometimes referred to as “Record Linkage” in the academic literature.
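
To make the record-linkage problem concrete, the naive solution would be a hand-maintained alias table, sketched below. The entries and the canonical_location helper are hypothetical, purely for illustration; every new spelling would need a new entry, which is why this approach does not scale.

# Hypothetical alias table: every spelling a user might type has to be
# mapped to a canonical place name by hand.
ALIASES = {
  "new york"      => "New York City",
  "new york city" => "New York City",
  "nyc"           => "New York City",
  "the big apple" => "New York City"
}

def canonical_location(raw)
  ALIASES[raw.to_s.strip.downcase]   # nil for any spelling we have never seen
end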

Instead of building an internal mapping of locations to entities, we decided to use the search functionality of Wikipedia to map locations to geographic coordinates. For profiles with non-empty locations, we submitted a search query to Wikipedia, visited the highest-ranked result, and extracted decimal latitude and longitude coordinates from the page. Here is the Ruby script we used:

require 'nokogiri'
require 'net/http'
require 'csv'
# Successful and failed lookups go to separate timestamped log files.
@log = File.new("logs/geo-#{Time.now.to_s}.log", 'w+')
@wrong = File.new("logs/unparsed_geo-#{Time.now.to_s}.log", 'w+')

def get_coordinates(location_string)
    # Run the free-form location text through Wikipedia's full-text search.
    search_url = "http://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=#{URI.escape(location_string)}&fulltext=Search"

    uri = URI.parse search_url
    http = Net::HTTP.new(uri.host, uri.port)
    req = Net::HTTP::Get.new uri
    r = http.request(req)

    # Follow the highest-ranked search result...
    search_results = Nokogiri::HTML(r.body)
    path = search_results.css(".mw-search-result-heading a").first[:href]

    r_new = Net::HTTP::Get.new path
    wiki_page = http.request(r_new)

    # ...and pull the decimal coordinates out of the page's .geo-dec element.
    page = Nokogiri::HTML(wiki_page.body)
    raw_coords = page.css(".geo-dec").first.text
    convert_coordinates(raw_coords)
end

def convert_coordinates(str)
    # Wikipedia's decimal style puts latitude (N/S) first, then longitude (E/W),
    # e.g. "40.714°N 74.006°W"; to_f ignores the trailing degree sign.
    raw = str.scan(/(.*)??(N|S) (.*)??(E|W)/)

    lat = raw[0][0].to_f * (raw[0][1] == 'N' ? 1 : -1)
    long = raw[0][2].to_f * (raw[0][3] == 'E' ? 1 : -1)

    return lat, long
end

CSV.foreach("logs/geo_tagged_profiles.csv") do |row|
    raw_location = row[6]    # free-form location field from the profile
    puts raw_location
    str = row.to_csv.chomp
    begin
        lat, long = get_coordinates(raw_location)
        str << ",#{lat},#{long}"
        @log.puts str
    rescue => error
        puts "could not geocode '#{raw_location}'"
        str << ",,"
        @log.puts str
        @wrong.puts str
        next
    end
end
@log.close
@wrong.close
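
As a quick sanity check on the parsing step, here is what convert_coordinates does with a string in Wikipedia's decimal style. The sample value below is illustrative, not taken from our dataset:

# Illustrative input mimicking a Wikipedia .geo-dec element.
sample = "40.714°N 74.006°W"             # roughly New York City
lat, long = convert_coordinates(sample)
puts "lat=#{lat}, long=#{long}"          # => lat=40.714, long=-74.006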

We successfully geocoded most profiles

Of the approximately 500,000 Github profiles with some kind of location data, we were able to geocode approximately 320,000 using this technique. We did some manual mapping on the locations that returned no Wikipedia results. For example, many locations in Brazil and near Portland, Oregon (secret Microsoft coding labs?) did not return Wiki pages. Similarly, “The Internet”, “Earth”, and /dev/null could not be geographically located.
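
The manual fixes amounted to a small lookup table applied to the unparsed log. Below is a hypothetical sketch of that patch step; the file name, location strings, and coordinate values are assumptions for illustration rather than entries from our actual data:

require 'csv'

# Hand-entered (latitude, longitude) pairs for locations Wikipedia could not
# resolve; the entries here are made up for illustration.
MANUAL_COORDS = {
  "Curitiba - PR" => [-25.43, -49.27],
  "Beaverton, OR" => [45.49, -122.80]
}

CSV.foreach("logs/unparsed_geo.log") do |row|
  lat, long = MANUAL_COORDS[row[6]]   # location field is column 7
  next unless lat
  puts [row[6], lat, long].to_csv     # emit a small correction table
end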

In the next post, we'll show how to visualize this location data (make maps) in R using ggplot2.