6 MIN READ - 1/3/2017
By running Wikipedia queries on location names from GitHub profiles, we created a dataset of latitudes and longitudes for GitHub users around the world.
In this post, we demonstrate how to add value to an existing dataset by combining it with other freely available sources. Specifically, we resolved free-form text data (more on this below) to geographic coordinates so that we could map the greatest concentrations of GitHub users.
Recall that one of the fields we extracted from GitHub profiles via the API was location. The location field allows free-form text, meaning that users do not have to choose from a list of acceptable place names and can enter anything they want. As a result, multiple names are often used for the same place; for example, New York, New York City, and the Big Apple could all refer to the same place. This disambiguation problem is sometimes referred to as “Record Linkage” in the academic literature.
Instead of building an internal mapping of locations to entities, we took a shortcut: we used Wikipedia's search functionality to map location strings to geographical coordinates. Wikipedia may be a dubious resource for historical or scientific research, but it is one of the best places to find non-scientific items such as nicknames for places, and its geographic coordinate information is generally accurate. For every profile with a non-empty location field, we submitted a search query to Wikipedia, visited the highest-ranked result, and extracted decimal latitude and longitude coordinates from that page. Here is the Ruby script we used:
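The original script is not reproduced in this text. What follows is a minimal sketch of the same idea, not the authors' code. It swaps page scraping for the MediaWiki search API and assumes that the `prop=coordinates` query is available on English Wikipedia and returns `lat`/`lon` fields for the top search hit. The names `wiki_search_uri`, `extract_coordinates`, and `geocode` are hypothetical.

```ruby
require "json"
require "net/http"
require "uri"

# Assumed endpoint: the public MediaWiki API for English Wikipedia.
WIKI_API = "https://en.wikipedia.org/w/api.php"

# Build a search URL that asks for the top-ranked page and its coordinates.
def wiki_search_uri(location)
  uri = URI(WIKI_API)
  uri.query = URI.encode_www_form(
    action: "query",
    generator: "search",   # run a full-text search...
    gsrsearch: location,   # ...for the free-form location string
    gsrlimit: 1,           # keep only the highest-ranked result
    prop: "coordinates",   # assumption: coordinates are exposed via this prop
    format: "json"
  )
  uri
end

# Return [lat, lon] from a response body, or nil when the search found
# no page or the page carries no coordinates.
def extract_coordinates(json_body)
  pages = JSON.parse(json_body).dig("query", "pages")
  return nil unless pages.is_a?(Hash)

  page = pages.values.min_by { |p| p["index"] || 0 }
  coord = page && Array(page["coordinates"]).first
  coord && [coord["lat"], coord["lon"]]
end

# Geocode one location string; blank input is skipped without a request.
def geocode(location)
  return nil if location.nil? || location.strip.empty?

  response = Net::HTTP.get_response(wiki_search_uri(location.strip))
  return nil unless response.is_a?(Net::HTTPSuccess)

  extract_coordinates(response.body)
end
```

Calling `geocode("the Big Apple")` would issue one search request and return a latitude/longitude pair if the top hit has coordinates, or `nil` otherwise. For a batch of several hundred thousand profiles, it would also make sense to cache results per distinct location string and to pause between requests.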
Of the approximately 500,000 GitHub profiles containing some kind of location data, we were able to geocode approximately 320,000 (about 64%) using this technique. We then manually mapped some of the locations that returned no Wikipedia results. For example, many locations in Brazil and near Portland, Oregon (a top secret Microsoft coding lab?) did not return Wikipedia pages. Similarly, certain snarky responses to the question of location (“the Internet”, “Earth”, “/dev/null”, etc.) could not be located.
In the next post, we’ll show how to visualize this location data (i.e., make maps) in R using ggplot2.
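The manual fallback step can be sketched as a small lookup layer around whatever geocoder is in use. This is an illustration under stated assumptions, not the mapping we actually built: `MANUAL_OVERRIDES`, `UNLOCATABLE`, `normalize`, and `resolve` are hypothetical names, the single override entry is an approximate example value, and skipping joke locations before the lookup is a design choice of the sketch.

```ruby
# Hypothetical hand-maintained overrides for strings the search could not
# resolve. The entry is illustrative, not taken from the original dataset.
MANUAL_OVERRIDES = {
  "portland, or" => [45.52, -122.68] # approximate coordinates
}.freeze

# Responses that do not name a real place; skipped without a lookup.
UNLOCATABLE = ["the internet", "earth", "/dev/null"].freeze

# Collapse case and whitespace so trivially different spellings share a key.
def normalize(location)
  location.to_s.strip.downcase.gsub(/\s+/, " ")
end

# Try the automatic lookup first, then fall back to the manual table.
# `lookup` is any callable that takes a string and returns [lat, lon] or nil.
def resolve(location, lookup)
  key = normalize(location)
  return nil if key.empty? || UNLOCATABLE.include?(key)

  lookup.call(location) || MANUAL_OVERRIDES[key]
end
```

Passing the geocoder in as a callable keeps the fallback logic independent of how coordinates are fetched, which also makes it easy to exercise with a stub such as `->(_loc) { nil }`.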