Using our dataset of GitHub user longitudes and latitudes from Part 3, we visualize and analyze the distribution of user locations in R with the maps and ggplot2 packages.
In this post, we map GitHub profiles across the USA. Our first step was to generate a map of the USA using the maps package. We then added a layer of points, with longitude as x and latitude as y, using the points function in R, with a point size of 0.1 and an alpha of 0.75. At this point size, the points are visible and overplotting does not seem to be an issue.
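A minimal sketch of this first map, assuming a data frame named users with lon and lat columns. The name, the column names, the color and the four toy rows are all hypothetical stand-ins for the Part 3 dataset:

```r
library(maps)

# Hypothetical stand-in for the Part 3 dataset: one row per user.
users <- data.frame(
  lon = c(-122.42, -122.42, -74.01, -87.63),
  lat = c(37.77, 37.77, 40.71, 41.88)
)

# Draw the outline of the contiguous USA.
map("usa")

# Overlay one point per user. points() has no alpha argument,
# so the transparency is applied to the color instead.
points(users$lon, users$lat, pch = 19, cex = 0.1,
       col = adjustcolor("red", alpha.f = 0.75))
```

Because base graphics takes transparency through the color, adjustcolor is one way to get an alpha of 0.75; rgb with an alpha value would work as well.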
As with most analyses, this result is only a starting point; from here we can modify the map to provide more information. We could, for example, add state boundaries and mark major cities.
The mapdata package, which supplements the maps package, can help us achieve some of these goals.
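As a rough sketch of adding state boundaries and city labels, the block below uses only the state database and the us.cities data that ship with maps. The users data frame, the color and the minpop threshold of 500,000 are assumptions made for illustration:

```r
library(maps)

# Hypothetical stand-in for the Part 3 dataset: one row per user.
users <- data.frame(
  lon = c(-122.42, -122.42, -74.01, -87.63),
  lat = c(37.77, 37.77, 40.71, 41.88)
)

# Draw the contiguous USA with state boundaries.
map("state")

# Overlay the user locations.
points(users$lon, users$lat, pch = 19, cex = 0.1,
       col = adjustcolor("red", alpha.f = 0.75))

# Mark and label cities above an assumed population threshold.
data(us.cities)
map.cities(us.cities, minpop = 500000, label = TRUE)
```

Raising or lowering minpop changes how many labels are drawn, which is the simplest lever for controlling label clutter.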
Adding state boundaries and cities brings a little more order to our map, and we can now confirm what you may have already suspected: GitHub users are clustered in major cities, particularly the San Francisco Bay Area, Los Angeles, Chicago, the Mid-Atlantic region and New England. The largest clusters are all found in densely populated urban centers.
There is still an issue with our map, however: many of the city labels are overplotted. The Mid-Atlantic, San Francisco and Los Angeles regions, for example, look like big blots of ink (or, if you prefer, pixels) and do not give us much in the way of specific information; they simply tell us that there is a high concentration of GitHub users in these areas. We can fix this in a couple of ways.
First, we’ll change the dots to circles whose size reflects the number of GitHubbers at each location. To do that, we’ll need to aggregate the geographical coordinates and count how often each pair occurs. We do this with the data.table R package:
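A minimal sketch of this aggregation, assuming a table named users with lon and lat columns (the names and the toy rows are hypothetical):

```r
library(data.table)

# Hypothetical stand-in for the Part 3 dataset: one row per user.
users <- data.table(
  lon = c(-122.42, -122.42, -74.01, -87.63),
  lat = c(37.77, 37.77, 40.71, 41.88)
)

# Count the users at each distinct coordinate pair.
# .N is data.table's built-in group size.
freq <- users[, .(count = .N), by = .(lon, lat)]
print(freq)
```

The result has one row per distinct location, with the count column available for sizing the circles.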
Next we’ll redo the maps that we made, only this time instead of the maps package, we’ll use the more modern and versatile ggplot2. The maps package is based on the original graphics package (called, unoriginally, graphics) included with R. The graphics package is fine for simple visualizations, but it does not suit us here because it does not support layers. Whatever we add to the original graphics-produced map will appear above our earlier additions, creating a cluttered and confusing mess.
ggplot2, on the other hand, is built on top of the grid package. grid is a newer package than graphics and allows for a more object-oriented approach to constructing data visualizations.
ggplot2 further builds upon grid by providing a high-level interface for mapping data to visual properties. The chief strength of ggplot2 is that we can customize our graph by adding layers. In our code, this appears as a series of additions to the plot object p, each one contributing a new layer.
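The layering idea can be sketched as follows. This is an illustrative reconstruction rather than the exact code behind the maps described here; the freq data frame, its column names and the styling choices are assumptions:

```r
library(ggplot2)

# Hypothetical per-location counts, as the aggregation step would produce.
freq <- data.frame(
  lon = c(-122.42, -74.01, -87.63),
  lat = c(37.77, 40.71, 41.88),
  count = c(2, 1, 1)
)

# map_data() turns a maps database into a data frame of polygon
# vertices; it needs the maps package to be installed.
states <- map_data("state")

# Start with an empty plot and add one layer at a time.
p <- ggplot()
p <- p + geom_polygon(data = states,
                      aes(x = long, y = lat, group = group),
                      fill = "white", colour = "grey50")
p <- p + geom_point(data = freq,
                    aes(x = lon, y = lat, size = count),
                    shape = 1, alpha = 0.75)
p <- p + coord_quickmap()  # approximate map aspect ratio
print(p)
```

Each addition returns a new plot object, so any layer can be changed or dropped without redrawing the others by hand.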
If you’d like to delve deeper, check out the excellent documentation at http://docs.ggplot2.org
So far, all we’ve done is recreate our maps-based plots with ggplot2, and the result doesn’t look much different from the maps package version. Next we’ll look into how we can use other ggplot2 features to better visualize our data so that the map gives us a little more information.
One thing we can do is adjust the circle sizes so that they tell us more about the concentration of users in a particular area and allow us to see their overlap, which will take care of the overplotting issue.
To do this, we first selected locations with more than 9 GitHubbers (first line). We then adjusted the circle sizes by adding the scale_size_continuous function to the ggplot object, with the following arguments:
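A sketch of both steps follows. The name and range arguments shown here are assumptions chosen for illustration, not necessarily the ones used for the map described in this post, and the counts in freq are made up:

```r
library(ggplot2)

# Hypothetical per-location counts; the values are illustrative only.
freq <- data.frame(
  lon = c(-122.42, -74.01, -87.63),
  lat = c(37.77, 40.71, 41.88),
  count = c(40, 25, 5)
)

# Keep only locations with more than 9 users.
big <- freq[freq$count > 9, ]

# State outlines as a data frame (requires the maps package).
states <- map_data("state")

p <- ggplot() +
  geom_polygon(data = states,
               aes(x = long, y = lat, group = group),
               fill = "white", colour = "grey50") +
  geom_point(data = big,
             aes(x = lon, y = lat, size = count),
             shape = 1) +
  # range sets the smallest and largest circle sizes (assumed values).
  scale_size_continuous(name = "Users", range = c(2, 12)) +
  coord_quickmap()
print(p)
```

Open circles (shape = 1) keep overlapping bubbles distinguishable, which is what lets dense clusters stay readable.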
We now have a map that clearly shows us areas with a significant amount of urban sprawl (Los Angeles and Washington DC, for example). Contrast these with Boston and New York, where we find a concentration of overlapping bubbles in a much smaller area. We now have a pretty good idea of where most of the GitHub users in the USA are based as well as roughly how many of them are in each of these areas.