Part IV - Mapping the USA

12 MIN READ - 1/4/2017

Using our dataset of GitHub user longitudes and latitudes from Part 3, we visualize and analyze the distribution of user locations in R with the maps and ggplot2 packages.

In this post, we map GitHub profiles across the USA. Our first step was to generate a map of the USA using the maps package. We then added a layer of points, with longitude as x and latitude as y, using the points function in R, with a point size of 0.1 and an alpha of 0.75. At this point size, individual points are visible and overplotting does not seem to be an issue.
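The two steps described above might look something like the following. This is only a sketch, not the original code: the `users` data frame and its `lon`/`lat` column names are hypothetical stand-ins for the Part 3 dataset, and the five coordinates are made-up sample points near large cities.

```r
library(maps)

# Hypothetical stand-in for the Part 3 dataset: one row per user,
# longitude in `lon` and latitude in `lat`.
users <- data.frame(
  lon = c(-122.42, -118.24, -87.63, -74.01, -71.06),
  lat = c(37.77, 34.05, 41.88, 40.71, 42.36)
)

# Draw the outline of the contiguous USA.
map("usa")

# Base graphics points() has no alpha argument, so the transparency
# is folded into the colour with adjustcolor().
point_col <- adjustcolor("black", alpha.f = 0.75)

# Overlay one small, filled point per user.
points(users$lon, users$lat, pch = 16, cex = 0.1, col = point_col)
```

With only a handful of sample rows the points are barely visible at `cex = 0.1`; that size is meant for a dataset with many thousands of rows.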

Simple Map of USA

As with most analyses, this result is only a starting point; from here we can modify the map to provide more information. We could, for example:

  • add state and national borders
  • add labels for large cities
  • adjust plot margins and add a title

The mapdata package, which supplements the maps package, can help us to achieve some of these goals.
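As a rough illustration of those three improvements, the sketch below uses the `state` database and the `us.cities` dataset, both of which ship with the maps package itself; the higher-resolution databases in mapdata could be swapped in for the outlines. The population cutoff and the title text are arbitrary choices made for this example.

```r
library(maps)

# Load the city table bundled with maps (name, state, population, lat, long).
data(us.cities)

# State borders first, then the national outline on top.
map("state", col = "grey60")
map("usa", add = TRUE)

# Label only large cities; the 500,000 cutoff is an arbitrary example value.
map.cities(us.cities, minpop = 500000, label = TRUE, cex = 0.7)

# Add a title above the map.
title("GitHub users in the USA")
```

Raising `minpop` is the simplest way to thin out the labels if they start to collide.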

Simple Map of USA

GitHub users are clustered in major cities

Adding state boundaries and cities brings a little more order to our map, and we can now confirm what you may have already suspected: GitHub users are clustered in major cities, particularly the San Francisco Bay Area, Los Angeles, Chicago, the Mid-Atlantic region and New England. The largest clusters are all found in densely populated urban centers.

There is still an issue with our map, however: many of the city labels are overplotted. The Mid-Atlantic, San Francisco and Los Angeles regions, for example, look like big blots of ink (or, if you prefer, pixels) and do not give us much in the way of specific information; they simply tell us that there is a high concentration of GitHub users in these areas. We can fix this in a couple of ways.

First, we’ll change the dots to circles whose size corresponds to the number of GitHub users at each location. To do that, we’ll need to group the geographical coordinates and count how often each longitude/latitude pair occurs. We do this with the data.table R package.
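The counting step could be sketched as follows. The `users` table, its column names and its sample rows are hypothetical; only the grouping idiom is the point here.

```r
library(data.table)

# Hypothetical sample: several users share the same coordinates.
users <- data.table(
  lon = c(-122.42, -122.42, -122.42, -74.01, -74.01, -87.63),
  lat = c(37.77, 37.77, 37.77, 40.71, 40.71, 41.88)
)

# .N is the number of rows in each (lon, lat) group,
# i.e. the number of users at that location.
counts <- users[, .(count = .N), by = .(lon, lat)]

# Largest locations first.
counts <- counts[order(-count)]
print(counts)
```

The result has one row per distinct location, which is the shape needed to map circle size to `count` later on.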

Modern R maps with ggplot2

Next we’ll redo the maps that we made, only this time we’ll replace the maps package with the more modern and versatile ggplot2. The maps package is based on the original graphics package (called, unoriginally, graphics) included with R. The graphics package is fine for simple visualizations, but it does not suit us here because it does not support layers: whatever we add to a graphics-produced map is drawn directly on top of our earlier additions, creating a cluttered and confusing mess.

ggplot2, on the other hand, is built on top of the grid package. grid is newer than graphics and allows for a more object-oriented approach to constructing data visualizations. ggplot2 builds on grid by providing a high-level interface for mapping data to visual properties. The chief strength of ggplot2 is that we can customize our graph by adding layers. In our code, this shows up as repeated additions of new layers to the plot object p.
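A minimal sketch of that layering pattern is shown below. It is an illustration rather than the original code: the `users` data frame is a hypothetical sample, the colours and title are arbitrary, and `map_data()` needs the maps package to be installed.

```r
library(ggplot2)

# State outlines as a data frame (map_data() reads them from the maps package).
states <- map_data("state")

# Hypothetical sample of user coordinates.
users <- data.frame(
  lon = c(-122.42, -118.24, -87.63, -74.01, -71.06),
  lat = c(37.77, 34.05, 41.88, 40.71, 42.36)
)

# Start with an empty plot and add one layer at a time.
p <- ggplot()
p <- p + geom_polygon(data = states,
                      aes(x = long, y = lat, group = group),
                      fill = "white", colour = "grey60")
p <- p + geom_point(data = users, aes(x = lon, y = lat),
                    size = 0.1, alpha = 0.75)

# Non-layer additions: an approximate map projection, a bare theme, a title.
p <- p + coord_quickmap()
p <- p + theme_void() + ggtitle("GitHub users in the USA")
print(p)
```

Because `p` is an ordinary object, any of these steps can be changed or removed without redrawing everything by hand.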

ggplot2 map of USA with dots

If you’d like to delve deeper, check out the excellent documentation at http://docs.ggplot2.org

ggplot2 map of USA with custom theme

So far, all we’ve done is recreate our maps-based plots with ggplot2, and the result doesn’t look much different from the maps package version. Next we’ll look into how we can use other ggplot2 features to better visualize our data so that the map gives us a little more information.

GitHub locations show urban sprawl

One thing we can do is adjust the circle sizes so that they tell us more about the concentration of users in a particular area and allow us to see their overlap, which will take care of the overplotting issue.

ggplot2 map of USA with custom theme and density circles

To do this, we first selected locations with more than 9 GitHub users (the first line of the code). We then adjusted the circle sizes by adding a call to the scale_size_continuous function to the ggplot object, with the following arguments:

  • The scale breaks: `breaks=c(50, 100, 500, 1000, 2500, 7000)`
  • The transformation type: `trans="sqrt"`
  • The point size range: `range=c(1,10)`
  • The legend label: `name="Count"`
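Put together, the filter and the size scale might look roughly like this. The `counts` data frame and its values are invented for illustration; the four scale arguments are the ones listed above. Recent ggplot2 releases rename `trans` to `transform`, so this call may emit a deprecation warning there.

```r
library(ggplot2)

# Hypothetical per-location counts (values are made up for illustration).
counts <- data.frame(
  lon   = c(-122.42, -74.01, -118.24, -87.63, -71.06, -104.99),
  lat   = c(37.77, 40.71, 34.05, 41.88, 42.36, 39.74),
  count = c(6800, 2400, 900, 120, 45, 4)
)

# Keep only locations with more than 9 users.
big <- counts[counts$count > 9, ]

# Map circle size to the user count.
p <- ggplot(big, aes(x = lon, y = lat, size = count))
p <- p + geom_point(shape = 21, colour = "black", alpha = 0.5)

# Square-root transform, so circle area rather than radius tracks the count.
p <- p + scale_size_continuous(
  breaks = c(50, 100, 500, 1000, 2500, 7000),
  trans  = "sqrt",    # called `transform` in recent ggplot2 releases
  range  = c(1, 10),
  name   = "Count"
)
print(p)
```

Hollow circles (`shape = 21`) with some transparency keep overlapping bubbles distinguishable, which is what makes dense areas readable.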

The map now clearly shows areas with a significant amount of urban sprawl (Los Angeles and Washington DC, for example). Contrast these with Boston and New York, where we find a concentration of overlapping bubbles in a much smaller area. We now have a pretty good idea of where most GitHub users in the USA are based, as well as roughly how many of them are in each of these areas.
