Part 1: Exploring GitHub Demographics

This is the first post in a series of data science tutorials involving GitHub profile data.


We explored 2 million public GitHub profiles and found notable differences in usernames between active and inactive profiles.

GitHub is an online platform for code collaboration and project management. The service is used widely enough that a presence on GitHub could help you get a job. That popularity may also make it a target for programmatically created “spam” profiles that serve the never-ending motives of spammers.

We began with an open-ended exploration of GitHub's public user profiles.

GitHub contains an enormous amount of information on the state of the open-source world. The frequency of code commits and the number of followers present a compelling picture of a project's popularity. At the individual level, almost all public information is available through a user's RSS feed or the GitHub API.

With the goal of better understanding GitHub demographics, we parsed the first 2 million public user profiles on GitHub. The most popular user, in terms of followers, is Linus Torvalds, creator of the Linux kernel and Git. In second and third place are two of GitHub's co-founders, Chris Wanstrath and Scott Chacon.

Users with the most public repositories (repos) on GitHub have largely mirrored public repositories and archive networks. For example, CPAN (Perl) and PLD Linux have been forked by many users. The user with the most public repos is pombredanne, with more than 6,000.

We found inactive "ghost" users and analyzed their usernames.

Having investigated the most popular users, we turned our attention to the other end of the spectrum. Around half of all registered users appear inactive: they have never publicly shared any code. We defined an active user as one who has at least one follower, public repository, or gist, or who follows at least one other user. Inactive users met none of these criteria. Of the inactive profiles, 80% not only have zero activity but are also missing all non-required profile information. We called these users "ghosts". It is possible that these users only commit to private repositories, but we have no way of evaluating that hypothesis.
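
The definitions above can be sketched as a small classifier. This is a minimal sketch, not the code used for the analysis. The activity fields (followers, following, public_repos, public_gists) follow the user object returned by the GitHub API, while the list of optional profile fields is an assumption about what counts as "non-required profile information".

```python
from typing import Any, Mapping

# Counters that signal activity, per the definition in the text.
ACTIVITY_FIELDS = ("followers", "following", "public_repos", "public_gists")

# Assumed set of optional profile fields; the article does not list them.
OPTIONAL_FIELDS = ("name", "company", "blog", "location", "email", "bio")


def is_active(profile: Mapping[str, Any]) -> bool:
    """True if the user has any follower, followee, public repo, or gist."""
    return any((profile.get(field) or 0) > 0 for field in ACTIVITY_FIELDS)


def is_ghost(profile: Mapping[str, Any]) -> bool:
    """True if the user is inactive and every optional field is empty."""
    has_details = any(profile.get(field) for field in OPTIONAL_FIELDS)
    return not is_active(profile) and not has_details


def classify(profile: Mapping[str, Any]) -> str:
    """Label a profile as 'active', 'ghost', or plain 'inactive'."""
    if is_active(profile):
        return "active"
    return "ghost" if is_ghost(profile) else "inactive"
```

Under these assumptions, a profile with zero counts and no optional details is labeled a ghost, while an inactive profile that filled in, say, a location is labeled inactive.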

Since we have little information on the ghosts, we compared the character frequencies (as a share of all characters) in their usernames to those in active users' usernames.
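
As an illustration of "share of all characters", the sketch below pools every character across a group of usernames and divides each count by the total. The function name and the choice to fold case are assumptions made for this example, and the usernames are made up.

```python
from collections import Counter
from typing import Dict, Iterable


def char_shares(usernames: Iterable[str]) -> Dict[str, float]:
    """Return each character's share of all characters in the usernames."""
    counts = Counter()
    for name in usernames:
        counts.update(name.lower())  # assumption: upper and lower case are merged
    total = sum(counts.values())
    if total == 0:
        return {}
    return {char: n / total for char, n in counts.items()}


# Toy usernames, made up for illustration only.
active_shares = char_shares(["alice", "bob"])
ghost_shares = char_shares(["a1", "b22"])
```

Because the shares within a group sum to one, groups of different sizes can be compared directly.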

Figure: Character frequency in GitHub usernames (distribution bar plot)

The frequencies of username characters differ significantly between active users and ghosts. All letters except q, x, y, and z appear more frequently in active usernames. All numeric digits appear more frequently in ghost usernames. This finding is curious.
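
One way to read off a comparison like this is to take the per-character difference between the two sets of shares and group characters by its sign. The sketch below assumes the shares are already available as dictionaries. The function name and the numbers are hypothetical and are not the article's data.

```python
from typing import List, Mapping, Tuple


def split_by_group(
    active: Mapping[str, float], ghost: Mapping[str, float]
) -> Tuple[List[str], List[str]]:
    """Return (chars more frequent in active names, chars more frequent in ghost names)."""
    more_active, more_ghost = [], []
    for char in sorted(set(active) | set(ghost)):
        # A character missing from one group counts as a share of zero.
        diff = active.get(char, 0.0) - ghost.get(char, 0.0)
        if diff > 0:
            more_active.append(char)
        elif diff < 0:
            more_ghost.append(char)
    return more_active, more_ghost


# Made-up shares for illustration; these are not the article's data.
toy_active = {"a": 0.5, "e": 0.3, "1": 0.2}
toy_ghost = {"a": 0.4, "e": 0.2, "1": 0.4}
more_active, more_ghost = split_by_group(toy_active, toy_ghost)
```

This only reports the direction of each difference. It does not test whether a difference is statistically significant.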

One explanation for this phenomenon is software generation of ghost usernames. Much like Panabee generates semi-random company names, a hypothetical program might choose a first and last name from a set, abbreviate the first name to one character, and add numbers to the end.
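
The hypothetical program described above might look something like the following sketch. Everything in it is an assumption made for illustration: the name pools, the limit of four digits, and the function name. It is not a reconstruction of any actual tool.

```python
import random
from typing import Sequence

# Hypothetical name pools; a real generator would presumably use larger lists.
FIRST_NAMES = ("alice", "bob", "carol", "dave")
LAST_NAMES = ("smith", "jones", "nguyen", "garcia")


def generate_username(
    rng: random.Random,
    first_names: Sequence[str] = FIRST_NAMES,
    last_names: Sequence[str] = LAST_NAMES,
    max_digits: int = 4,
) -> str:
    """Build a username as first initial + last name + random digits."""
    initial = rng.choice(first_names)[0]  # abbreviate the first name
    last = rng.choice(last_names)
    n_digits = rng.randint(1, max_digits)  # inclusive on both ends
    digits = "".join(rng.choice("0123456789") for _ in range(n_digits))
    return f"{initial}{last}{digits}"


rng = random.Random(0)  # seeded only to make the sketch reproducible
sample = [generate_username(rng) for _ in range(5)]
```

Names built this way keep only one letter of the first name and always end in digits, so a large batch of them would plausibly shift character shares toward digits.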

In theory, there exists some latent process by which people, in aggregate, generate usernames. Some individuals use a first initial plus last name. Others replace 'e' with '3' and 'i' with '1'. In aggregate, these human algorithms follow some unobserved character frequency distribution. If a software username generator were not properly calibrated to the true process, then we would observe a discrepancy in character frequencies between real and computer-generated users.

Ultimately, ghosts' usernames seem more likely to be auto-created than active usernames.

In summary, active and inactive users on GitHub demonstrate distinct distributions of username characters. We speculate that this difference results from the programmatic creation of users and usernames. Why someone would want to create fake GitHub users is unclear. The main restriction of free GitHub accounts is that the repositories are public, and additional user accounts provide no additional privacy.

This is the first in a series of articles on social coding communities. Read part 2 in the series, Parsing GitHub Profiles.