We analyzed 2 million public GitHub profiles and discovered some notable differences in username structure between active and inactive profiles.
GitHub, an online platform for code collaboration and project management, is used widely enough in the industry that simply having an active profile can lead to job offers. As such, it is also fertile ground for systematically generated ‘spam’ profiles created to serve the obscure and constantly evolving goals and motives of spammers.
GitHub contains a massive amount of information on the state of the open-source coding world, with the frequency of code commits (saved revisions or changes to a file) and the number of followers presenting a largely accurate picture of a project’s popularity. At the individual level, most public information is available through a user’s RSS feed or the GitHub API.
To better understand GitHub demographics, we parsed the first 2 million public user profiles. We discovered that Linus Torvalds, author of the Linux kernel and Git, is the most popular user in terms of number of followers, followed by GitHub founders Chris Wanstrath and Scott Chacon.
We also found that the users with the most public repositories, or repos, on GitHub have for the most part mirrored existing public repositories and archive networks. For example, CPAN (Perl) and PLD Linux have been forked by many users. The user with the most public repos is pombredanne, with more than 6,000.
After analyzing the most popular users, we turned our attention to the other end of the spectrum. Nearly half of all GitHub users appear to be inactive; that is, they have never publicly shared a single line of code. We defined an active user as one who has at least one follower, has at least one public repository, or follows at least one other user, and an inactive user as one who meets none of these criteria. Of the inactive profiles, 80% not only have zero activity, but are also missing all non-required profile information. We called these users ‘ghosts’. While it is possible that these users only commit to private repositories, this explanation seems dubious, and in any case we have no way of evaluating that hypothesis.
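The definitions above can be written down as a small classifier. This is a minimal sketch rather than the code used for the analysis: it assumes each profile is a dictionary with the counts stored under `followers`, `following` and `public_repos`, and the list of optional fields in `OPTIONAL_FIELDS` is our own guess at what counts as ‘non-required’ information.

```python
from typing import Any, Mapping

# Assumption: these are the optional profile fields. The article does not
# list which fields were checked, so treat this tuple as illustrative.
OPTIONAL_FIELDS = ("name", "company", "blog", "location", "email", "bio")

# Assumption: activity counts live under these keys in each profile dict.
ACTIVITY_FIELDS = ("followers", "following", "public_repos")


def classify(profile: Mapping[str, Any]) -> str:
    """Return 'active', 'ghost' or 'inactive' for one profile."""
    # Active: at least one follower, one followed user, or one public repo.
    if any((profile.get(field) or 0) > 0 for field in ACTIVITY_FIELDS):
        return "active"
    # Ghost: no activity and every optional field is missing or empty.
    if all(not profile.get(field) for field in OPTIONAL_FIELDS):
        return "ghost"
    # Everything else has no activity but some profile details filled in.
    return "inactive"
```

Under these assumptions, `classify({"login": "jsmith42"})` returns `'ghost'`, while the same profile with a `name` filled in would be labeled `'inactive'`.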
The ghosts provide very little in the way of personal details. Working within a very narrow range of information, we decided to go with something basic and compare character frequencies (as a share of all characters) in ghost usernames to character frequencies in active usernames.
We found that, with the exception of q, x, y and z, all letters appear more frequently in active usernames, while the numeric digits appear more often in ghost usernames. Why is this?
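For readers who want to try the comparison themselves, here is one way it could be done. It is a sketch under our own assumptions, not the original analysis code: usernames are lowercased before counting, and the helper names `char_shares` and `share_gap` are made up for this example.

```python
from collections import Counter
from typing import Dict, Iterable


def char_shares(usernames: Iterable[str]) -> Dict[str, float]:
    """Frequency of each character as a share of all characters."""
    # Assumption: case is ignored, so 'A' and 'a' count as the same character.
    counts = Counter(ch for name in usernames for ch in name.lower())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}


def share_gap(active: Iterable[str], ghost: Iterable[str]) -> Dict[str, float]:
    """Active share minus ghost share for every character seen in either set.

    Positive values mean the character is more common in active usernames,
    negative values mean it is more common in ghost usernames.
    """
    active_shares = char_shares(active)
    ghost_shares = char_shares(ghost)
    chars = sorted(set(active_shares) | set(ghost_shares))
    return {
        ch: active_shares.get(ch, 0.0) - ghost_shares.get(ch, 0.0)
        for ch in chars
    }
```

Calling `share_gap(active_names, ghost_names)` on two lists of usernames gives one number per character, which can then be sorted or plotted to see which characters lean toward which group.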
One possible explanation is that the ghost usernames are software-generated. In the same way that Panabee generates semi-random company names, a hypothetical username-generating program might choose a first and last name from a set, abbreviate the first name to a character, and add a number to the end.
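To make the idea concrete, here is a toy version of such a generator. Everything in it is hypothetical: the name lists, the range of the numeric suffix and the function name `generate_username` are invented for illustration, and we have no evidence that any real spam tool works this way.

```python
import random

# Hypothetical name pools, invented purely for this illustration.
FIRST_NAMES = ["alice", "bob", "carol", "dave"]
LAST_NAMES = ["smith", "jones", "garcia", "lee"]


def generate_username(rng: random.Random) -> str:
    """Build a username as first initial + last name + number."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    # Assumption: the numeric suffix is drawn uniformly from 0 to 9999.
    number = rng.randint(0, 9999)
    return f"{first[0]}{last}{number}"


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so runs are repeatable
    for _ in range(5):
        print(generate_username(rng))
```

Every name this produces ends in at least one digit, so a large batch of them would push the digit share of the character distribution well above what the letters-only parts contribute.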
How would this compare to the way that actual humans choose usernames? In theory, there is some latent process by which most people generate usernames. Some people use a first initial plus their last name, others replace ‘e’ with ‘3’ and ‘I’ with ‘1’ and so on. In aggregate, these human algorithms would follow some unobserved character frequency distribution. If a software username generator were not calibrated to perfectly mimic the ‘real’, human process, then we would observe a discrepancy in character frequencies between real and computer-generated usernames. Having observed such a discrepancy, we think it likely that GitHub’s ghost usernames were auto-created.
Why would anybody want to create ghost profiles on GitHub? We have no idea. The main restriction placed on free GitHub accounts is that the repositories for such accounts are public. Opening additional user accounts will not provide an individual user any additional privacy. When it comes to free accounts, more is not better. Given that spammers are a shady and secretive lot who generally do not make themselves available to discuss their work, we may never know the true reason for the existence of the ghost profiles.
Stay with us for Part 2, in which we show how we ingested the data from 2 million GitHub profiles.