Gitter: Toward Twitter on Git(hub)

tl;dr -- I love Twitter and I love git, and wouldn't it be nice if I could build the former on the latter? What follows is a flight of fancy for your Friday feed reading...

Introduction

I went on a bike trip with my family last weekend, and when I came home, everyone was in a flap about how Twitter was going to screw third-party developers. Numerous people started the familiar chant about how Twitter is core infrastructure of the web, that a for-profit company driving it is going to end badly, etc. I have to say that I agree with all of this. Twitter seems too important to me to have its direction come from the need to make money.

How do you recreate Twitter? This isn't an "I could build Twitter in a weekend" post, since I have too much respect for the kinds of problems they are solving, the scale involved, etc. Instead, this is a thought experiment.

If you're a developer, you're probably already using git and github. If you don't use them, think of git as a database, and github as free hosting for your database. There are lots of free git hosting options, but github has captured the hearts and minds of the developer community. Github lets me, you, and your dog (your dog needs a Twitter account too, right?) host open source projects for free, and gives us 300M of storage to do so.

So, what is a tweet? Let's simplify and say it's max 140 characters, written by a particular person, possibly in response to another tweet, at a particular time, potentially containing links, and maybe links to Twitter objects. And what is a git commit? A (possibly empty) set of file changes, with an author (and a potentially different committer, which will be important below), a commit message, a date, and a globally unique identifier, the sha (e.g., 78dc050439110a2949f1d32df2c69962dcf92258). A commit message is typically meant to be human-readable text, and a commit is typically meant to contain files, but neither of those has to be true. With a bit of cutting, some JSON, and imagination, the two are compatible.

My basic idea is to use git clones/forks, hack commit messages, and store metadata in files (details in a minute). More importantly, the aim of gitter would be to enable the ease of social networking without requiring that anyone give up control over their content. I would own my gitter content, and manage it. By using a distributed architecture, one which already scales and has redundant backup via parallel scaled infrastructure, gitter hopes to reduce the need for centralization. Gitter would also allow for open licensing of content and, beyond that, personal control over licensing. Gitter eliminates unwanted ads, spam, etc. by putting each individual in control of the users they follow.

Gitter

Gitter uses git as a distributed filesystem, and github.com as the primary git repo host (note: other git hosts are possible without affecting the architecture). Every gitter user has a fork/clone of the gitter repository. All gitter-messages are stored as JSON strings in the log message of an empty commit. All gitter-message commits are made to a "stream" branch in sequence. As a result, every gitter-message is globally unique, is identifiable via a git commit sha, and includes metadata such as author, date, etc.

It is important to emphasize that git and github are used as a backend, but in no way do I assume users will interact with either directly (though you could). Instead, operations on git and github will be done via clients. Using git and github gives more than 1 million users (I suspect it's much higher now) instant access to the gitter messaging system without creating a new account, and free, open accounts for anyone not already a user.

Users

Users of gitter begin by forking the gitter repo. This gives them a repo with two branches: master and stream. The master branch contains a number of files.

The first file contained within each gitter repo's master branch is profile.json. The profile.json file contains information about the user, including things like:

{
  "name": {
    "first": "David",
    "last": "Humphrey"
  },
  "login": "humphd",  
  "nickname": "humph",  
  "twitter": "humphd",  
  "license": "http://creativecommons.org/licenses/by-nc/3.0/",  
  "email": "david.humphrey@email.com",  
  "urls": [  
    "http://vocamus.net/dave"  
  ],  
  "locale": "en-CA",  
  "timeZone": "GMT-05:00",  
  "bio": "Bio for David Humphrey"  
}

The profile.json file describes a gitter user, and indicates important hints for clients, for example the default license for all messages written/published by this user.

The second file contained within each gitter repo is following.json. This file contains the list of other gitter users, and their repos, that the current user follows.

The gitter repo also includes an orphaned, empty branch called stream. This is the branch where gitter messages go--more details follow.

Following Users

Every gitter user has a fork of the gitter repo. As a result, every user has a unique URL, for example:

git://github.com/someone/gitter.git

Following a user means doing two things:

1. Adding their forked repo as a read-only remote to your gitter fork:

git remote add someone git://github.com/someone/gitter.git

2. Adding the user's repo information to following.json in the master branch.

Here the user at git://github.com/someone/gitter.git was added using their default github username. However, the only thing that matters globally is the URL--we can use whatever name we want. Unlike Twitter, G+, Facebook, etc., where two people can't share the same name, in gitter you can refer to people using any common name you want. If I know git://github.com/someone/gitter.git as dave, I can use that shortname. In fact, users can provide a preferred nickname in their profile.json file as a clue to clients. In other words, we can:

git remote add [nickname] git://github.com/someone/gitter.git

This means that two users can follow the same person (i.e., follow git://github.com/someone/gitter.git) and choose to refer to him/her with different names. The gitter client will make sure there is no confusion internally, by always using URLs instead of short names.

The remotes allow git to manage internal links to other users' repos, but they are local to your clone and so don't provide visibility to other users--there is no way for anyone else to find out who you follow. Therefore, we also keep a list of other gitter users we follow in the following.json file. The file looks like this:

[  
  {  
    "user": "jbuck",  
    "nickname": "jon",  
    "twitter": "jbuckca",  
    "url": "git://github.com/jbuck/gitter.git"  
  },  
  {
    "user": "someone2",  
    "twitter": "humphd",  
    "url": "git://github.com/someone2/gitter.git"  
  }  
]

The following.json file is an array of objects describing the users this user follows. Each user object contains a user field, which is the github username for this user. Next comes an optional nickname, which allows us to specify a different name for this user than the one they use on github. Next comes an optional twitter username, in case one wants to map or cross-post messages from gitter to Twitter. Finally comes the url of their gitter repo fork.

Unfollowing means removing that user's remote from your gitter fork, as well as removing their entry from following.json on the master branch.
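As a rough sketch of what a client might do for follow and unfollow, here is some Python that updates the following.json list and produces the matching git remote command. The function names, and the choice to use the nickname (falling back to the github username) as the remote name, are my assumptions for illustration and not part of any existing gitter client.

```python
def remote_name(entry):
    """Pick the git remote name: the nickname if present, else the github username."""
    return entry.get("nickname") or entry["user"]


def follow(following, user, url, nickname=None, twitter=None):
    """Return (updated following list, git command) for following the repo at `url`."""
    if any(entry["url"] == url for entry in following):
        raise ValueError("already following " + url)
    entry = {"user": user}
    if nickname:
        entry["nickname"] = nickname
    if twitter:
        entry["twitter"] = twitter
    entry["url"] = url
    command = ["git", "remote", "add", remote_name(entry), url]
    return following + [entry], command


def unfollow(following, url):
    """Return (updated following list, git command) for unfollowing the repo at `url`."""
    matches = [entry for entry in following if entry["url"] == url]
    if not matches:
        raise ValueError("not following " + url)
    remaining = [entry for entry in following if entry["url"] != url]
    command = ["git", "remote", "remove", remote_name(matches[0])]
    return remaining, command
```

The command lists are meant to be run inside the user's fork (for example with subprocess.run); the updated list still has to be written back to following.json and committed on master.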

Messages

A message is an escaped sequence of text. No HTML is allowed. Messages may be any length, but 140 characters is encouraged for compatibility with Twitter. Messages are stored in JSON strings. Here is the most basic example:

{  
  "text": "Writing a message in gitter!"  
}

Publishing this message involves doing an empty commit on the stream branch of the user's fork of the gitter repo:

git checkout stream && git commit --allow-empty \
  --message='{"text":"Writing a message in gitter!"}'

And then pushing:

git push origin stream

On github, the JSON then shows up as the commit message of a new, empty commit on the stream branch.
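Since users aren't expected to type these commands themselves, a client has to build them. Here is a minimal sketch in Python of how that could look: it serializes the message to compact JSON and returns the same three git commands as above. The function names are hypothetical, and the only validation is that the message has a "text" string.

```python
import json
import shlex


def publish_commands(message):
    """Build the git commands that publish `message` as an empty commit on stream."""
    if not isinstance(message.get("text"), str):
        raise ValueError('a gitter message needs a "text" string')
    # Compact JSON keeps the commit message on one line.
    payload = json.dumps(message, separators=(",", ":"), ensure_ascii=False)
    return [
        ["git", "checkout", "stream"],
        ["git", "commit", "--allow-empty", "--message=" + payload],
        ["git", "push", "origin", "stream"],
    ]


def as_shell(commands):
    """Render the commands as one shell line, quoting the JSON safely."""
    return " && ".join(shlex.join(command) for command in commands)
```

Passing the argument lists straight to subprocess.run avoids shell quoting altogether; as_shell is only there to show the equivalent command line.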

A message can be simple text. A message can also contain various types of data, and refer to external data.

A message might include a URL, which is itself simple text:

{
  "text": "Writing a message in gitter that includes a link to http://gitter.org"
}

A message might include references to external gitter objects, for example, another user:

{
  "text": "Writing a message in gitter that refers to @jon and @humph.",
  "users": {
    "jon": "git://github.com/jbuck/gitter.git",
    "humph": "git://github.com/humphd/gitter.git"  
  }  
}

In this example, two gitter users are referenced in the text. The author has used the nicknames @jon and @humph, which in turn refer to URLs of remote gitter repo forks. Other clients displaying this message might choose to show the nicknames differently, for example, if another user knows Jon as jbuck instead of jon. Internally the mapping will be back to the URLs. By cross-referencing the following.json file for this user, other information can be obtained about each user, for example, a Twitter username.
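To make the nickname mapping concrete, here is a small Python sketch of how a client might resolve the @names in a message back to repo URLs, and then redisplay them using the reader's own names from their following.json. The function names and the simple @word pattern are assumptions of mine; a real client would need a more careful mention syntax.

```python
import re

# Hypothetical mention syntax: "@" followed by letters, digits, or underscores.
MENTION = re.compile(r"@(\w+)")


def resolve_mentions(message):
    """Map each @nickname used in the text to the repo URL the author listed for it."""
    users = message.get("users", {})
    return {nick: users[nick] for nick in MENTION.findall(message["text"]) if nick in users}


def render_for_reader(message, reader_following):
    """Rewrite @nicknames using the names the reader has in their own following.json."""
    local_names = {
        entry["url"]: entry.get("nickname") or entry["user"]
        for entry in reader_following
    }
    users = message.get("users", {})

    def swap(match):
        nick = match.group(1)
        url = users.get(nick)
        # Fall back to the author's nickname when the reader doesn't follow this repo.
        return "@" + local_names.get(url, nick)

    return MENTION.sub(swap, message["text"])
```

Because the lookup goes through the URL, two readers can see the same message with different names for the same person and nothing gets confused internally.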

A message might include a reference to another gitter message. This can happen when a user Replies to another message, References something someone else said, or does the equivalent of a Retweet (RT):

{
  "text": "Writing a message in gitter that refers to another message @jon wrote",
  "users": {
    "jon": "git://github.com/jbuck/gitter.git"
  },
  "references": [
    "git://github.com/jbuck/gitter/commits/ec52b6907a1988f042eb61ac3450cf8361ef60b4"
  ]
}

Here I can track back the author of a message referenced in the current message. The fact that messages can reference one another removes the necessity that I copy things into the text itself--the client can surface this information alongside the message. In this case the original message can be displayed alongside the current message, in order to preserve conversational context.
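A client could assemble such a reply from the repo URL and commit sha of the message being answered. The Python below is only a sketch: the helper names are made up, and the reference format (repo URL without the .git suffix, followed by /commits/ and the sha) is simply copied from the example above.

```python
import re

FULL_SHA = re.compile(r"[0-9a-f]{40}")


def reference_url(repo_url, sha):
    """Build a reference to one commit, in the same form as the example message."""
    if not FULL_SHA.fullmatch(sha):
        raise ValueError("expected a full 40-character commit sha")
    base = repo_url[: -len(".git")] if repo_url.endswith(".git") else repo_url
    return base + "/commits/" + sha


def build_reply(text, nickname, repo_url, sha):
    """Build a message that mentions an author and references one of their commits."""
    return {
        "text": text,
        "users": {nickname: repo_url},
        "references": [reference_url(repo_url, sha)],
    }
```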

The concept of a Retweet in Twitter can be done using git's ability to cherry-pick commits from one branch to another. If I want to re-broadcast a message from one of the users I follow to the users who follow me, I only need to cherry-pick the commit from their remotes/[nickname]/stream branch onto my stream branch. Git keeps the original author on the new commit and records me as the committer, so the original author's info is preserved in my message stream.
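One practical wrinkle: gitter messages are empty commits, and git cherry-pick refuses to pick an empty commit unless it is given --allow-empty. A sketch of the commands a client might run for a retweet, in Python, looks like this (the function name is hypothetical, and I'm assuming the remote is named after the nickname):

```python
import re

FULL_SHA = re.compile(r"[0-9a-f]{40}")


def retweet_commands(remote, sha):
    """Build the git commands that re-broadcast commit `sha` from a followed remote."""
    if not FULL_SHA.fullmatch(sha):
        raise ValueError("expected a full 40-character commit sha")
    return [
        ["git", "fetch", remote],      # make sure the commit exists locally
        ["git", "checkout", "stream"],
        # Messages are empty commits, so cherry-pick needs --allow-empty.
        ["git", "cherry-pick", "--allow-empty", sha],
        ["git", "push", "origin", "stream"],
    ]
```

Note that the cherry-picked commit gets a new sha in my stream, since the parent and committer differ from the original.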

All messages are published under the user's desired license. This is either the default gitter license (TBD), the license specified in the user's profile.json file, or the license specified in the message itself:

{
  "text": "This is a message for which I want to specifically use license XYZ",
  "licenses": [  
    "http://licenses/xyz"  
  ]  
}

In this case the license(s) for this message overrides any other implied license for this user or gitter in general. It is up to the client to surface this information.
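The precedence described here (message first, then profile, then the gitter default) is easy to express in code. Here is a minimal Python sketch; the function name is mine, and because the default gitter license is still TBD it is passed in as a parameter rather than hard-coded.

```python
def effective_licenses(message, profile, default_license):
    """Return the list of license URLs that apply to `message`.

    Precedence: the message's own "licenses", then the "license" in the
    author's profile.json, then the gitter-wide default.
    """
    if message.get("licenses"):
        return list(message["licenses"])
    if profile.get("license"):
        return [profile["license"]]
    return [default_license]
```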

Updates

As users write and publish messages, their followers keep up-to-date with them via

git fetch --all

This pulls in all commits, and thus messages/payloads, from the users (i.e., remotes) this user follows. Every followed user's messages are reachable via the remotes/[nickname]/stream branch pointer. Accessing and using the messages on these branches is possible without ever merging.
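Reading a followed user's messages without merging can be done with git log on the remote-tracking branch. The Python below is a sketch under a few assumptions: the remote is named after the nickname, messages live on that remote's stream branch, and the field and record separators (a NUL byte and an ASCII record separator) are my own choice of format string. Commits whose message isn't a JSON object with a "text" field are skipped.

```python
import json

# sha, author name, ISO 8601 author date, raw commit message; NUL between
# fields and an ASCII record separator (0x1e) after each commit.
LOG_FORMAT = "%H%x00%an%x00%aI%x00%B%x1e"


def log_command(remote, branch="stream"):
    """Build the git log command for a followed user's remote-tracking branch."""
    return ["git", "log", "--format=" + LOG_FORMAT, remote + "/" + branch]


def parse_log(output):
    """Turn the output of log_command into a list of gitter messages, newest first."""
    messages = []
    for record in output.split("\x1e"):
        record = record.strip("\n")
        if not record:
            continue
        sha, author, date, body = record.split("\x00", 3)
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            continue  # an ordinary commit, not a gitter message
        if isinstance(payload, dict) and "text" in payload:
            messages.append({"sha": sha, "author": author, "date": date, "message": payload})
    return messages
```

A client would run log_command("jon") after git fetch --all and feed the captured standard output to parse_log.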

Other ideas

It would be wonderful to leverage Github's gh-pages branches in order to automatically publish a web-reachable version of a user's stream. In this way, everyone would have a standard URL where I could go and look at their messages: http://humphd.github.com/gitter/ would be akin to https://twitter.com/#!/humphd. Providing a "Follow Me" link on this page, a way to see who this person follows, etc. would also be nice.

What about publishing things referred to in messages? What if I want to share an image or a PDF, and don't have a place to host it? Because all messages are associated with commits, each message can also include a file or files.

{
  "text": "This is a message which has a PDF payload",
  "payload": [
    "git://github.com/jbuck/gitter/commits/ec52b6907a1988f042eb61ac3450cf8361ef60b4/document.pdf"
  ]  
}

Here the user has chosen to publish the PDF document to which they refer in the message. This document, document.pdf, is committed along with their message. Followers of this user can choose to obtain the document remotely (i.e., via payload URL), or can save it locally (e.g., git fetch). In this way distributed publishing of content larger than appropriate for a message can be done easily. Another way to one-click publish media to the web would be to use Github's Downloads page: http://developer.github.com/v3/repos/downloads/

Problems

Of course there are all kinds of problems I'm skipping over. Let's consider some of them.

How do we support conversations among people who don't follow each other? I often have people who I don't follow reply to things I write on Twitter. It's a great aspect of the platform, and one I don't have a good solution for above.

How do we deal with Github's rate limiting? You only have 5000 API calls per hour. I wrote a node.js client for what I did above, but it's unrealistic to go this route without having the ability to do more calls.

If we don't use the Github API, and do raw git repo commands, how do you do this on mobile? It's probably not realistic to have a git repo on your mobile phone due to size. Maybe that's only an issue today, and won't be in a few years.

How do we manage state between browsers/devices? That is, if I've read your tweet at 3:00pm, and I move to my mobile phone, how do I skip pulling that one in? I'd need to record the commit sha of the place I left off, but where to put that? I could put it in the following.json file, but that means I have to push a ton of commits to my repo every time I get new messages.

This also raises the problem of bandwidth. I currently follow 265 people on Twitter. In order to get updates, I need to pull in commits from 265 repos, or do 265 calls to the Github API. And I need to do that every N minutes (1, 2, 3?). That's potentially a lot of traffic. How to reduce it?
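To put rough numbers on this, here is a back-of-the-envelope calculation in Python. It assumes one API call per followed user per poll and nothing else counted against the limit, which is optimistic; the function names are just for illustration.

```python
def calls_per_hour(following_count, interval_minutes):
    """API calls per hour when polling every followed user once per interval."""
    return following_count * 60 / interval_minutes


def min_interval_minutes(following_count, hourly_limit):
    """Shortest polling interval (in minutes) that stays within the hourly limit."""
    return following_count * 60 / hourly_limit


if __name__ == "__main__":
    for minutes in (1, 2, 3, 4):
        print(minutes, "min:", calls_per_hour(265, minutes), "calls/hour")
    print("break-even interval:", min_interval_minutes(265, 5000), "minutes")
```

With 265 followed users and a limit of 5000 calls per hour, polling every minute needs 15,900 calls per hour, and even polling every 3 minutes needs 5,300. The break-even interval is about 3.2 minutes, and that leaves no calls for anything else.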

Conclusion

I've actually written some code to experiment with the above, and it's not really workable yet. However, I'd love to see it get to a state where it is workable. There is no reason why I shouldn't have a complete history of my messages and those of the people I follow, why I shouldn't be able to specify a license for what I write, or why I should depend on a single company to control my communication.

I'm looking forward to using gitter some day soon.

Thanks to Jon Buckley for helping me think through some of the details above.