Proof: git project hosting on CDN

Here's a proof of how git projects can be hosted on a CDN. Specifically, a cloneable repo and HTML code browser are hosted on Rackspace Cloud Files (other CDN's like s3/CloudFront would work too). Project collaboration now takes place over email, which works for the Linux kernel and should work for most small projects also.

Before going further, here's the end result, a browseable and cloneable repository of a small curses game I wrote recently.

Many, many, many git code projects are hosted outside of walled gardens like Github. The reasons include:

CDN's are among my favorite utility technologies. This blog is hosted on CloudFiles, and the prospect of hosting code projects is interesting for all the same reasons. A CDN is a low level workhorse and unlikely to have outages or security issues, compared to "smarter" code hosting solutions. I don't have to maintain a server, and the monthly charges are a few quarters and dimes.

While using a CDN doesn't technically count as self-hosting, it is by no means a walled garden either. The CDN's out there are close to being interchangeable, and exporting is easier than importing. Anyway, the solution here could be self-hosted if you want.

Code browser

Git ships with a CGI script for browsing repos over HTTP, but we want a tool that generates the static HTML. A quick web search turned up the competent git2html. While the tool doesn't feel quite finished (TODO: CSS), it's very functional.

Let's run the tool on a small repo, and look at the output locally.

erik@msi ~/tmp $ git2html.sh
Usage /home/erik/src/git2html/git2html.sh [-prlbq] TARGET
Generate static HTML pages in TARGET for the specified git repository.

  -p  Project's name
  -r  Repository to clone from.
  -l  Public repository link, e.g., 'http://host.org/project.git'
  -b  List of branches to process (default: all).
  -q  Be quiet.
  -f  Force rebuilding of all pages.
erik@msi ~/tmp $ git2html.sh -p mountain -l http://foo.example.com/mountain.git -r /home/erik/src/mountain mountain
Rebuilding all pages as output template changed.
Cloning into '/home/erik/tmp/mountain/repository'...
done.
Note: checking out 'origin/master'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at 4ca5f13... Explicitly set foreground/background for terminals that need it
Deleted branch master (was 4ca5f13).
From /home/erik/src/mountain
 * [new branch]      master     -> refs/origin/master
warning: refname 'origin/master' is ambiguous.
HEAD is now at 4ca5f13... Explicitly set foreground/background for terminals that need it
warning: refname 'origin/master' is ambiguous.
Branch master (1/1): processing (2 commits).
Commit 4ca5f1339aaf6b69cffe06c77f0e5259aa3897f1 (1/2): processing.
Commit 4b83f06f9041e9445f0d15930ea42ad9fbc96f7a (2/2): processing.
erik@msi ~/tmp $ git2html.sh -p mountain -l http://foo.example.com/mountain.git -r /home/erik/src/mountain mountain
warning: refname 'origin/master' is ambiguous.
HEAD is now at 4ca5f13... Explicitly set foreground/background for terminals that need it
warning: refname 'origin/master' is ambiguous.
Branch master (1/1): processing (2 commits).
Commit 4ca5f1339aaf6b69cffe06c77f0e5259aa3897f1 (1/2): processing.
Commit 4ca5f1339aaf6b69cffe06c77f0e5259aa3897f1 (1/2): already processed.
Commit 4b83f06f9041e9445f0d15930ea42ad9fbc96f7a (2/2): processing.
Commit 4b83f06f9041e9445f0d15930ea42ad9fbc96f7a (2/2): already processed.

erik@msi ~/tmp $ find mountain -type f | wc -l
49

You see that the script starts with a checkout into a "repository" directory. We'll push this along with the generated markup, and the world can clone from it (with some extra steps).

I have some git warnings that I frankly haven't bothered to debug since the output seems to be correct anyway.

Running the tool a second time results in some "already processed" lines, showing that the markup is updateable and tries to skip regenerating markup when possible.

Finally, notice that 49 files are generated, which seems a little excessive for a repo with two commits and two files in its tree. After testing with some larger repos, it looks like eight files are generated per commit. This is the cost of pregenerating every diff, every tree, etc.

Syncing to CloudFiles

The file count made me look again at sync tools. For the blog engine I wrote some script using the pyrax python lib, which is Rackspace's sanctioned lib for their cloud API (which differs in subtle ways from a stock OpenStack implementation). But this approach is slow and unidirectional. Searching for other sync tools, I was disappointed at the slim pickings, and was preparing to write something in erlang.

I searched one last time and hit the jackpot with cloudfuse, a FUSE (filesystem in userspace) implementation for Linux. It built without warnings and worked the first time I tried it. On top of that, it's much faster than I was expecting. I've put it into my fstab for easy re-mounting on demand (the cloudfuse tool itself isn't required):

erik@msi ~/tmp $ cat /etc/fstab
...
cloudfuse /home/erik/var/cf fuse username=foouser,api_key=fookey,region=ORD,user,noauto 0 0
...

Cloudfuse isn't a perfect solution. For example, I've given up fine-grained control of the MIME types that I enjoy with pyrax. It's good at quickly and easily syncing a lot of files, though, so I can overlook this minor problem.

A nice property of cloudfuse is that it's bidirectional. If I want to migrate off of CloudFiles, retrieving the whole thing is as trivial as a tar command. I'm also quite sure that FUSE modules are available for other CDN's, making the data easily portable.

Fixing minor problems

Trying to run git2html.sh directly in a cloudfuse mount results in a "Operation not implemented" error. It turns out that git2html.sh very reasonably symlinks HEAD to the parent commit in the HTML output, but a CDN has no notion of a symlink. The solution was to generate to a staging directory, and then copy to the cloudfuse mount with cp --recursive --dereference which causes symlinks to be followed through to their backing files.

A second fix is required in the staging directory. While git2html.sh has cloned the source repository, we ultimately want folks to clone from the CDN, which git calls a "dumb HTTP" transport. We must run git update-server-info, which results in git generating whatever packed metadata is required, things that "smart" transports can generate on the fly.

Thirdly, folks are used to cloning from bare repositories, whose clone path ends in "project.git". However, git2html.sh clones non-bare into "proj/repository/.git". Thus, the correct argument for the -l command looks like "http://blog.mackdanz.net/code/mountain/repository/.git" and results in the user cloning to a directory called "repository", regardless of the project name. Fortunately, git2html.sh isn't picky about the value of its -l argument, so we'll specify it with two args, appending the final clone argument so that the cloning user clones to a directory named for the project. In other words, we'll invoke this:

git2html.sh -p mountain -l 'http://blog.mackdanz.net/code/mountain/repository/.git mountain' -r /home/erik/src/mountain mountain

so that the clone instructions render like this:

Clone this repository using:
  git clone http://blog.mackdanz.net/code/mountain/repository/.git mountain

Automation

I like to keep my home directory looking Unix-y: code goes in ~/src, data in ~/var, scripts in ~/bin, config in ~/etc, etc.

Having all my code in a flat tree under ~/src makes it easy to create a script (~/bin/synccodetocdn) that can publish any code project to CDN:

#!/bin/bash

set -e
set -x

mountpoint=~/var/cf

if ! (findmnt $mountpoint >/dev/null); then
    echo "No cloudfuse mount.  Run:"
    echo "  mount $mountpoint"
    exit 1
fi

projroot=$mountpoint/blog/code
stageroot=~/tmp/stageroot
mkdir -p $projroot $stageroot

function syncrepo() {
    local repo
    repo=$1
    pubcloneurl="http://blog.mackdanz.net/code/${repo}/repository/.git ${repo}"
    pushd $stageroot
    git2html.sh -p $repo -l "$pubcloneurl" -r ~/src/$repo $repo

    # Required for a "dumb" http transport to be cloneable
    pushd $repo/repository/.git
    git update-server-info
    popd

    popd

    pushd $projroot
    # --dereference b/c no symlinks allowed in CDN
    cp --recursive --dereference $stageroot/$repo .
    popd
}

repos="mountain"

for repo in $repos; do
    syncrepo $repo
done

One inefficiency here is that cp will write every file, even unchanged ones. Since CloudFiles supports Etag, it should be possible to push only new/changed files. Maybe cloudfuse's "stat" call could include an etag check, or it could report the Etag (sometimes implemented as an encoded date) as the file's ctime. Maybe this is somehow handled already and I just need to change my cp to rsync (which can also delete files from the target, a necessary behavior cp can't provide).

Changing workflow

I've collaborated over email, specifically when sending a patch to password-store, although I haven't done it regularly for a project of my own.

Git supports it well, and documentation is plentiful:

Note especially that git has a "request-pull" command, which predates Github and works over the internet. Github's pull requests are internal only, designed to be a lovely wall for their garden.

Summary

This project demonstrates that it's possible to host major parts of an open source code project on a static CDN, deferring to email for the collaborative bits.

I'm not sure what I'll do with this next. I could say "proven" and be done with it. At the other extreme, I could move all my Github code to CloudFiles, and do the same with some home projects. That amount of effort feels premature for something I'm calling a proof. I have some interest now in maintenance of git2html.sh and cloudfuse, both of which could use a few patches from me and also a friendly ebuild. Stay tuned, I guess.