Distributed version control is not my favourite technology
Wed 11 Jan 2012 by mskala
Tags used: programming, software

Not too long ago a free software project I'm peripherally involved in decided it was time to replace its old and not broken version control system with something new and broken, and the lead maintainer conducted a straw poll of what the new system should be. My suggestion of "anything, as long as it's not distributed" was shouted down by the chorus of "anything, as long as it's distributed." Having lost the argument in that forum, I'm going to post my thoughts on why distributed version control sucks here in my own space where it's harder for me to be shouted down.
Before I go further, I'd like to say that I am not here to tell you distributed version control sucks all the time, in general; only that it sucks in certain specific use cases, which I will outline - and those specific use cases happen to occur very frequently. I like Linux and I use Linux a lot; as a result I hold some fondness toward Linus Torvalds; but I'm not going to ignore the deficiencies of git just because it's his baby. I use SVN, but I'm not here to tell you that SVN is perfect or that it's always better than git, either. This isn't even about git versus SVN. It's about distributed versus non-distributed version control. However, since git and SVN are the most popular examples of distributed and non-distributed version control, some of their specific issues are highly relevant to a large number of users.
Distributed version control sucks for distributing software
Nearly all users of version control are non-developers.
It's easy to lose sight of that fact when choosing a version control system, because version control systems are chosen by developers, who naturally think of their own needs first. Version control systems are also developed by developers, who naturally think of their own and other developers' needs first. If we think of version control systems as being for developers, then when we think of use cases we think of developers' use cases; and in particular, we think of read-write kinds of operations like checking in a patch, or splitting and merging branches. What we, the people who choose and develop version control systems, do not generally think of is the following use case.
Here's Joe. Joe is reasonably smart, and he took a class called "computer science" in high school, and as a result he basically knows what a compiler is for, and he is able to invoke one given a reasonably well-behaved source tree. He's been dual booting Linux and Windows for several months, and these days he finds he only ever bothers with Windows when he's playing World of Warcraft, which doesn't even hold his interest much anymore; he's thinking of reformatting that partition. Joe has never written or modified a program over 200 lines in his life, and he isn't particularly likely to start doing that today. He wants to use the Frobozz open-source software package, but the current distributed binaries conflict with libraries on his system, and the last source tarball is three years old. Joe has the motive and ability to compile the latest development version of Frobozz, but Joe is not a developer.
What's going to happen, because this is how our world works these days, is that Joe is going to be told he ought to check out a copy of Frobozz from version control. The version control system will be used for the important function of getting Frobozz into the hands of a non-developer. Version control replaces what was once done with tarballs transferred over HTTP, let alone FTP. The longer a project remains under version control, the greater will be the tendency for its developers to use the version control system as a replacement for file releases - because they face a choice between boring packaging work to push each new distribution, or continuing to do the fun parts of development and never having to push a distribution at all. Distribution through a version control system can be just a side effect of normal development activities that developers would be doing anyway, and that way users can always get the latest version easily without having to wait for a release. At a glance it looks like a huge win for all parties.
Many successful software packages encourage non-developers to get their copies from version control systems rather than through other channels. The Linux kernel (especially post-2011, when kernel.org fell down, stayed down, and remains Not The Recommended Way) and MPlayer (notably the only Linux video player that actually works from a naive user's perspective, because the others don't play Windows codecs) are just two examples. Both are widely used by non-developers who nonetheless want to remain up-to-date with recent versions. I listed many more, including several essential KDE dependencies, in an earlier version of this article, but it was a boring list. I hope most readers can agree without the list that we all frequently need to get things from version control, even when we aren't developers of the packages in question.
As software packages become bigger and more successful, the ratio of non-developer users to developers tends to grow. Something like Tsukurimashou has one developer and I'm not sure there are any users other than myself. The Linux kernel has thousands of developers but millions of users. That means that on a large, popular project, people checking out copies that will be effectively read-only, just used to substitute for a tarball, will vastly outnumber people who check out copies to participate in development. Most actual use of the version control system will be for distribution, not for development. Joe is the 99%.
So here's the first obstacle for Joe, if Frobozz's developers have chosen a distributed version control system such as git: distributed version control is distributed. What Joe needs to do is "clone" or "check out" or whatever you want to call it, from the central server. But there is no central server. And that is not just an unfortunate accident; it is the very definition of distributed version control. There is no the central server; Joe must find a repository among potentially many of equal status. (Confusion between "the" and "a" is Vining's Oversight, which I named after Nicholas Vining from Gaslamp long before he was rich and famous. I'm not sure whether he wants to remember that, but businesses usually like getting shout-outs.)
Most likely the Frobozz developers are using git, and if they are doing that, then most likely they're using github. I love the name "github" because at just six letters, it is one of the shortest bizarre nonsensical self-contradictions I've ever seen. Just step back and think about how brilliant the existence, let alone popularity, of a thing called "github" actually is.
Ladies and gentlemen, I give you "github." Git. Hub. A hub for git. A hub for git, no fooling! The purpose of github is to provide a distinguished central nexus for a system whose main intentional noteworthy design feature was supposed to be its lack of a distinguished central nexus. The basic function of github is to deny the very foundation of what git stands for. I'm reminded of an establishment named "Vegetarian Fast Food Restaurant" that existed in a town where I used to work. They served meat. The halfway-reasonable justification was that many of their customers wanted meat. They managed to stay in business, but it's not clear to me that their marketing strategy was the best one possible.
Should you advertise, as your very definition, something that you will compromise in a drastic way in order to accommodate a customer population that doesn't want it? As a customer, should you do business with an establishment that offers what you're looking for, or with one that advertises not providing what you want as their main definition of themselves? Since we know that the large majority of users (namely the non-developers) need to use the version control system in a centralized way (namely to get updated versions), we should think twice about whether not being centralized should be a positive definition of what we want from the version control system. Nonetheless, developers keep putting "distributed" on the shopping list as a desired feature in itself rather than for its purported benefits when they go looking for version control systems, and they keep choosing git.
The second obstacle for Joe is that he has to download and store the entire history of all versions of Frobozz just to get a copy of the latest one, because that's what cloning a repository means. This data is of no use to him; he only wanted the latest version. But he has to download and store all versions that have ever existed, and he'll be told this is a feature. It's a feature only useful to developers, when they want to look back at past history, and even they only do that rarely. But every user (as well as the central server that we pretend doesn't exist) is forced to pay on all checkouts, which are the most frequent operation performed, for a feature that is only useful rarely to a small minority of participants whose proportional involvement shrinks as the project grows. Maybe causing that waste to occur should not be a positive design priority.
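To make that concrete, here is roughly what Joe gets told to type, and one way to see the overhead afterwards (the URL is invented, and the du options assume GNU coreutils):

    # The usual "get the latest version" instruction (hypothetical URL):
    git clone https://example.org/git/frobozz.git
    # Afterwards, compare the one version Joe wanted to the history he also got:
    du -sh --exclude=.git frobozz    # the current source tree
    du -sh frobozz/.git              # every revision that has ever existed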
The third obstacle for Joe is that there are so many distributed version control systems to choose from. Maybe Frobozz isn't using git. Joe has to go download Mercurial. And Python to make Mercurial work. Now he has two extra software packages to install, each much bigger than Frobozz, just so he can download Frobozz before he even gets to start on Frobozz's own dependencies. If you think Joe already has Mercurial, Frobozz will turn out to require Bazaar. If you think Joe already has Python, Frobozz will turn out to require darcs, and thereby Haskell. There's no end to this game. The profusion of required client software is of course an issue when any version control systems are used for distributing software to users, not just with distributed version control systems; and the profusion of languages is a problem with software in general; but being distributed is not entirely unrelated to the issue.
Because distributed version control systems suck so much, there is, more than with non-distributed version control systems, an incentive for a developer to think "I can do it right!" and create a new one. And because developers think distributed version control is cool, they're going to want to mix it with other technology they think is cool, such as their favourite bizarre languages. Don't ask me about my own ideas for the Prolog-based version control system. Users, in general, end up having to suffer through all this coolness just to download the damn source code of the end-user package they wanted to compile.
Distributed version control sucks for distributed development
We might be able to wave our hands at the issue of distributed version control's unsuitability for distributing software to non-developers by claiming that that's not what version control is for. To do so ignores the observed fact that people do use version control as their preferred means of conveying updates to non-developers whether that's what it's "for" or not, and so we will be in some amount of hurt if it turns out to suck when applied to that purpose; but maybe we are egotistical enough to claim that that's all everybody else's fault and not ours. A harder issue to ignore is that distributed version control also sucks in the rare but important case of distributed software development; and what is it "for" if not that?
I do most of my software development on one of two computers depending on whom it's for: stuff for my employers is mostly done on a desktop machine in my office, and stuff for my own projects is mostly done on a desktop machine in my apartment. Because of the nature of my work (academic research) there is some amount of overlap between these two general categories. I also have a laptop computer that I carry around and on which I sometimes do development work, with checked-out working copies from the same repositories that serve the home and office desktop machines. Sometimes I'll have updates made in more than one place on the same code, which then have to be merged, and so on; even though there's only one of me, I end up having to coordinate work in something like the same way a team of more than one developer might have to. And the laptop doesn't always have a reliable network connection, especially not when I travel. All this sounds like it might be claimed to be an ideal chance to use distributed version control. What I'm doing is well described as distributed development.
Keeping important archival information on my laptop is a bad idea because I might drop it, or it could be lost or stolen. It's a rule that the only copy of anything important never lives on the laptop. So, obviously, the only copy of the version control repository must not live on the laptop. That is no big issue for distributed version control, because of the multiple repositories. There will be other repositories elsewhere than on the laptop and I just have to make sure they're synchronized reasonably frequently.
I also can't keep valuable or confidential information on my laptop indefinitely because whenever I enter the USA, as my work sometimes requires me to do, U.S. Customs and Border Protection claims to have the legal authority to image my laptop's hard drive, force me to give them any necessary decryption keys to read it, and not tell me what they're going to do with the data. Both for my own reasons and as a result of my employers' policies, I must securely delete a lot of work materials from that machine before I carry it into the USA. Similar agencies in other countries - including Canada - are known to make similar claims, so if I'm as responsible as I should be, I'll clean the system before crossing any international border in either direction.
With a non-distributed version control system, the "drive must be clean when crossing an international border" constraint means that before crossing the border, I have to check in my local revisions and delete the working copy. If those local revisions weren't ready to go to the head, I'll create a new branch. In practice that's seldom a big deal, because protection from "I might drop the laptop" means that revisions seldom remain on the drive without a check-in to the central repository for more than a couple of hours anyway. After crossing the border, whenever I next want to work on the laptop, I must restore the necessary local information, which means checking out the head or branch by downloading one revision. I may have to do this over a crummy hotel Internet connection.
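As a sketch of what that routine looks like with SVN (the repository URL, branch name, and directory names are invented):

    # Before the border: park unfinished work on a branch, then wipe the working copy
    svn copy . https://example.org/svn/project/branches/travel-wip \
        -m "Park work in progress before travel"
    cd .. && rm -rf project-wc       # or a secure-delete tool, per policy
    # Later, from the hotel: download exactly one revision's worth of data
    svn checkout https://example.org/svn/project/branches/travel-wip project-wc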
Ability to work without a good Internet connection is claimed to be an advantage of distributed version control, so we might hope that distributed version control would shine in this case. Before crossing the border, I have to merge my local revisions into some other repository elsewhere. That may be more work than with a non-distributed system simply because the distributed system may be more likely to encourage me to keep a complicated structure of branched heads in my local repository on the laptop; but I really shouldn't do that for other reasons, so we can probably treat this merge as being equivalent to a check-in. Then I securely destroy the local repository; no difference there. The problem shows up when I'm sitting in my hotel room and need to re-create the local repository over the poor connection. Now I'm not just downloading the one revision I want to work on; I'm downloading every revision ever. It is true that once I have re-created the local repository over the Net, I can do a lot of repository meta-work (branching, merging, and so on) locally that would require a Net connection with a non-distributed version control system. But repository meta-work is or should be rare, especially when I'm working on my laptop from long-distance remote; it is not the common case we should be designing for, and it is not a big imposition that I can only do repository meta-work when I have a network connection.
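Concretely, the distributed version of that border-crossing routine looks almost identical on the way out; the difference only shows up in the hotel room (remote name, branch name, and URL invented):

    # Before the border: push local work to some other repository, then destroy the clone
    git push origin travel-wip
    cd .. && rm -rf project          # or a secure-delete tool, per policy
    # From the hotel: re-creating the local repository pulls every revision ever made
    git clone https://example.org/git/project.git project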
Even though I am a maintainer and have some important uses for the entire history of development (which is the reason for having a repository at all), I don't really want to maintain that on my laptop with its limited drive space. The entire history of development is less useful to me on my laptop than elsewhere; it is wiped out and recreated much more often than the local copies on my desktop machines; and each wipe-out and re-create step is disproportionately expensive because the laptop has limited resources and often needs to use a poor Internet connection. I just want a lightweight working copy on the laptop. Current distributed version control systems, and git in particular, do not offer lightweight working copies but only clones of the entire repository.
Distributed version control sucks for long-lived projects
I've griped about distributed version control systems being wasteful of space. That is a claim we could test easily: just check out or clone two comparable projects, one from a distributed and one from a non-distributed version control system. Those will end up being git and SVN, almost certainly. Do that and you'll probably find that the SVN working copy is bigger than the git cloned repository. So, I don't know what I'm talking about and you can all go home now, right?
Well, maybe. It's a fact that SVN stores two full uncompressed copies of the latest revision when you check out a working copy: there's the working copy itself, and there's another entire copy (I think SVN calls it the "text base") which basically exists just so that you can do "svn diff" offline. This is obviously suboptimal for users like Joe who will never do "svn diff." Git for the win, evidently.
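You can see the duplication directly in a pre-1.7 SVN working copy (the 1.7 format moved the pristine copies into a single store at the top of the working copy, but it still keeps them):

    # Each directory of a pre-1.7 working copy carries a pristine copy of its files,
    # which is what lets "svn diff" and "svn revert" work without a network connection:
    ls .svn/text-base/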
But just because SVN sucks doesn't mean git is good. What that comparison hides is a very important issue in computer science, namely the shape of the growth curve. Successful software projects grow for a finite amount of time and then are maintained, potentially forever. During the growth phase, most changes put into the repository consist of adding new code. Very little code is deleted. As a result, the code that is in the current version is a large fraction of all the code that has ever been in any version. The total of everything that has ever been checked in is therefore not much bigger than an uncompressed copy of the current version, and if you can do good delta compression followed by general-purpose compression, then you've probably beaten SVN despite storing many more versions.
But that's only true in the growth phase. I don't like the phrase "CADT model" because I'm a long-time youth rights advocate and I don't want to insult teenagers. For that matter, there's no need to insult sufferers of attention deficit disorder either. It remains, however, that the problem exists and we all know about it: many developers think their projects will only ever have growth phases, and then will be "finished" and will not need to be maintained, and the deadbeat developers will be able to walk away and go do something more fun than taking responsibility for what they've created; and it remains that those people are wrong. If any of us are serious and write our code to last, then we need to think about what happens during maintenance.
During maintenance, developers are not just adding new code but also deleting old code. As a result, the size of the current version doesn't grow, or doesn't grow much. But the total amount of code that has ever existed continues to grow with every checkin. The history continues to grow; a single version doesn't. This may be a smooth progression rather than a sudden state change: over time it becomes more the case that the history grows faster than the current version. And so a system that forces every copy to contain all of history will eventually, inevitably, have bigger copies than a system that only stores current versions. SVN copies during the initial growth phase are larger by a constant factor; git copies continue to grow over time even during maintenance as SVN copies stop growing, and so they have different asymptotics. If the project survives long enough, the git copy will inevitably dwarf the SVN copy, and the overhead in time and space payable by every new user will inevitably become excessive.
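If you want to watch the divergence on a real git repository, one crude way is to compare, at each release tag, how much history exists so far against how big the tree is at that tag (tag names invented; ls-tree reports uncompressed blob sizes, so treat the numbers as rough indicators only):

    for tag in v1.0 v2.0 v3.0; do
      echo "$tag: $(git rev-list --objects "$tag" | wc -l) objects of history so far"
      git ls-tree -r --long "$tag" | awk '{sum += $4} END {print "   current tree:", sum, "bytes"}'
    done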
We haven't really seen this happen to large popular packages managed with git yet, because git hasn't existed long enough. Note that the Linux kernel has only been under git control for about a third of its history, and much of the code in it is still shiny and new. Other distributed version control systems haven't been deployed in public on a large scale for long enough either. The inevitable grind of "upgrading" means that projects are quite likely to replace their version control systems (and forget their histories, making trouble for archival use) every few years anyway, and maybe that will shield them from ever facing this issue. But "your checkouts will necessarily become too big and slow someday, if the project stays on this system long enough" is a fundamental design flaw and we should think twice about counting on "well, we'll probably switch to something else before it bites us anyway."
Distributed version control sucks for archiving
It is a fundamental concept of academic work that there is a permanent archival record of what we do. If I publish a paper and something in it turns out to be wrong, I don't get to go back and change it. One reason is that others may have already cited material in it and I'm not allowed to break their links; another is that I have to take responsibility for my mistake. So I maybe publish something new saying "sorry, my earlier paper was incorrect!" but I don't get to cause the earlier publication not to have happened. At first blush that sounds like a natural fit to what version control systems do.
A closely related concept is that there is only one authoritative real official published version of anything. The official record may be mirrored across multiple libraries, for reliability and to keep individual repositories honest, but there is only one official record. It is not "distributed" in the sense of there being multiple different but equally valid versions of the truth; there is only one correct official academic record. Academics and academic publishers care a lot about which version of anything is the official version, and maintaining the integrity of the official version is a big part of how academic publishers are supposed to earn the huge amounts of money they are paid. If you go look at my publications page you can note that it's full of notices that I'm contractually required to include, specifying to what extent different files I link to are or are not the official original versions of things. Pretty often you can download a "preprint" from my own site which is as good as or better than the "real" official version of a paper, but it's not the "real" official version and to get the "real" official version you'll have to go through a paywall and either you or your institution will have to put up a startling amount of money for the privilege.
We can argue about whether the academic world should work that way. Many people think it shouldn't. It remains, however, that I'm not in a position to cause the world to operate in some other way all by myself, and I have to live within this world even as I may hope to change it. The practical consequence is that it's really useful to me to keep track of which version of my academic work is the official one. That means at the very least "tagging" it in version control. Fine, all version control systems offer tagging.
It also means keeping long-term records, because I or others may want to refer to materials I created many years earlier. Just in the context of my own career I've cited work I published myself that was over ten years old, and I'm near the start of my academic career. I've also cited work of others that was over a hundred years old. And that's in computer science, where there isn't all that much relevant century-old work. I've a friend who works in the field of "rhetoric," which is one of the oldest academic disciplines, and he cites publications that are thousands of years old.
And scientific work requires keeping archives of things that aren't software as we know it. Much of the work I do results in LaTeX documents. Those may or may not qualify as "software" (they're written in a Turing-complete language, anyway) and they may or may not be appropriate subjects for version control. They involve collaborative editing and merging of changes made by multiple authors; they have a history of revisions; they sure look like something it would be nice to have under version control just like more conventional software code, and if I'm working on a project that also includes conventional software, it would be nice to have the LaTeX documents under the same version control as the software.
But then there are data files, too, often large volumes of them. For some of my papers I'd like to be able to check in a gigabyte or two of supporting data right there alongside the C and Prolog code that processes it and the LaTeX source of the paper describing it. When I taught a course last Summer I wanted to check all the audio recordings of my lectures into version control next to the LaTeX source of my slides. If I teach the same course again I'll want to check in new slides that will be revisions of last Summer's slides, even though the tagged "this is what I actually presented last year" version remains sacrosanct. And then I'd like to keep an archive of all this at least until I retire, and ideally be able to pass it on in useful form to my intellectual successors.
So, fine. This is basically just an extreme case of exactly what version control is meant to do. The problem comes when I want to write my next paper. Am I going to use the same repository - and if it's a distributed version control system, will I be forced to download those gigabytes of data into every clone or working copy? Is it okay to inflict that on all my collaborators who might want to access the same repository? On the other hand, do I have to start a new repository, move the old one to some really separate archive, and lose the ability to link to my old work in any easy and history-preserving way?
Ideally, I'd have one really big official repository containing all the history of my work. This official version would be heavily fortified and backed up. On any given project I'd be working on just a subdirectory within the big repository, but it would remain possible to do a lightweight copy (retaining history) of anything from the archive into my new subdirectory. It would be nice if this could be distributed too, in the sense that I'd like to cross-link with other people who may have similar repositories. But there would still need to be some way of knowing which one was the real one and which ones were just mirrors or clients, because having an authoritative official copy is a necessity. And it would be nice if I could give people partial access to my repository for collaboration purposes - you are allowed to read and write this directory, only read that directory, and not see this other directory at all.
As of version 1.7.0, git supports "sparse checkout" of just part of the repository. By downloading the entire repository and then only showing you part of it. Thanks, guys!
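For the record, the 1.7-era procedure goes roughly like this (repository URL and path invented); note that the clone on the first line still transfers the entire history:

    git clone https://example.org/git/archive.git
    cd archive
    git config core.sparseCheckout true
    echo "papers/2012-bar/" >> .git/info/sparse-checkout
    git read-tree -mu HEAD       # the working tree now shows only that subdirectory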
SVN is better but not all that much better - it does partial checkouts nicely, but it does fine-grained access control only by cumbersome and bug-prone extra stuff bolted on top, not natively. This use case is to some extent inherently distributed, so it would seem that some kind of distributed version control might actually be good; but not git, and I don't know of one that would be much better. I'm currently using SVN for this despite its suboptimality, and many of my colleagues are using Dropbox.
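For what it's worth, here's a sketch of the SVN side (all URLs, paths, and names invented): partial checkouts and history-preserving copies are ordinary operations, while per-path access control lives in an authz file bolted on at the server.

    # Start a new paper as a cheap, history-preserving copy of old material
    svn copy https://example.org/svn/archive/papers/2010-foo \
             https://example.org/svn/archive/papers/2012-bar \
             -m "New paper, starting from the 2010 one"
    # A collaborator checks out only the subdirectory they need, not the gigabytes of data
    svn checkout https://example.org/svn/archive/papers/2012-bar
    # Per-path permissions are configured in an authz file wired into Apache or svnserve,
    # not stored in the repository itself:
    cat >> /etc/svn/authz <<'EOF'
    [archive:/papers/2012-bar]
    alice = rw
    bob = r
    EOF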
Although these issues could be mitigated in theory, that is not done in practice
The fact that git stores the entire history of everything in the repository in every cloned repository is relevant to most of the above; and that's an issue with git in its default configuration, not with the abstract concept of distributed version control in general (though the others do it too) nor even with git in all installations. There is no reason that a distributed version control system, even git, must be used that way. A system could be distributed without requiring all history to be reproduced everywhere. Git has the "--depth" option for cloning, which tells it to only clone the last few versions instead of the entire history. If Joe is using git to fetch what would be in the tarballs that the Frobozz developers no longer bother with, he can use "--depth" to avoid downloading and storing all the history he doesn't care about.
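A shallow clone looks something like this (invented URL); depending on the git version, the resulting repository comes with restrictions on what you can fetch, push, or clone from it:

    # Fetch only the most recent revision instead of the whole history
    git clone --depth 1 https://example.org/git/frobozz.git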
So why doesn't Joe use "--depth"? Joe doesn't use "--depth" because it's not the default, and Joe is going to use the default because that's what people like him do (and because it's what the authors of Frobozz recommend on their Web page). The "type this command to get a copy of the latest version" section of the Frobozz Web site is where the non-developer user will look, and it makes no mention of "--depth"; if Frobozz is hosted on Sourceforge then they cannot mention "--depth" there because that chunk of text happens to be Sourceforge-global and not written by the Frobozz developers. Joe is going to be stuck cloning the entire repository just to get the latest version, just because he doesn't know any better.
And even if he happens to read the "git-clone" man page carefully (which he will not do - remember that he's interested in Frobozz, not git), Joe will notice that "--depth" is described on the man page in a deprecating way. Most of the text about it consists of warnings of things you can't do if you use it, likely to dissuade users even if they really had no intention of attempting those things. We could imagine that a distributed version control system could be designed that could allow first-class participation by repositories that had "--depth" or its equivalent in effect. But git is not such a system because git wasn't designed for the overwhelming majority of users. Git was designed for Linus Torvalds.
Many of the problems associated with using a version control system (distributed or not) for distribution of software to users could be solved by automatic tarred snapshots of the current version from the version control system, either generated regularly on a schedule or on-the-fly. Github provides those, as do some other project hosting systems; so Joe doesn't really need a git client, he can just download a snapshot with his Web browser. But there we're back to depending on a centralized add-on service to provide a necessary feature that was deliberately omitted from the version control system, and it's what (though I dislike the term) could be called a design smell: if we know we need to support a centralized model of use, why are we spending effort trying to make the system distributed? It's also interesting that these kinds of snapshots are rarely if ever recommended to non-developer users as the preferred means of obtaining the latest software; instead, Joe is told to check out of version control and just suck up the inappropriateness of that for his use case. Why?
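A project that wanted to publish such snapshots itself, rather than lean on a hosting site, could do it with standard tools (file and directory names invented):

    # Produce a tarball of the current version, with no repository metadata in it
    git archive --format=tar --prefix=frobozz-latest/ HEAD | gzip > frobozz-latest.tar.gz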
A further elaboration of that idea, which is probably the closest thing there is to a Right Answer for Joe, would be to support the "make distcheck" target in the Makefiles, and have periodic automated builds (for instance, nightly) that build a distribution tarball from the latest version in version control. Then if you really want to get fancy you can do as in that link that made the rounds recently, and build a machine that automatically launches Nerf missiles at the cubicle of whoever broke the build. Nightly builds might be desirable for engineering reasons anyway. They are common practice in commercial software shops. Then the tarball is a real constantly up-to-date distribution (not just a snapshot of the development tree), and it can be conveyed to users by the usual techniques that are not broken. The downside of this approach is that it involves a fair bit more setup and maintenance work for both humans and servers. It could be argued (like version control snapshots) that automatically-generated tarballs are another example of bolting on an extra box to compensate for deficiencies in the version control system, but I'd say that maybe it is instead a case of building a separate system for a separate task that shouldn't be part of version control at all.
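A minimal sketch of such a nightly job, assuming an autotools-based project and invented URLs and paths, might be a cron entry driving something like this:

    # Check out the latest version, build and test a proper distribution tarball,
    # and publish it where users can fetch it with a plain Web browser
    svn checkout https://example.org/svn/frobozz/trunk frobozz
    cd frobozz && autoreconf -i && ./configure && make distcheck
    cp frobozz-*.tar.gz /var/www/downloads/frobozz-nightly.tar.gz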
I don't know what's the right answer for academic archiving. My own SVN-based archive is a lot better than the Dropbox non-solutions my colleagues are using, but it doesn't really solve all the problems I've mentioned. The git 1.7 implementation of sparse checkout is risible, but it's easy to imagine that a distributed system could be designed that would do it properly. I hesitate to suggest anyone write new code for academic archiving because then we're back to the profusion of incompatible solutions that we see with existing distributed version control. It may be possible to solve some of these problems with extra stuff bolted onto git - for instance, a succession of small repositories with external references between them and something as light-weight as possible managing the archiving of the old ones. For the moment I have to classify academic archiving as an "open problem."
Someone who wanted to defend git or other distributed systems might say that they don't force, only enable, a wide variety of development models including centralized and partially centralized models. It's better to have the ability to choose an appropriate model for the application at hand, than to have absolute centralization shoved down your throat. Indeed, the git documentation makes exactly that claim, and git in its ideal embodiment might be better called a "flexible version control system" than a "distributed version control system." But there are nonetheless assumptions built into modern version control systems about the way they will and should be used, and those assumptions actively conflict with the way people actually do use them. We should be thinking more about the systems of organization that will actually be deployed and less about the ones we think might be technically or ideologically cool.