The most recent version of this essay can be found at http://linuxmafia.com/faq/Licensing_and_Law/forking.html.

Fear of Forking essay

original version (corrected and annotated)

[1] (Footnote on overall context of this essay. Please read.)


From rick Sun Nov 14 16:13:06 1999
Date: Sun, 14 Nov 1999 16:13:06 -0800
From: Rick Moen rick
To: [several individuals at my former firm]
Subject: Essay for the Brown-Baggers: code forking
Message-ID: 19991114161305.C32325@uncle-enzo.imat.com
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 1.0i
X-CABAL: There is no CABAL.
X-CABAL-URL: There is no http://linuxmafia.com/cabal/
X-Eric-Conspiracy: There is no conspiracy.
X-Eric-regex-matching: There are no stealth members of the conspiracy.
Status: RO
Content-Length: 20135
Lines: 381

Ed, I hope you would not mind forwarding this essay to the Brown Baggers mailing list. I was trying to finish this for the sales force's benefit before my departure, but ran out of time, in my rush to get my department in order.

WHY LINUX WON'T FORK
And why being able to fork is still A Good Thing.[2]

I noticed some puzzled faces when Nick[3] did his presentation on licences at a Brown Bag session, and talked about the right to fork source code. He pointed out that the right to start your own mutant version of any open source project (which is what we mean by "forking") is an important safeguard. He and I both stressed that the absence of that right in Sun's "SCSL" (Sun Community Source Licence), used for Java, Jini, and (potentially) Solaris[4] and Star Office, is what prevents SCSL from being genuinely open source. (Borrowing a term from Eric S. Raymond, I called SCSL projects "viewable source".)

But this creates a puzzle for you guys[5]: I'll bet you have to work hard to fight customer fears that GNU/Linux [6] will fragment into a hundred incompatible versions because there's no single big corporation in charge. Right? And here Nick and I come, saying thank God open source licences guarantee everyone the right to do just that.

Sounds contradictory, right? OK, here's the quick and dirty answer. The detailed one comes later:

Linux won't fork because the fork-er has to do too much work for no payoff: Any worthwhile improvements he makes will be absorbed into the main branch, and his fork will be discarded/ignored as pointless.
This holds for Linux, even though it did not for earlier projects, because of the effect of Linux's source-code licence.

NOTABLE PAST INSTANCES OF FORKING

1. Unix --> dozens of proprietary mutant corporate Unixes

If you've read up on Unix history, you know that Unix was a freak product of AT&T's Bell Labs division, around 1969. I'll omit most of the long story, but the most important fact to know is that AT&T was then operating under a January 24, 1956 Department of Justice anti-trust judgement [7] (which expired around 1980) prohibiting it from entering the computer/software business, and requiring it to reveal what patents it held and license them when asked. So, it could not legally sell Unix, but instead sold source-code licences (and occasionally also the right to use the trademarked name "Unix") to (1) universities, such as U.C. Berkeley, and (2) companies such as IBM, Apple, DEC, Data General, SGI, SCO, HP, etc.

Those companies bought the right to make their own Unixes: IBM released AIX. Apple did A/UX. DEC did Ultrix[8], OSF/1, and Digital Unix (later renamed "Compaq Unix" and now "Compaq Tru64 Unix"). Data General did DG/UX, SGI did IRIX, HP did HP-UX, and SCO did Xenix[9], which eventually mutated into SCO Open Server. And we could cite others, but I'll spare you.

The point is that these were the jokers who ruined Unix. Every one of them marketed his mutant Unix as "Unix plus" -- everything the other guys have and more. Needing to create differentiators, they deliberately made their Unixes incompatible while giving lip service to "standards".

For customers, this was simply a mess, and Microsoft drove right through these guys' disunity like a Sherman tank. It is the classic instance of forking that sticks in people's minds. Which is why you folks are expected to assure customers that the same won't happen to GNU/Linux. We'll return to this point later.


2. BSD --> FreeBSD, NetBSD, OpenBSD, BSD OS, MachTen, NeXTStep (which has recently mutated into Apple Macintosh OS X), and SunOS (now called Solaris)[10]

As I mentioned above, antitrust-limited AT&T, not being able to sell Unix itself, gave out very cheap Unix source-code licences [11] to universities including U.C. Berkeley. UCB's Computer Systems Research Group (CSRG) took the lead in the academic world: Having access to the source code, they quickly realised that they could rewrite it to make it much better, and slowly did so. Their rewrite was dubbed "BSD" (Berkeley Software Distribution), and they were glad to share it with anyone similarly having an AT&T Unix source licence.

And their work was generally a great deal better than Bell Labs's, partly because it benefited from worldwide peer review in a very open-source-like fashion.[12] Over quite a few years, they gradually replaced almost all of the AT&T work, without (at first) really intending to.

One fine day in 1991, grad student Keith Bostic came to the BSD lead developers, inspired by Richard M. Stallman's (remember him?) [13] GNU Project, and suggested replacing BSD's remaining AT&T work to create a truly free BSD. Dreading the confrontation likely to result with AT&T, they tried to stall by assigning Bostic the difficult part of this task, rewriting some key BSD utilities. This backfired when he promptly did exactly that. So, they grumbled but then completed the job[14], and tried to prevent AT&T from noticing what they had done.

AT&T did notice[15], panicked, and sued. That, too, is a long story best omitted. Under the stress of the lawsuit, freeware BSD split into three camps (FreeBSD, NetBSD, and OpenBSD).[16] But there were also several proprietary branches[17], made possible because U.C. Berkeley's "BSD Licence" allowed creation of those: Sun Microsystems's SunOS, Tenon Intersystems's MachTen, BSDI's BSD OS [18], and NeXT Computer's NeXTStep OS all came out for sale without public access to source, and were all based on the Berkeley BSD source code.

Note the distinction: If you write a program and release the source code under the GNU General Public Licence (GPL), other people who sell or otherwise release derived works that incorporate your work must release their source code under GPL conditions. The same is not true if you release your work under the BSD Licence: Anyone else can create a variant form of your work and refuse to release his source-code modifications. (In other words, he is allowed to create proprietary variants.)

A word about the three free BSD variants: All three were splinters from a now-dead project called 386BSD. All have talked about re-merging in order to save duplication of effort, but they now persist as separate projects because they've specialised: FreeBSD aims for the highest possible stability[19] on Intel x86 (IA32) CPUs, NetBSD tries to run on as many different CPU types as possible, and OpenBSD aims to have the tightest security possible. In other words, the 386BSD project remains forked because there are compelling reasons that make this a win for everyone.

Also, where possible, these three sister projects collaborate on tough tasks -- and they also collaborate with GNU/Linux programmers. Some of the best hardware drivers in the Linux kernel are actually BSD drivers. There's a high level of compatibility among the three BSDs and between them and GNU/Linux: Unlike the proprietary Unix vendors, BSD and GNU/Linux programmers have an incentive to eliminate incompatibility and support standards.


3.  emacs  -->  GNU emacs  
                            -->  Lucid emacs  --> xemacs
           -->  other proprietary emacsen, now mostly forgotten

The Emacs editor / programming environment (short for "editing macros") was originally written by Richard M. Stallman (with Guy Steele and Dave Moon) in 1976, in TECO macros and PDP-10 assembly, to run on ITS and TOPS-20 -- at that time, under no explicit licence terms. (Stallman has clarified that it did carry a statement that "People should send changes back to me so I could add them to the distribution.") It proved wildly popular, and by 1981 had started to give rise to explicitly proprietary variants, notably James Gosling's C-coded "Gosling Emacs". [The original version of this essay's section on Emacs forks was sadly muddled, as I had confused this "Gosmacs" fork with others, in attempting to recall Emacs history solely from unaided memory, and my explanation went wrong from that point on. For this revision, I've replaced that entire section.]

In 1985, Richard Stallman resumed leadership, creating his flagship GNU Emacs version in C, based initially on Gosling's work, but replacing all Gosling code by mid-year, enabling Stallman to place the work under his newly written GNU General Public Licence, which he then did. At this point, mid-1985, Emacs's open-source history begins.

By 1991, Stallman's GNU Emacs had advanced from major version 15 through 18, with a number of point releases. NCSA originated a set of popular patches ("Epoch") to improve GUI support: GNU Emacs 19 was expected to merge Epoch's features cleanly.

So things stood as developers at Lucid, Inc. (who used Emacs with their proprietary C / C++ development tools) began participating in the GNU Emacs development effort, attempting to bring about version 19. For reasons that remain disputed (http://www.jwz.org/doc/lemacs.html), the Lucid developers and Stallman had difficulty cooperating, and the Lucid developers released their version as Lucid Emacs 19.0, in April 1992. (As a fork of GNU Emacs, it is likewise under the GNU GPL.)

The anomalous aspect of this rare fork of a GPLed work is not so much that it occurred as that it persists to this day: Lucid Emacs was renamed XEmacs in September 1994 (after Lucid, Inc. closed) and remains as popular as Stallman's version. This appears to be a rare case of differences about working styles, design issues, and management policies outweighing the advantages of re-merging. However, even here, convergence occurs: Since much of an Emacs implementation's functionality exists as elisp macros, essentially all of that code is common to the two rival Emacs projects. And each benefits from studying the other's new features and code.


4. NCSA httpd --> Apache Web server

These days, the world's standard Web server package is the Apache package, maintained by the all-volunteer Apache Group. (That is not to say that they don't make money: Members of the Apache Group such as Brian Behlendorf have practically a licence to print cash, when it comes to Web consulting, because of their well-earned fame.)

But, before there was an Apache, you ran either the University of Illinois at Urbana-Champaign National Center for Supercomputing Applications's "NCSA httpd" (HyperText Transfer Protocol daemon) or Geneva-based CERN's (Conseil Européen pour la Recherche Nucléaire's) "CERN httpd". The NCSA daemon was smaller and faster, while the CERN one was famous mostly for association with the creator of the Web, Tim Berners-Lee, who worked as a researcher at CERN. [20]

CERN's httpd (later called "W3C httpd") was always under an early sort of free-software licence. It's no longer maintained -- a dead project. It's unclear what NCSA httpd's licence was originally, but when that project died (1996) its licence was a "free for non-commercial usage only" one.[21]

In any event, the story is that an on-line group of programmers who had been producing patches (modifications) for the NCSA httpd eventually decided that they'd produced their own variant in 1995, forking the code. "Apache" was originally just Brian Behlendorf's temporary code name for the project, but fellow developers then pointed out the name's appropriateness ("a-patchy" server = "apache"; get it?), and it stuck.

This is an instance of why and how open-source projects fork benignly, for good reason: Development at NCSA had stalled after the package's original creator, Rob McCool, left the Center. If that happened to a proprietary product, it would just die, leaving all its users in the lurch. However, because the product was so useful, the Apache Group forked the source code and kept driving it forward. It now dominates all Web servers, regardless of their marketing and development budgets.


5. gcc --> pgcc --> egcs --> gcc

Here's an odd one: Richard M. Stallman (remember him?) founded the GNU Project in 1984, which produced[22] the immensely important GNU C Compiler ("gcc"). gcc is designed to work on just about any remotely feasible computer, not just the Intel x86 (IA32) series. So, it might just have been other priorities that delayed improved Intel support. Specifically, well into 1997, the best the then-current gcc 2.7 series could do for code optimisation on Intel was to set the compiler for 486 chips. People pleaded with FSF for Pentium optimisation, but were stubbornly ignored.

So, two separate groups, in succession, developed Pentium-optimised compilers as forks from gcc 2.7. The first was "pgcc", from the Pentium Compiler Group, a consortium consisting mainly of Intel Corporation staffers. pgcc produced very fast code via a two-pass process, but was completely non-functional on gcc's non-Intel platforms, and for that reason could not be accepted into the main gcc code. Further, it departed so radically from the base gcc code that it proved difficult for pgcc to track gcc improvements.

However, the Pentium Compiler Group distributed its work widely, and its Web site remained available as a major resource on Pentium optimisation issues for interested parties -- so much so that my initial version of this section, based on memories of that site, inadvertently confused the Intel/PCG work with the later egcs work (discussed below). My thanks to Ian Lance Taylor of CYGNUS for helping me straighten out the account.

Perhaps inspired in small part by receiving a copy of pgcc, but more so by a desire to make their jobs easier, improve the compiler they worked extensively with, and broaden the development model to include more developers (than just Richard Kenner, who was in charge of gcc), programmers at the CYGNUS company of Sunnyvale, California (the one that was recently bought by Red Hat Software, Inc.) independently followed up pgcc with the more-successful egcs compiler. Unlike pgcc, egcs was only a modest departure from gcc 2.7, was equally portable, and was, like gcc, single-pass. And it was a very clear improvement over Kenner's 2.7 and 2.8 gcc series, not just in adding Pentium support.

For whatever reason, Stallman's Free Software Foundation (developers of the GNU Project) continued to act as if egcs didn't exist. So, GNU/Linux distributions began to emerge based on egcs, and the free-software world began to mostly ignore gcc.

This can be seen as a variant on the Apache experience. The ability to fork means that progress will not be impeded by a developer not wanting to move forward: Somebody else can, as gracefully as possible, assume the leadership role and (if necessary) fork the project.

However, this necessity was averted in the egcs case. In April 1999, the FSF re-merged egcs into the (would-be) main gcc branch, and handed over all future development to the egcs team (such that egcs 1.2 became gcc 2.95), thereby resolving the conflict.


6. glibc --> Linux libc --> glibc

This is a nearly mirror-image case. Any Unix relies extremely heavily on a library of essential functions called the "C library". Richard M. Stallman's (remember him?) GNU Project wrote [23] the GNU C Library, or glibc, starting in the 1980s. When Linus and his fellow programmers started work on the GNU/Linux system (using Linus's "Linux" kernel), they looked around for free-software C libraries, and chose Stallman's[24]. However, they decided that FSF's library (then at version 1-point-something) could/should best be adapted for the Linux kernel as a separately-maintained project, and so decided to fork off their own version, dubbed "Linux libc". Their effort continued through versions 2.x, 3.x, 4.x, and 5.x, but in 1997-98 they noticed something disconcerting: FSF's glibc, although it was still in 1-point-something version numbers, had developed some amazing advantages. [25] Its internal functions were version-labeled so that new versions could be added without breaking support for older applications, it did multiple language support better, and it properly supported multiple execution threads.

The GNU/Linux programmers decided that, even though their fork seemed a good idea at the time, it had been a strategic mistake. Adding all of FSF's improvements to their mutant version would be possible, but it was easier just to re-standardise onto glibc. So, glibc 2.0 and above have been slowly adopted as the standard C Library by GNU/Linux distributions.

The version numbers were a minor problem: The GNU/Linux guys had already reached 5.4.47, while FSF was just hitting 2.0. They probably pondered for about a millisecond asking Stallman to make his next version 6.0 for their benefit. Then they laughed, said "This is Stallman we're talking about, right?", and decided out-stubborning Richard was not a wise idea. So, the convention is that Linux libc version 6.0 is the same as glibc 2.0.


7. Sybase --> Microsoft SQL Server

Woody Allen has a saying that "The lion may lie down with the lamb, but the lamb won't get much sleep". Much the same can be said of companies that enter "industry alliances" with Microsoft Corporation. One of the several slow-learner corporations to make this mistake was Sybase Corporation, publisher of the Sybase Structured Query Language (SQL) database package for numerous Unixes and NetWare. As part of the alliance, Microsoft sold Sybase to its customers, relabeled as Microsoft SQL Server, and got access to Sybase's source code under non-disclosure agreement.

Then, predictably, Microsoft broke the alliance when it had learned all it could from Sybase, and reintroduced Microsoft SQL Server as its own product in competition with Sybase. I do not know if current MS SQL Server versions are rewritten from scratch or retain Sybase code under licence terms[26], so this may not be a legitimate case of forking (let alone open source), but it's similar enough that I thought I should mention it.


ANALYSIS: WHY OPEN-SOURCE FORKING IS BOTH RARE AND BENIGN

You, the reader, can fork any open source project at any time. This is absolutely not cause for alarm. Let's prove it: Get a copy of the current Linux kernel from ftp://ftp.kernel.org/. Rename it. Call it Fooware OS. Send out messages to everywhere you can think of, announcing that Fooware OS has splintered off from Linux, and great things are expected of it.

Wait for reactions. Wait some more. Listen to the clock ticking. Sort your lint collection. Open up the source code tree, think about what you might do with it, and wonder where you're going to find the time.

Well, that's a little unfair: You're probably not a programmer. Let's imagine that you are. You're a ninja programmer with mighty code-fu, a drive to succeed, and a disciplined team of programmer henchmen. So, you don't just listen to the clock tick, but get s