Date: Tue, 06 Jan 1998 02:56:34 -0800
From: Grant Boucher grantboucher@earthlink.net
Subject: Re: ALPHANT Digest V1 #431
Message-ID: 34B20DE2.E9186C04@earthlink.net

system@listserv.mke.ra.rockwell.com wrote: Aaron,

>> Damn, if only we could shake the UNIX/Linux bigotry towards
>> NT (perhaps with some more applications that a 3D rendering
>> guru would use) and it would be a swell marketing platform
>> for Alpha. Too bad Linux steals the spotlight for affordability
>> on this one (and a 64bit ultra-cheap OS of all things).
>
> Maybe I have too little knowledge about NT but I cannot see
> how NT would be a good choice for this Titanic project at all.

uh, as the person who recommended, supervised, and implemented DEC Alpha
at Digital Domain, I would like to clear up a few matters....first, half
of the 160 Alpha render farm was Windows NT 4.0. Only half was linux.
Unlike the Linux machines, the NT machines and the Digital Unix servers
NEVER crashed, routed IP packets automatically (just hit the check box
under Network config for NT) and basically ran rings around the Linux
machines for ease of use, installation, and reliability. It took days of
kernel recompiles just to get the linux boxes to even barely work and they
NEVER properly routed packets (an NT machine was configured in 15 minutes
when they finally gave up on Linux). The Linux farm was unreliable and
problematic for weeks when compared with the NT farm, and this was the
SAME hardware, network etc. I am sorry to disappoint all the Linux fans
out there, but in a production environment, Linux was found to be
seriously wanting when compared to NT. NT was the ONLY operating system
during Titanic that did not crash the servers at all...EVER. Irix on the
SGIs and Linux on the Alphas both crashed DAILY...sometimes more than a
few times a day.

> In my opinion Unix/Linux is structurally far ahead for
> this kind of applications.

since these were simple Command line renderers, with simple parameters
passed to them, your comment makes no sense whatsoever...again, ONLY the
linux and irix boxes crashed during the production of Titanic...the NT
boxes were the most reliable on the production...period.

> The openness of especially Linux makes everybody can see
> what could be made better, everybody can help with the
> debugging of applications.

huh? you are really reaching here...Linux is a shareware OS and the
decision to risk the biggest film of all time on it was a terrible
mistake in my opinion.

> Windows-NT is still a PC-operating system, nothing more and
> nothing less. Of course you can use it as a file server
> (Already possible with Windows For Workgroups) but I know
> of no one who lets run tasks on another Windows-NT machine
> over here! So even if it is possible nobody uses it and it
> will not be developed much.

big mistake...Windows NT is a totally different animal than Windows
95 and Titanic would not have delivered without it. I suggest you
take a closer look at it. Linux is a shareware version of an
antiquated OS from the 1970s...nothing more, nothing less. :}

>> No one ever mentioned that Lightwave 3D was the primary
>> software to run on those Alphas.

LightWave was the ONLY software running on the NT farm and NT
workstations. The choice of linux for the other farm was merely a
convenience for two programmers (the ones who wrote the article), who
could have easily ported command-line code to NT as well as Linux. This,
and other similar decisions, cost the facility (and actually Fox) a
fortune in time and lost productivity as every time the linux machines
bombed out, dozens of compositors were left in the lurch (every one of
them being paid very high rates per hour mind you). The only problem
exhibited by the LightWave/NT machines came from the render control
software, which we just replaced when it became clear that the control
software was "found wanting". This problem was not the least bit OS
related. In fact, one of my favorite Linux moments was one of the authors
of the article asked the NT sysadmin "how many OS related crashes do you
get a day?" The answer was, of course, "None" because neither of us would
have recommended NT machines on a production like Titanic if they weren't
100% reliable. Perhaps he was trying to see if the hardware was to blame.
The author, puzzled, decided not to tell us how many times per day the
Linux OS was crashing. :}

I am sure that someone with enough experience and technical knowledge
could have configured the Linux farm to work as flawlessly as the DEC Unix
and Windows NT machines, but the simple fact is that the NT machines
practically configured themselves, ran flawlessly from the minute they
were powered on, and still are. And ANYBODY could have set them up...all
of the production NT machines were administered by one person...and he had
plenty of free time on his hands to get really good at Bust-A-Move on the
N64. Now THAT'S reliability! In fact, the only problems we ever had with
the NT machines was the fact that we had to cripple their networking
because the D2 SGIs were over a year and a half out of date with Irix OS
revisions, meaning we had to run NFS2 instead of NFS3 for
everything...YEESH! Sidebar -> Now, it seems as though installing a Samba
client on the SGI is the smartest, fastest, cheapest (free!) way to get
Irix to NT connectivity.

I know this post comes off as rather harsh, but you have no idea how much
of a cluster-f**k the whole Linux thing turned out to be. This is one of
THE principal reasons our new FX facility is entirely NT...period.

Hope this clears things up...my intention is not to start a flame war, but
merely to make sure the TRUTH gets out.

Peace.

Grant Boucher
Formerly, Digital FX Supervisor, Head of the Windows NT Division, and
Digital Titanic Technical Supervisor, Digital Domain
Presently, CEO of station X studios, LLC.


Date: Tue, 27 Jan 1998 10:10:33 -0800
From: Daryll Strauss daryll@d2.com
To: fritzs@mcm250.mcm.edu, jeff@snoopy.gwr.com, hahn@neurocog.lrdc.pitt.edu,
sopwith@cuc.edu, jalderson@gph.com, redhat-list@redhat.com
Subject: [daryll: Digital Domains use of Linux on Titanic]
Resent-Date: Tue, 27 Jan 1998 14:11:06 -0500 (EST)
Resent-From: Elliot Lee sopwith@cuc.edu
Resent-To: linuxnet@cabi.net

[I left the header from Grant's message included below so that you can
trace the original conversation. He posted this to the alpha-nt mailing
list January 6th. It seems that people are forwarding his message again
while neglecting to include my followup to this discussion. I've included
that response below. This was my final posting on this thread. There was
another posting from Grant after mine, but I felt it had degenerated to
a level that further responses would not be useful. You are welcome to
look it up yourself if you are so inclined. - |Daryll]

-----Forwarded message from Daryll Strauss <daryll>-----

Message-ID: 19980107185209.60060@jolt
Date: Wed, 7 Jan 1998 18:52:09 -0800
From: Daryll Strauss <daryll>
To: alphant@listserv.mke.ra.rockwell.com
Subject: Digital Domains use of Linux on Titanic
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.85
Organization: Digital Domain


I felt like I needed to address some of the comments Grant has made
about our Linux Alpha cluster. I'm trying to avoid this becoming a flame
war and instead just concentrate on the facts of the case.

- |Daryll

From: Grant Boucher grantboucher@earthlink.net
Sent: Tuesday, January 06, 1998 5:57 AM
Subject: Re: ALPHANT Digest V1 #431

GB> uh, as the person who recommended, supervised, and implemented
GB> DEC Alpha at Digital Domain, I would like to clear up a few
GB> matters....

Grant was a digital artist at Digital Domain. The official decisions
about the purchase of the systems were made by our director of
technology. I did the installation of the cluster, and implemented the
Linux portion of the cluster.

GB> first, half of the 160 Alpha render farm was Windows NT 4.0.
GB> Only half was linux.

Half the machines were Linux originally, until they (the Titanic crew)
found that the NT boxes really weren't as useful. The 105 machines I
quoted in my article were the configuration roughly one third of the way
into the project. 40 machines were converted from NT to Linux.

GB> Unlike the Linux machines, the NT machines and the Digital Unix
GB> servers NEVER crashed, routed IP packets automatically (just hit
GB> the check box under Network config for NT) and basically ran
GB> rings around the Linux machines for ease of use, installation,
GB> and reliability. It took days of kernel recompiles just to
GB> get the linux boxes to even barely work and they NEVER
GB> properly routed packets (an NT machine was configured in 15
GB> minutes when they finally gave up on Linux).

First, the NT boxes did crash. The systems administrator for the
NT boxes I'm sure would attest to that. Unfortunately, they don't
report their uptime, and were silently rebooted. So, there really
isn't a measure of how reliable the NT boxes were. I do think
they remained up more than the Linux boxes for reasons I
explain later.

Second, we run a slightly unusual network. I did have trouble
with the FDDI card under Linux. We opted not to use it because
of the problems, but also because we could spare the NT boxes
(they weren't being heavily used), and it was a solution that
minimized downtime. We were very busy, and it was the expedient
solution. The other problem is that the NT box did route
packets, but not very quickly. The overall performance was not
very good for the speed of the link.

Third, I did describe in my article the troubles we had with
that version of the Linux kernel. They weren't minor, but we
did manage to resolve them relatively quickly. As I mentioned,
I believe most of them would not be true for current users.

GB> The Linux farm was unreliable and problematic for weeks
GB> when compared with the NT farm, and this was the SAME
GB> hardware, network etc. I am sorry to disappoint all
GB> the Linux fans out there, but in a production environment,
GB> Linux was found to be seriously wanting when compared to
GB> NT. NT was the ONLY operating system during Titanic that
GB> did not crash the servers at all...EVER. Irix on the
GB> SGIs and Linux on the Alphas both crashed DAILY...
GB> sometimes more than a few times a day.

I'm not sure where Grant got his numbers about downtime. Perhaps
he is extrapolating from the initial setup. Once the machines
were up and configured they worked very reliably. The machines
are still in heavy use, and have an average uptime of around
60 days.

The most common cause for crashes was environmental conditions.
Unfortunately, we under-equipped the air conditioning in the
room, and the outside air temperature approached 110 degrees,
in some places. A few of the processors that were being used
in that area died (quite understandably). In one of those
places a couple of the Linux boxes died, the NT boxes in
those areas stayed alive. That was because the Linux boxes
were being heavily used, while the NT boxes sat idle.

The other crash that was more serious for Linux was caused
by bugs in the NFS implementation. When a Linux box was
being actively used and the SGI server went down, this caused
the NFS implementation on Linux to hang. This was a serious
problem for us that sometimes required resetting the
machines. This was also a fairly infrequent occurrence. I'd
estimate once every couple weeks. Again, I believe current
versions would not have these problems.

GB> since these were simple Command line renderers, with
GB> simple parameters passed to them, your comment makes no
GB> sense whatsoever...again, ONLY the linux and irix boxes
GB> crashed during the production of Titanic...the NT
GB> boxes were the most reliable on the production...period.

The problem with the NT boxes is that they never got a
reasonable NFS implementation. The NFS on the NT Alphas was
extremely slow. The lack of support for symbolic links made
using our disk space effectively very difficult. The
limitation of 26 mounted drives was insufficient. We
avoided this problem in the most expedient way possible.
We dedicated NT file servers and moved all the NT data to
those file servers; that way they didn't have to interconnect
with the rest of the NFS environment. They could remain their
own isolated NT solution.

>> The openness of especially Linux makes everybody can see
>> what could be made better, everybody can help with the
>> debugging of applications.

GB> huh? you are really reaching here...Linux is a shareware
GB> OS and the decision to risk the biggest film of all time
GB> on it was a terrible mistake in my opinion.

Linux is, of course, a freely available operating system.
Having source allowed us to fix problems we encountered that
we could not have done with a standard commercial OS. Of
course, we would hope we don't have problems to fix, but
frankly that never happens. There are bugs in every OS, and
our environment stresses the operating systems.

GB> big mistake...Windows NT is a totally different animal
GB> than Windows 95 and Titanic would not have delivered
GB> without it. I suggest you take a closer look at it.
GB> Linux is a shareware version of an antiquated OS
GB> from the 1970s...nothing more, nothing less. :}

Well, this is obvious bait. So I won't address much. I agree
Windows 95 and Windows NT are entirely different animals. Linux
is a very modern operating system, and many of the technologies
are very current in operating systems.

GB> LightWave was the ONLY software running on the NT farm
GB> and NT workstations. The choice of linux for the other
GB> farm was merely a convenience for two programmers (the
GB> ones who wrote the article), who could have easily ported
GB> command-line code to NT as well as Linux. This, and other
GB> similar decisions, cost the facility (and actually Fox)
GB> a fortune in time and lost productivity as every time
GB> the linux machines bombed out, dozens of compositors
GB> were left in the lurch (every one of them being paid
GB> very high rates per hour mind you). The only problem
GB> exhibited by the LightWave/NT machines came from the
GB> render control software, which we just replaced when it
GB> became clear that the control software was "found wanting".
GB> This problem was not the least bit OS related.

Lightwave was used on the NT systems.

The choice of Linux was made for a number of reasons. The
primary one was integration into the rest of our facility. The
ease of porting our applications did come into play. Our
distributed rendering system and compositing system were much
easier to get running under Linux than NT. Since then, we have
ported those applications, as it makes the NT systems more
productive.

Not having an effective means of distributed rendering on
the NT boxes was a serious problem. That was not the case for
the Linux boxes.

GB> In fact, one of my favorite Linux moments was one of the
GB> authors of the article asked the NT sysadmin "how many OS
GB> related crashes do you get a day?" The answer was, of
GB> course, "None" because neither of us would have recommended
GB> NT machines on a production like Titanic if they weren't
GB> 100% reliable. Perhaps he was trying to see if the hardware
GB> was to blame. The author, puzzled, decided not to tell us
GB> how many times per day the Linux OS was crashing. :}

As I said before, we definitely had environmental problems in
the room. The question I asked was related to that fact. My
choice of operating system was not stopping systems to the point
that they wouldn't boot the ARC console. Other shutdowns were
also diagnosed by our vendor as heat problems. The fact that NT
never failed this way indicates it wasn't being used as heavily.

GB> I am sure that someone with enough experience and technical
GB> knowledge could have configured the Linux farm to work as
GB> flawlessly as the DEC Unix and Windows NT machines, but the
GB> simple fact is that the NT machines practically configured
GB> themselves, ran flawlessly from the minute they were powered
GB> on, and still are. And ANYBODY could have set them up...all
GB> of the production NT machines were administered by one person...
GB> and he had plenty of free time on his hands to get really good
GB> at Bust-A-Move on the N64. Now THAT'S reliability!

I agree that Alpha Linux is still too hard to use, in general. The
Intel version is much simpler, and new OS releases make the
installation process even easier. In our case, my engineering
time was cost-effective compared to buying ANY OS on those machines.

We have no way to measure the reliability of the NT stations as
they don't record their uptime. They also don't record their
usage, which made our billing process very difficult. The
Linux users could easily identify, report, and avoid problems,
which allowed their complaints to be handled quickly and efficiently.

By the way, I was the only person to support and manage the Linux
machines. I had enough time to continue my normal job of writing
software for the rendering of Titanic, while doing it.

GB> In fact, the only problems we ever had with the NT machines
GB> was the fact that we had to cripple their networking because
GB> the D2 SGIs were over a year and a half out of date with Irix
GB> OS revisions, meaning we had to run NFS2 instead of NFS3 for
GB> everything...YEESH!

I already listed the numerous performance and stability problems
we encountered using NFS on the NT boxes. At the time our IRIX
was not the latest, but it was interoperability problems between
NT NFS and IRIX NFS that caused the problem. There was no proof
that the IRIX OS was the cause of that interoperability problem.

GB> Sidebar -> Now, it seems as though installing a Samba client
GB> on the SGI is the smartest, fastest, cheapest (free!) way to
GB> get Irix to NT connectivity. I know this post comes off as
GB> rather harsh, but you have no idea how much of a cluster-f**k
GB> the whole Linux thing turned out to be. This is one of THE
GB> principal reasons our new FX facility is entirely NT...period.
GB> Hope this clears things up...my intention is not to start a
GB> flame war, but merely to make sure the TRUTH gets out.

Digital Domain and its technical staff were quite pleased with the
performance of Linux. The work on this show would have cost
substantially more if we had not been able to use it effectively.

Samba works fairly well. Again, it allows NT to remain isolated
and not interoperate with the rest of the facility. I'm not
convinced this is the best solution, but it does avoid the problem
of not having an adequate NFS implementation.

If you ask current employees and the management at Digital Domain,
I believe they will all tell you that Linux was a success. Unlike
Grant's response, my article was read and reviewed by the
management at Digital Domain.

GB> Peace.
GB> Grant Boucher
GB> Formerly, Digital FX Supervisor, Head of the Windows NT Division, and
GB> Digital Titanic Technical Supervisor, Digital Domain
GB> Presently, CEO of station X studios, LLC.

Digital Domain has never had a "Windows NT Division", nor would
we want one. Basing a division or a company on the choice of operating
system their computers run would be foolish. We use whatever tools
make the most sense for the task at hand. We continue to use NT, and
to port applications to NT. My personal opinion is that NT will
be more important at our facility over time.

Opinions will always differ between two individuals, even those who
witness the same events. One has to look at the credibility and
biases of the person making the claim, as well as their relationship
to the facts presented. I tried to provide a fair and even coverage
of our experiences. I believe I'm in a reasonable position to
address these topics.

Daryll Strauss
Manager, Software Development
Digital Domain

PS. I'm not a usual reader of this list. I'm going to remain subscribed
for a while to partake in the current discussions. Also feel free to
mail me directly, if you are so inclined.