[conspire] Parts is Parts

Rick Moen rick at linuxmafia.com
Sun Apr 13 16:38:30 PDT 2008


Quoting Don Marti (dmarti at zgp.org):

> Me, I'd rather have a nice little Linux laptop in front of me as a
> relatively smart terminal, and have as much as is feasible of both my
> data and my substantive computing occur on my own server(s).

Christian's situation is a rather freaky and (nearly) unique one, where
he needs -- or, it is said that he needs -- to have literally all of his
16 terabytes of video data reside locally on a Linux workstation in front of
him.  (The master copy of the raw video data, his 380 hours of video
footage, is on some sort of tape.)  Presumably, he has or will have a
baker's dozen or so of relatively cheap, marginally reliable,
terabyte-plus SATA hard drives that will hold all of his in-process work 
material -- and wants all of that storage locally mounted on a single
machine, with no drive redundancy in order to maximise performance and
reduce hardware cost.  (That's a calculated risk:  accepting the chance
of losing editing work when drives fail, in order to save money on extra
drives for redundant storage.)

It's a bit difficult to put myself in his shoes for that project,
because, first of all, I'm not a video guy.  Second, you'll note the
appearance of words like "presumably" and "it is said", meaning that I'm
having to speculate as to his exact situation because the data provided
so far have been uselessly vague.

However, if I _were_ in his position, I think first of all I'd do some
hard thinking about ways to eliminate the necessity to have all of that
ridiculously large data set reside on the local workstation at once.  
Even the professional CGI rendering houses for Hollywood don't end up
needing to do that:  They have a limited number of reasonably powerful
workstations in front of them, and a large number of headless, racked
Linux servers doing most of the substantive work.

Which gets me back to my point about mice and elephants:  A _large_
amount of the more severely screwed-up IT infrastructure I've seen over
the past twenty years in business has resulted directly from people
insisting on thinking like DOS/Windows users, when in fact they're being
paid to do intelligent systems design.  But people who cut their teeth
on Word/Excel point-and-drool work with Packard-Bells tend to see _any_
subsequent computing problem as merely requiring a sufficiently
scaled-up Packard-Bell.  

Thus my point.

One of my consulting clients, for a long time, was a firm that processed
most of the continent's mortgage data from banks and other financial
institutions.  I took over as network consultant from a FreeBSD guy
who'd been trying for years, fruitlessly, to end their model of pulling
down mammoth data sets from file servers onto very large NT
workstations, where all of it was crunched locally, and then the results 
mass-copied back to the file servers.  Even with gigabit everything,
their LAN infrastructure was near collapse from the traffic load, almost
all the time -- and it was a hideously inappropriate and inefficient way
to move data around -- but that was all the resident crew of MFC/C++
programmers knew about, and they were in a position to override any
advice from mere LAN/server consultants, so that's what they did.

Now, in contrast, it may be that Christian's situation is one of the
very few where true client/server LAN software architecture just cannot
work, and he really does need to have 100% of a 16+ terabyte dataset
reside locally to his console machine.  (At least he isn't proposing to
frequently copy it back and forth across a busy corporate LAN.)
However, it _doesn't_ follow that it's either necessary or desirable to
mount a dozen or so terabyte-plus SATA drives in a single workstation
machine (like a good little Packard-Bell user).

Consider the risks of so doing:

o  It puts all the drives on the same power bus, along with every other
   power-using component in the box.  This means power irregularities 
   originating from _any_ of the drives, or the motherboard, or the PSU 
   itself, can destroy any or all of the attached devices in about two 
   seconds.
o  It puts all the drives in the same cooling environment in the same
   enclosure.  Again, a device (drive or fan) seizing up and going hot
   is in a position to destroy the whole array.

And, campers, what _is_ it that destroys components long before their
expected useful lives, more than anything else?  Number one is power
irregularities on the PSU side, from either a stressed/failing/junky PSU
or some attached component.  (I've seen such events take out some or all
attached hard drives, more often than I can count.)  Second is component
stress from heat buildup.

Now, I've never had to design a workstation machine with that much local
storage, and I'm probably not likely to be very good at it without some
professional-level research and planning -- which nobody's paying me to
do, at the moment.  However, I'd sure try hard to get the drives in a
_separate_ box (or two) from the workstation motherboard, preferably
one or more enclosures designed specifically for drive arrays, maybe with
redundant PSUs, and certainly with cooling set up specifically for such
a heavy-duty system.  

And I'd look hard at a Coraid ATA over Ethernet (AoE) system (especially
now that the aoe driver has been in the mainline kernel since 2.6.11).
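
(For anyone unfamiliar with AoE:  once the aoe module is loaded and a
shelf is on the wire, its exported slots just show up as ordinary local
block devices under /dev/etherd/.  Here's a minimal Python sketch of
what I mean -- the e<shelf>.<slot> naming is the driver's usual scheme,
but everything else about it is illustrative assumption, not anything
specific to Christian's setup:

#!/usr/bin/env python3
# Minimal sketch:  list the AoE block devices the kernel's aoe driver
# has discovered, and report their sizes.  Assumes "modprobe aoe" has
# been done and that exported slots appear as /dev/etherd/e<shelf>.<slot>
# (the driver's usual naming).  Run as root, since opening raw block
# devices needs read permission.
import glob
import os

def device_size_bytes(path):
    """Size of a block device, found by seeking to its end."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, os.SEEK_END)
    finally:
        os.close(fd)

def main():
    devices = sorted(glob.glob('/dev/etherd/e*.*'))
    if not devices:
        print('No AoE devices found -- is the aoe module loaded?')
        return
    for dev in devices:
        try:
            print('%-20s %8.1f GB' % (dev, device_size_bytes(dev) / 1e9))
        except OSError as err:
            print('%-20s error: %s' % (dev, err))

if __name__ == '__main__':
    main()

You then partition, mkfs, and mount those exactly as if they were
locally attached SATA drives -- which, as far as the filesystem layer
is concerned, they are.)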

Last, I'd carefully follow _your_ advice, Don, and make sure that
there's good built-in monitoring and control, to spot and act on heat
problems and find/eliminate power-wasting processes.  (And I'd make
_damn_ sure I had components with good Linux ACPI support.)
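
Even something as simple as a periodic per-drive temperature poll out of
cron goes a long way.  Here's the sort of thing I mean, as a rough
Python sketch built on smartmontools' smartctl -- the device list, the
Temperature_Celsius attribute name, and the 45 C threshold are all
assumptions for illustration; a real setup would more likely just run
smartd with sensible alerting directives:

#!/usr/bin/env python3
# Rough sketch:  poll each drive's reported temperature via
# "smartctl -A" and complain if any exceeds a limit.  Assumes
# smartmontools is installed, the drives expose the common
# Temperature_Celsius attribute, and the device list below matches
# reality -- all of which you'd adjust for the actual hardware.
# Needs root, since smartctl reads the drives directly.
import subprocess

DRIVES = ['/dev/sd%s' % c for c in 'abcdefghijkl']   # hypothetical dozen drives
LIMIT_C = 45                                         # illustrative threshold

def drive_temp(device):
    """Return the drive's reported temperature in Celsius, or None."""
    result = subprocess.run(['smartctl', '-A', device],
                            capture_output=True, text=True)
    for line in result.stdout.splitlines():
        if 'Temperature_Celsius' in line:
            # smartctl's attribute table puts the raw value in the
            # tenth whitespace-separated column.
            fields = line.split()
            try:
                return int(fields[9])
            except (IndexError, ValueError):
                return None
    return None

def main():
    for dev in DRIVES:
        temp = drive_temp(dev)
        if temp is None:
            print('%s: no temperature reading' % dev)
        elif temp > LIMIT_C:
            print('%s: %d C -- too hot, check airflow/fans' % (dev, temp))
        else:
            print('%s: %d C ok' % (dev, temp))

if __name__ == '__main__':
    main()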

But, of course, in this case, Christian's probably already in the middle
of some already-decided-on architecture or other, so this discussion is
pretty much pointless except to make general, theoretical points, anyway.
