[conspire] Sat, 1/10 Installfest/RSVP
nick at zork.net
Tue Jan 13 01:54:41 PST 2009
> Making it work on Hardy in the sense that it be built by typing
> "fakeroot dpkg-buildpackage" (or whatever variant on that you use
> locally) should also suffice to make it successfully build, install,
> and run in any other binary environment. I mean, that's part of what
> we have self-hosting distributions and dependency-tracking _for_.
And lots of the dependency work is real trial and error. Build-time
errors are separate from runtime errors. Doing bisection on
incompatibilities on a running production service is plain Not Fun.
Even the official packages have to all be tested in combination for the
kind of Just Works integration *within* the set of all Debian packages
that you describe (hence the quarantine for Testing).
Again, the power of apt is the result of a policy process, not a magic
piece of technology. That process can be a real ball of hair while it's
still in motion. Most people avoid the hairball by sticking to software
from official repositories and not rolling out anything custom (and not
using packages with upstream policies like rdiff-backup's *grumble*).
Also I'd really recommend you look into pbuilder for clean builds of
your packages. That'll make a pristine chroot and basically do much of
what a proper official Debian/Ubuntu buildd does.
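As a starting point, a minimal ~/.pbuilderrc sketch might look like the
following (the mirror and paths are my assumptions; adjust for your site):

```shell
# ~/.pbuilderrc -- sketch only; mirror and paths are local assumptions
DISTRIBUTION=hardy                              # base release for the chroot
MIRRORSITE=http://archive.ubuntu.com/ubuntu     # or a local mirror
BASETGZ=/var/cache/pbuilder/hardy-base.tgz      # pristine chroot tarball
BUILDRESULT=/var/cache/pbuilder/result/         # where the built .debs land
```

Then "sudo pbuilder create" builds the base tarball once, and each
"sudo pbuilder build foo.dsc" unpacks a fresh copy, installs only the
declared Build-Depends, and builds -- so a missing build dependency shows
up on your machine instead of on someone else's.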
> …Deirdre's PHP4 code for BALE had broken because MySQL suddenly ceased
> working without explicit COMMIT statements…
> So, once in a very long time, you get unexpected breakage following an
> upgrade, and have to chase it down. I just haven't yet seen this be a
> big deal.
That's because BALE doesn't have the same sort of SLA as what I'm
describing. And when you increase the number of systems and services by
an order of magnitude, it becomes a more common occurrence. It sounds
like at your scale, it's infrequent enough not to be a significant
concern. Other problems probably interfere with production services
more often than this, making them the focus of attention.
And you at least had a simple chain of communication about the BALE
problem, too. You didn't have different departments each assuming
they'd done the right thing and that the other department was the group
who broke the public Web site that was just given mention in _The
Journal of Record and Popular Gossip_. It was you and Deirdre, likely
sitting side-by-side, debugging the problem and rolling out a fix.
I agree that reasonable people could continue to operate at that larger
scale, keeping the priorities and techniques you employ, and do a great
job at it. But I don't think it's fair to characterize being cautious
about the above scenario as "foolishness".
> If I had hundreds of machines and an obligation to extremely high
> levels of availability, I'd probably use something like the golden
> master system to checkpoint changes -- or at least a private
> flow-through repository on which I did some local QA, filtering
And that is a reasonable approach, to be sure. Maybe do this a dozen or
so times: one for each class of server, to isolate variables during the
testing/QA phase. Roll out to the dogfood/testing servers first, refine
as you go.
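To sketch the flow-through idea (the hostname and suite name here are
made up): each server class points apt at an internal suite that only
receives packages after they clear local QA, and pins that suite above
everything else:

```shell
# /etc/apt/sources.list on a "webserver"-class machine (hypothetical host)
deb http://apt.internal.example.com/ubuntu hardy-webqa main

# /etc/apt/preferences -- prefer the QA'd internal suite
Package: *
Pin: origin "apt.internal.example.com"
Pin-Priority: 900
```

The per-class suites are what isolate the variables: a package only
reaches hardy-webqa after it survives the dogfood class's suite.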
You still have a bit of a struggle with the "Congratulations in-house
developers: you're writing for a constantly-moving target!" message, but
I certainly wouldn't characterize this approach as "foolishness" either.
> Nothing all _that_ special. I make admin queues time out pretty
> quickly on the Mailman end.
This may be the key step I was missing. I'm doing all the same
things with postfix you're doing with exim4 here. I'll look into tuning
that a bit on my private server. Thanks!
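For anyone else chasing this: if I'm reading Mailman 2.x's Defaults.py
right, the relevant knob is max_days_to_hold (site default
DEFAULT_MAX_DAYS_TO_HOLD), which lets the checkdbs cron job discard held
messages after that many days. A sketch of the site-wide setting:

```python
# /etc/mailman/mm_cfg.py -- site default; individual lists can override.
# Sketch only: check your own Defaults.py; 0 means "hold forever".
DEFAULT_MAX_DAYS_TO_HOLD = 3
```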
"There should be a homonym exam before people are Nick Moffitt
issued keyboards." -- George Moffitt nick at zork.net