[conspire] Sat, 1/10 Installfest/RSVP

Tue Jan 13 02:29:08 PST 2009

Quoting Nick Moffitt (nick at zork.net):

> Again, the power of apt is the result of a policy process, not a magic
> piece of technology.

Quite.  And functional policy is what tends to prevent exactly the sort
of theoretically-possible libs problem you mentioned.

Anyway, I stay on testing/unstable in part to obviate the most obvious
source of madness you describe:  backporting, which you cited in one of
your original messages.  Man, what a time sink, and so easily avoidable
in its entirety.

And I carefully stay away from problem code that I don't need.
rdiff-backup?  No thanks.  

> Also I'd really recommend you look into pbuilder for clean builds of
> your packages.

OK, but the only thing I'm obliged to build locally at the moment is
leafnode -- and that's so ridiculously simple that, to date, I haven't
even bothered to debianise it.  Since it relies just on one lib, I just
said "Screw it.  We can do ./configure && make && make install, just
this once."

> > ...Deirdre's PHP4 code for BALE had broken because MySQL suddenly ceased
> > working without explicit COMMIT statements...
> [...]

Yes, exactly as I said.  There was a time when the monthly cronjob that
populated the MySQL events table from the template table suddenly and
mysteriously broke:  There was nothing wrong with the Python that ran
it, and the SQL that it generated _used_ to work.  What broke the latter
was a system MySQL upgrade to a newer version that no longer honoured
the implicit COMMIT statements that were initially relied on in our
cronjob code snippet.

That is, the function definition

  def create_events(cursor):

is now followed immediately by

  cursor.execute("start transaction")

and the entire function is now terminated by

  cursor.execute("commit")

At the time Deirdre originated the code, those sorts of explicit
transaction delimiters were optional.  Later, they weren't, and neither
of us had noticed the change in the underlying MySQL engine, nor (in
particular) that change's relevance to the code.
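
To make the failure mode concrete, here is a minimal sketch of the
pattern.  It is not the actual BALE code.  It assumes Python's bundled
sqlite3 module as a stand-in for MySQL (so the opening statement is
spelled "BEGIN" rather than "start transaction"), and the column name
"name", the "commit" flag, and the helpers run_job(), count_events(),
and demo() are all hypothetical, added only for illustration.  The exact
mechanism inside MySQL may well have differed; the sketch shows only
that work left uncommitted when the connection closes is discarded,
while explicitly committed work survives.

```python
import os
import sqlite3
import tempfile


def create_events(cursor, commit=True):
    """Copy every template row into events inside an explicit transaction.

    The commit flag is hypothetical; it exists only to show what happens
    when the closing COMMIT is left out.
    """
    cursor.execute("BEGIN")  # sqlite3 spelling of "start transaction"
    cursor.execute("INSERT INTO events (name) SELECT name FROM template")
    if commit:
        cursor.execute("COMMIT")


def run_job(path, commit):
    """Simulate one cron run: connect, populate events, disconnect."""
    # isolation_level=None stops the module from managing transactions
    # itself, so only the statements in create_events() control them.
    conn = sqlite3.connect(path, isolation_level=None)
    create_events(conn.cursor(), commit=commit)
    conn.close()  # an uncommitted transaction is rolled back here


def count_events(path):
    """Return the number of rows visible in events from a new connection."""
    conn = sqlite3.connect(path)
    total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    conn.close()
    return total


def demo():
    """Return (rows kept without COMMIT, rows kept with COMMIT)."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(path, isolation_level=None)
        conn.execute("CREATE TABLE template (name TEXT)")
        conn.execute("CREATE TABLE events (name TEXT)")
        conn.executemany(
            "INSERT INTO template (name) VALUES (?)",
            [("event-a",), ("event-b",)],
        )
        conn.close()

        run_job(path, commit=False)
        without_commit = count_events(path)
        run_job(path, commit=True)
        with_commit = count_events(path)
        return without_commit, with_commit
    finally:
        os.remove(path)
```

Calling demo() runs the job once without the trailing COMMIT and once
with it, and reports how many events rows a fresh connection can see
after each run.  The nasty part of the no-COMMIT case is that nothing
raises an error:  The job appears to succeed, and the rows simply are
not there afterwards.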

> That's because BALE doesn't have the same sort of SLA as what I'm
> describing.

That's also because BALE has zero staffing (zero funding, too, but
that's beside the point).  Frankly, there were bigger lingering problems
with the system that weren't being attended to and should have come
first.  The nastiest of them are still present, a couple of years later:
this system is still running on an 11-year-old VA Research model
500 with 256 MB total RAM and the same original pair of 9GB SCSI drives
it shipped with -- which drives are continually running out of space,
requiring me to shift a few more things around to make room.

It's not that I can't afford better hardware:  I have some.  I just
haven't had the consecutive time and stamina and
concentration-while-being-left-alone-without-being-pestered required to
get that done.

So, don't tell me that BALE having broken code for a while reflects "not
having the sort of SLA as what I'm describing":  That's literally true,
in spades, _but_ misses the point:  The entire host system is
something operating entirely outside of the world of resources,
service commitments, economics, and related metrics, that continues to
operate based on pure benign neglect, stubbornness, dumb luck,
strategically chosen areas of laziness, and a few careful policies
designed to apply limited time where it matters most, and limit damage
when things go wrong.

The server machines I ran at $EMPLOYER were a different matter entirely.
And there, I _did_ have pretty tight SLAs -- and met them, and was
confident of doing so.

> And when you increase the number of systems and services by
> an order of magnitude, it becomes a more common occurrence.  It sounds
> like at your scale, it's infrequent enough as to not be a significant
> concern.  Other problems probably interfere with production services
> more often than this, making them the focus of attention.
> 
> And you at least had a simple chain of communication about the BALE
> problem, too.  You didn't have different departments each assuming
> they'd done the right thing and that the other department was the group
> who broke the public Web site that was just given mention in _The
> Journal of Record and Popular Gossip_.  It was you and Deirdre, likely
> sitting side-by-side, debugging the problem and rolling out a fix.
> 
> I agree that reasonable people could continue to operate at that larger
> scale, keeping the priorities and techniques you employ, and do a great
> job at it.  But I don't think it's fair to characterize being cautious
> about the above scenario as "foolishness".
> 
> > If I had hundreds of machines and an obligation to extremely high
> > levels of availability, I'd probably use something like the golden
> > master system to checkpoint changes -- or at least a private
> > flow-through repository on which I did some local QA, filtering
> > upstream.
> 
> And that is a reasonable approach, to be sure.  Maybe do this a dozen or
> so times: one for each class of server, to isolate variables during the
> testing/QA phase.  Roll out to the dogfood/testing servers first, refine
> as you go.  
> 
> You still have a bit of a struggle with the "Congratulations in-house
> developers: you're writing for a constantly-moving target!" message, but
> I certainly wouldn't characterize this approach as "foolishness" either.
> 
> > Nothing all _that_ special.  I make admin queues time out pretty
> > quickly on the Mailman end.  
> 
> This may be the key step I was missing, I think.  I'm doing all the same
> things with postfix you're doing with exim4 here.  I'll look into tuning
> that a bit on my private server.  Thanks!
> 
> -- 
> "There should be a homonym exam before people are         Nick Moffitt
> issued keyboards."     -- George Moffitt                 nick at zork.net