[sf-lug] Another victory...

Rick Moen rick at linuxmafia.com
Mon May 15 14:52:39 PDT 2006


[A friend commented off-list about the Linux machines apparently now in
a bad state at Javacat.  Since I'd rather post publicly, I'm omitting
quoting the friend's private remarks.]

Sounds like we don't know who did what, really.  The fellow who
installed Linux on the new machines might have messed up the
installation, or someone else might have screwed around with them later.
I hope Jim does a bit of investigation before blowing away and
re-loading.

Linux machines sitting out in [semi-]public, for general use, are
inherently under _some_ ongoing threat of people monkeying with their
boot configuration, or cracking root and damaging things, etc.  The key
to their doing well is to expect this and plan for dealing with it.
This is ideally iterative, i.e., you deploy based on your best guess
about what will work, then closely observe non-expert users' problems
and take corrective measures to fix them.

E.g., I can remember a couple of changes made to the CoffeeNet machines
after the first couple of weeks:

1.  Richard had anticipated that some people would Ctrl-Alt-F1 and
Ctrl-Alt-Del in order to trigger pointless reboots, and had accordingly
re-edited the "ca:" line in /etc/inittab to make Ctrl-Alt-Del map to
something harmless and inert, instead of "/sbin/shutdown -t1 -a -r now".
This turned out to be
a mistake:  Those people who were determined to initiate pointless
reboots, when foiled in the above fashion, instead just yanked and
re-plugged the system power cord -- which was much more perilous to
system health.  So, it turned out to be much smarter to let them do
their dumb, pointless reboots via Ctrl-Alt-Del, which at least ensured
an orderly shutdown and umount.  (This was before the days of journaling
filesystems.)

2.  We found a few workstations apparently in a hung state, from which
the customer had walked away (sometimes saying the system had "crashed",
sometimes not), that turned out upon examination to have eight or ten
Netscape Navigator instances crammed into memory at once.  This was a
little puzzling, until we observed the syndrome:  Impatient customer
pressed the "N" (Netscape browser) button on the tkGoodStuff button bar.
Not getting instant browser pop-up, and not bothering to notice the
disk-activity light, he/she pressed it again a second later.  Then
again.  Then again.  Then again.  The script invoked by tkGoodStuff to
launch the browser wasn't smart enough to detect existing instances
under that same EUID, so it kept spawning more as requested.  Thirty
seconds later, a few browser windows started opening, with machine
performance slowing to a crawl as it swapped itself nearly to death.

The cure was, of course, to insert a few lines of logic to check for
running browser instances and exit without launching another if any were
found.  This is
probably now an obsolete problem, since I believe that default browser
wrapper scripts now routinely include such checks, but the moral is:  
Unsophisticated users will find and trigger error modes you hadn't even
realised were possible.  Only observation and corrective action will
find and defuse those pitfalls.
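
A minimal sketch of that kind of check, in Python rather than whatever
the CoffeeNet wrapper actually used.  The function names are
hypothetical, and the /proc scan assumes a modern Linux kernel that
provides /proc/<pid>/comm; nothing here is the original script.

```python
import os


def list_processes(proc_root="/proc"):
    """Yield (pid, uid, name) for each process listed under a Linux /proc."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        proc_dir = os.path.join(proc_root, entry)
        try:
            # /proc/<pid> is normally owned by the process's effective UID.
            uid = os.stat(proc_dir).st_uid
            with open(os.path.join(proc_dir, "comm")) as handle:
                name = handle.read().strip()
        except OSError:
            continue  # the process exited while we were looking
        yield int(entry), uid, name


def already_running(name, uid, processes):
    """True if any process in `processes` has this name and this UID."""
    return any(
        p_uid == uid and p_name == name for _pid, p_uid, p_name in processes
    )


def launch_once(argv, uid, processes, spawn):
    """Call spawn(argv) unless this user already has argv[0] running.

    Returns True if a new instance was started, False if the request was
    ignored because one already exists.
    """
    if already_running(os.path.basename(argv[0]), uid, processes):
        return False
    spawn(argv)
    return True
```

A button-bar wrapper could then call something like
launch_once(["netscape"], os.geteuid(), list_processes(), subprocess.Popen),
so that the second and later impatient clicks do nothing.  Note that
the kernel truncates the name in "comm" to 15 characters, so longer
program names would need a different match.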

Richard also had to research and test some fairly arcane, poorly
documented information about XDM scripts that can be automatically run
at the beginning of login, and at logout (GiveConsole, TakeConsole), for
system maintenance purposes.  These could be caused to run as (and be
owned by) root, so that the users couldn't fool with them.  For example,
/tmp had to be cleared out upon logout, otherwise people would leave
vast amounts of junk there.
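
As an illustration only, the logout-time cleanup might look roughly
like the following Python sketch.  The function name is hypothetical,
and limiting the sweep to files owned by the departing user is an
assumption made here (so that sockets and other system files are left
alone); the actual CoffeeNet script is not reproduced.

```python
import os
import shutil
import stat


def clear_user_files(directory, uid):
    """Delete top-level entries in `directory` owned by `uid`.

    Returns the sorted names of the entries that were removed.  Symbolic
    links are unlinked, never followed.
    """
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        try:
            info = os.lstat(path)  # lstat: inspect the link, not its target
        except FileNotFoundError:
            continue  # vanished between listdir() and lstat()
        if info.st_uid != uid:
            continue  # belongs to someone else; leave it alone
        try:
            if stat.S_ISDIR(info.st_mode):
                shutil.rmtree(path)
            else:
                os.unlink(path)
        except OSError:
            continue  # could not remove it; skip rather than abort
        removed.append(name)
    return removed
```

Run as root from a logout hook, this would be called with "/tmp" and
the UID of the user whose session just ended.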


The most important decision Richard made was to locate all the
most-crucial files (mail spools, user home directories, authentication
information) on the protected NFS server upstairs in his apartment,
leaving only (very replaceable) generic distro binaries and libraries on
the workstation boxes down in the cafe.  Anyone cracking root on, or
otherwise abusing, the workstation boxes would gain nothing useful:
Because the NFS exports used the "root_squash" flag, gaining root on the
downstairs machines got you _less_ access on the upstairs server than
you had before.  Any workstation suspected of being fooled with in that
fashion would, however, be (quickly, easily) reimaged to avert mischief.
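
The effect of "root_squash" can be modelled in a few lines of Python.
This is a toy model of the UID mapping only, not NFS server code; the
function name is hypothetical, and 65534 is assumed as the anonymous
UID because it is the usual Linux default.

```python
NOBODY_UID = 65534  # assumed anonymous UID; the usual Linux default


def effective_server_uid(client_uid, root_squash=True, anon_uid=NOBODY_UID):
    """Return the UID the server acts as for a request from client_uid."""
    if root_squash and client_uid == 0:
        return anon_uid  # root on the client is demoted to the anonymous user
    return client_uid  # ordinary users keep their own UID
```

An ordinary user (say, UID 1000) still reaches his/her own files, while
UID 0 arrives at the server as the unprivileged anonymous user.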

Other needs for light scripting also arose, but I can't remember
details.  E.g., every user had a specific disk quota imposed on him/her
programmatically, and it was necessary to run reports on which users had
hit quota -- because inevitably there were people who, through mailing
list subscriptions or innumerable other means, had used up all their disk
space but never figured that out.  (Users would come up and say "Hey,
why is my mail being refused?" and be completely clueless about what the
"550 User over quota" delivery status notification meant.)

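The core of such a quota report is small.  Here is a rough Python
sketch, assuming the usage and limit figures have already been
collected into dictionaries (in practice they would come from the
system's quota tools, whose output is not modelled here).  The function
name and the sample figures are made up for illustration.

```python
def over_quota(usage, quotas):
    """Return sorted user names whose usage has reached their quota.

    `usage` and `quotas` map user name -> kilobytes.  Users with no
    quota entry are treated as unlimited and skipped.
    """
    return sorted(
        user
        for user, used in usage.items()
        if user in quotas and used >= quotas[user]
    )


# Made-up sample data: alice is exactly at her limit, carol has no limit.
usage = {"alice": 5000, "bob": 1200, "carol": 9000}
quotas = {"alice": 5000, "bob": 5000}
for user in over_quota(usage, quotas):
    print(user, "is at or over quota")
```

Mailing the resulting list to the admin (or to the users themselves)
catches the problem before the bounced mail does.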



