[conspire] Waah, my little server crashed...
Ed Biow
biow at sbcglobal.net
Thu Nov 9 21:31:03 PST 2006
> Message: 1
> Date: Sat, 4 Nov 2006 13:45:44 -0800
> From: Rick Moen <rick at linuxmafia.com>
> Subject: Re: [conspire] Waah, my little server crashed...
> To: conspire at linuxmafia.com
> Message-ID: <20061104214544.GL26620 at linuxmafia.com>
> Content-Type: text/plain; charset=us-ascii
>
> Quoting Ed Biow (biow at sbcglobal.net):
>
>
>> I have a little Debian Sarge machine that I generally leave on all the
>> time, a $107.00 VIA Samuel 2 Asus Terminator jobbie that does yeoman
>> work as my local http, ftp and file server, plus light desktop duties.
>> (It is a bit pokey despite 512 MB of SDRAM, so it isn't my preferred
>> workstation). But it is handy and very reliable, and hopefully goes
>> easy on the juice. This morning I tried to access it from another box
>> and it wasn't responding, so I went downstairs and, lo and behold it was
>> off.
>>
>
> Well, that's a real poser, because normally _software_ problems would
> not cause the machine to shut down and power off. They might make the
> machine hang with a kernel panic message, or have critical processes
> segfault, or just seize up and give no indication of what's wrong, or
> reboot -- but all of those fault outcomes would tend to leave the
> machine verifiably powered up although not necessarily "running" in the
> functional sense.
>
> So, I'm concluding that it's pretty definitively a hardware problem.
> Of course, it could have been a one-time thing.
>
>
>> Anyway, I'm trying to figure out why it shut down, whether it is a
>> failing component or an OS glitch or just a momentary lapse of power.
>> I figure the first place to look is /var/log, but I really don't know
>> where to look.
>>
>
> Indeed, /var/log/messages often doesn't have a lot other than time marks
> in it. syslog is worth skimming just to be thorough, maybe daemon.log,
> dmesg, kern.log. However, don't be surprised if the root cause simply
> wasn't visible to your operating system and software, because it's at a
> hardware level that's not software-visible. For example, Deirdre just
> mentioned to me that this could easily be a sign of a weak or failing
> power supply unit (PSU).
>
>> Since the system is on all the time I'm thinking maybe the drive is
>> beginning to have problems, so I'd like to check drive integrity.
>> Should I check the hard drive surface using the proprietary utility that
>> came with my disk? Or should I reboot to a live CD and run something like:
>>
>> fsck -t ext3 /dev/hdaX
>>
>
> It's always good to know how to check hard disks. "fsck -c" not only
> runs the badblocks utility, but also makes sure that any bad blocks
> found are mapped out and not used prospectively. "Hard Drive Utilities"
> on http://linuxmafia.com/kb/Hardware has hyperlinks to all of the
> manufacturers' HD-diagnosis utilities for their models, and those are
> worth knowing about, as well. In addition, the smartmontools can listen
> in on your HD's internal self-checking routines, and help track HD
> health and predict failure.
>
> However, none of those are very likely to be relevant to your problem,
> because I just cannot easily conceive of a hard drive problem that would
> cause the machine to power off. HD issues tend to have completely
> different sorts of symptoms.
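
For anyone following along, here is a minimal sketch of the disk checks mentioned above. It only prints the commands rather than running them, since e2fsck should be run against an unmounted filesystem (for example from a live CD). The function name disk_check_cmds and the device names /dev/hda and /dev/hda1 are placeholders for illustration, not something from the thread. It assumes smartmontools and e2fsprogs are installed. The -c option belongs to e2fsck, which fsck calls for ext2/ext3 filesystems.

```shell
# Print (do not run) a basic set of health checks for one disk and one of
# its ext2/ext3 partitions. Both arguments are placeholders you supply.
disk_check_cmds() {
    disk=$1   # whole disk, e.g. /dev/hda
    part=$2   # one partition on it, e.g. /dev/hda1

    # Ask the drive for its overall SMART health verdict.
    echo "smartctl -H $disk"
    # Start the drive's own long self-test (it runs in the background).
    echo "smartctl -t long $disk"
    # Show the self-test log once the test has had time to finish.
    echo "smartctl -l selftest $disk"
    # Force a full check; -c runs a read-only badblocks scan and records
    # any bad blocks found so the filesystem stops using them.
    echo "e2fsck -f -c $part"
}
```

Running "disk_check_cmds /dev/hda /dev/hda1" lists the four commands so they can be reviewed and then run by hand as root.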
>
>
>> Maybe I should complement that with a nice couple of hours round of
>> memtest, as well.
>>
>
> Again, you could reasonably let memtest86 run overnight from, say, a
> Knoppix live CD, but I really doubt that memory problems are your root
> cause. Memory problems can cause random reboots, segfaults, SIG11
> errors, silent data corruption, or other runtime weirdness. Memory
> problems can even cause the machine to not power on, or produce no
> video, when you hit the power switch. However, I'm not aware of a RAM
> problem that would cause the machine to power off.
>
>
>> Or would the path of prudence be to just back up my data and hope it
>> doesn't happen again?
>>
>
> 1. Backing up your data is good on its independent merits. 2. If it
> happened once and never again, then just blame sunspots or a Disturbance
> in the Force and worry about global warming, instead.
>
> Remember, once is accident. Twice is coincidence. Three times is enemy
> action: You might have a new motherboard or PSU in your future. Or not.
> But don't rush out and start buying new parts until you have more to go
> on.
>
> (I could be talking out /dev/ass, so use your own best judgement, not to
> mention your eyes and ears, which are often your best diagnostic tools.)
>
Thanks for the feedback, folks. I was preoccupied with other issues the
last few days, so it took me a while to run some tests and reply.

Anyway, the machine has been on since the mystery shut-down without a
glitch, so I'm going to attribute it to dybbuks unless it occurs again.
Memtest & fsck didn't reveal anything grim. I looked over
/var/log/daemon.log.0 and didn't notice anything other than that my ntp
daemon seems to be querying some ntp server about once an hour, which
seems to be excessive, but I've got other fish to fry and will look
into that later. The kern.log & dmesg didn't reveal much either.
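
A rough sketch of how the log skim could be narrowed down, in case it is useful: a small grep wrapper that pulls out only the lines that hint at a shutdown, thermal or power event. The function name scan_shutdown_hints and the keyword list are my own guesses at what is worth looking for, not an authoritative set.

```shell
# Print lines from the given log files that mention shutdown, power,
# thermal or ACPI events, or a kernel panic.
# Usage: scan_shutdown_hints FILE...
scan_shutdown_hints() {
    # -i ignore case, -H always show the file name, -E extended regex.
    # Unreadable or missing files are silently skipped.
    grep -iHE 'shutdown|power|thermal|temperature|acpi|panic' "$@" 2>/dev/null
}
```

On a Debian box this might be called as "scan_shutdown_hints /var/log/syslog /var/log/kern.log /var/log/daemon.log". An empty result does not prove much, since a sudden loss of power leaves nothing in the logs.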

My best guess is that the power supply is becoming a little hinky from
being on 24/7. Although the CPU doesn't eat much juice, being a VIA C3,
the power supply is only 165 watts. And since the case is a small form
factor, I can't just drop in a standard, more powerful ATX PS.
http://www.newegg.com/Product/Product.asp?Item=N82E16856110056&ATT=56-110-056&CMP=OTC-Froogle

If it happens a couple more times I'll try my ATX power supply tester if
I can find it. Or just plug in a regular ATX PS and run it with the case
open. Not really an elegant solution. If it is still weird I guess it
would be a motherboard issue (bad caps?), in which case the rig will go
to chip heaven.

Maybe it will give me an excuse to upgrade to something that has a
little more zip for desktop work. The VIA C3 really IS a little anemic
for some purposes. Video playback is dreadful. Flash playback is really
choppy. Certainly the lousy UniChrome onboard video doesn't help a lot,
but it works fine on my KM266 chipset Athlon XP 1600 system with Kubuntu
Edgy (even a software TV card & Xinerama works fine on that one, though
I haven't tried recording on it yet).

I sure don't want to run a power-hungry PIV 24/7 just to download
bittorrent distros and serve files. Does Linux ACPI throttle back power
use very well? If Linux ACPI works well enough I could just transfer my
servers over to one of my more powerful Athlon or Sempron computers. If
ACPI doesn't throttle back the juice sufficiently I could just run BOINC
or some other distributed client so I can at least do something socially
useful with all those extra cycles.
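
One way to see whether a given box is throttling at all, as far as I know, is to look at the kernel's cpufreq entries in sysfs. The sketch below assumes the usual /sys/devices/system/cpu/cpuN/cpufreq layout and a loaded cpufreq driver. The function name show_cpufreq is made up for illustration. If it prints nothing, no frequency scaling driver is active for that CPU.

```shell
# Report the cpufreq governor and current frequency for each CPU.
# Optional argument: base directory (defaults to the usual sysfs path).
show_cpufreq() {
    base=${1:-/sys/devices/system/cpu}
    for dir in "$base"/cpu[0-9]*/cpufreq; do
        # Skip the unexpanded glob when no cpufreq directory exists.
        [ -d "$dir" ] || continue
        cpu=$(basename "$(dirname "$dir")")
        gov=$(cat "$dir/scaling_governor")
        cur=$(cat "$dir/scaling_cur_freq")   # reported in kHz
        echo "$cpu governor=$gov cur_khz=$cur"
    done
}
```

Calling "show_cpufreq" with no argument on an idle machine and again under load gives a quick idea of whether the clock actually drops.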

Does anyone have a suggestion for a power-thrifty rig with acceptable
desktop performance (extra points for quiet & small) without shelling
out $500 for a Mac Mini, banish the thought?
>
> ------------------------------
>
> Message: 3
> Date: Sat, 4 Nov 2006 23:02:08 +0000
> From: Nick Moffitt <nick at zork.net>
> Subject: Re: [conspire] Waah, my little server crashed...
> To: conspire at linuxmafia.com
> Message-ID: <20061104230207.GO5074 at zork.net>
> Content-Type: text/plain; charset=us-ascii
>
> Rick Moen:
>
>> Well, that's a real poser, because normally _software_ problems would
>> not cause the machine to shut down and power off. They might make the
>> machine hang with a kernel panic message, or have critical processes
>> segfault, or just seize up and give no indication of what's wrong, or
>> reboot -- but all of those fault outcomes would tend to leave the
>> machine verifiably powered up although not necessarily "running" in
>> the functional sense.
>>
>> So, I'm concluding that it's pretty definitively a hardware problem.
>> Of course, it could have been a one-time thing.
>>
>
> The possible corner case here, of course, would be a bug in the ACPI
> stack.
>
> The possibility is unlikely enough, however, that I'd concur with Rick's
> assessment. Hell, it's more likely to be a bug in the ACPI hardware
> than in anything software-related.
>
> Your usual Linux critical failure mode results in either failed system
> calls and oopses (which may kill an individual service daemon or user
> process), or a panic which will wedge your system hard but leave all the
> fans spun up and the lights on.
>
> Not being funny, but I don't suppose you had a nearby wall-powered
> digital clock start flashing 12:00 as well? It could have been the
> result of a power interruption and a strict BIOS setting.
>
No, I thought about that one. All the digital clocks in my bedroom,
oven, & microwave seemed OK, though I didn't mention it. I suppose it
could have been a brown-out that didn't quite affect them, though.
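
A hedged sketch of one more check that bears on the power-interruption theory: "last -x" lists boot and shutdown records from wtmp, newest first, and a boot with no shutdown record before it suggests the machine lost power or crashed rather than shutting itself down cleanly. The function name classify_boots is hypothetical, and the field positions used for the date assume the usual "reboot system boot KERNEL Day Mon DD HH:MM" layout, which may differ between versions of last.

```shell
# Read "last -x" output (newest first) on stdin and report, for each boot,
# whether the next older boot/shutdown record is a clean shutdown.
classify_boots() {
    awk '
        $1 == "reboot" || $1 == "shutdown" {
            if (pending != "") {
                if ($1 == "shutdown")
                    print pending ": clean shutdown recorded before this boot"
                else
                    print pending ": no shutdown record before this boot"
            }
            pending = ""
            # Remember this boot by its date fields until we see what
            # came before it. Field numbers are an assumption.
            if ($1 == "reboot")
                pending = $5 " " $6 " " $7 " " $8
        }
    '
}
```

Typical use would be "last -x | classify_boots". The oldest boot in the file is not reported, because there is nothing earlier to compare it against.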

Probably VIA & Asus ACPI support is quite adequate. And it has been
running Sarge since it was released without incident (but not
continuously; it gets shut down overnight occasionally). Maybe now that
the Democrats are in charge of Congress again the machine will behave.

Thanks again,
Ed