[sf-lug] semi-OT: help with dying disk

Sat Mar 15 00:34:42 PDT 2008

Quoting matt.price at utoronto.ca (matt.price at utoronto.ca):

> rick, your posts suggest that the disk's manufacturer makes a big  
> difference when you're on a laptop -- for some there'll be no  
> satisfactory solution, for others you'll want to set an aggressive  
> power management in the software, for still others you really DON'T  
> want to do that.

Sounds about right!

> can you suggest a resource for deciding what brands  
> to use, or give some advice yourself?

Er, you sort of caught me unprepared with that utterly excellent
question.  That is, I really should have worked out a proper answer to
it, but I haven't.  I have a couple of ideas and a prejudice.  

The prejudice is that you should search the Net for what people have to
say (e.g., search for a particular model number plus "linux"), _but_ be
extremely skeptical, because the Net is full of Linux users with
peculiar ideas, urging bad solutions, promoting mistaken theories, and
jumping to conclusions.  (Web discussion forums are particularly bad, in
this area.[1])

One idea is that, for hard drives _you already have_ (as opposed to ones
you're contemplating acquiring), /usr/sbin/smartctl from the
Smartmontools is your friend:  It gives you a lot of information.  
(I realise you said "what brands to use".  I'll get to that point
later.)

Point 1.  
The good news is:  It gives you a lot of information.
The bad news is:  It gives you a lot of information.
(Run "/usr/sbin/smartctl -a /dev/hda | more", to see what I mean.)

I see a lot of people out there over- and misinterpreting the
information reported by smartctl -- and in fact it's easy to
misinterpret or misjudge what it's saying.  For example, some of 
smartctl's output is _predictive_, e.g., it predicts HD failures 
based on manufacturer-issued duty cycle statistics.  smartctl's 
failure predictions that are based on duty cycles thus do _not_
necessarily indicate any unit-specific information about the HD's
health:  They're just reporting manufacturer-predicted averages.

Point 2.
_Some_ parts of smartctl output are really useful, but you'll have to
read a bit to figure out which they are, and how to interpret them.
E.g.:

#   smartctl -a /dev/hda | grep Load_Cycle_Count

...reports a number that's been much-discussed in the recent "Feisty
Fawn is destroying my hard drives" hoo-hah, the load/unload cycle count.
Arguably, if that number is rising too quickly, which means ACPI power
management is stopping the HD and parking its heads _really_ often,
then you should consider measures like the famous "hdparm -B 255 /dev/hda",
which disables Advanced Power Management measures for the drive entirely
(at some cost in battery life).
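To put a number on "rising too quickly", here's a rough sketch (mine, not
anything shipped with Smartmontools) that divides the load/unload count by
the power-on hours.  The function name cycles_per_hour is made up, and it
assumes the usual "smartctl -A" attribute table, with the attribute name in
column 2 and the raw value in column 10.  Not every drive reports both
attributes, in which case it just prints "n/a".

```shell
#!/bin/sh
# Sketch: read `smartctl -A` output on stdin and print the average number
# of load/unload cycles per power-on hour.
# Assumption: attribute name is field 2, raw value is field 10.
cycles_per_hour() {
    LC_ALL=C awk '
        # "+ 0" forces a numeric reading, so a raw value such as
        # "4321h+12m" is taken as 4321.
        $2 == "Load_Cycle_Count" { cycles = $10 + 0 }
        $2 == "Power_On_Hours"   { hours  = $10 + 0 }
        END {
            if (hours > 0)
                printf "%.1f\n", cycles / hours
            else
                print "n/a"    # attribute missing, or drive is brand new
        }
    '
}

# Usage (as root, device name as in the text above):
#   smartctl -A /dev/hda | cycles_per_hour
```

What counts as "too many" per hour is a judgement call; compare the total
against the load/unload rating in your drive's datasheet, if it gives one.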

Point 3.
Some of the more valuable smartctl data is _cumulative_, i.e., is
based on data gathered from the drive over its life by periodic checks
using smartctl that get logged and then used to estimate drive longevity
and spot developing patterns that tend to indicate impending drive
failure and give you time to move your data elsewhere.  (You should have
current backups anyway.  smartctl is not a substitute.)  Thus, the tool
is not one you deploy when a drive is already suspect or in trouble and 
then expect miracles.  Rather, it's one that gives best service if you 
employ it regularly.
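As an illustration of "employ it regularly", here's a minimal sketch of a
logging helper you could call from cron.  The function name
log_smart_attrs, the log path in the usage comment, and the choice of
three attributes are all my own assumptions; it also assumes the usual
"smartctl -A" table layout (name in column 2, raw value in column 10).
The smartd daemon that comes with Smartmontools exists to do this sort of
periodic checking properly, so treat this only as a way to see the idea.

```shell
#!/bin/sh
# Sketch: append one timestamped line of selected raw SMART values to a
# log file, so that trends can be read off later.
# Reads `smartctl -A` output on stdin; $1 is the log file to append to.
log_smart_attrs() {
    logfile=$1
    stamp=$(date '+%Y-%m-%d %H:%M:%S')
    LC_ALL=C awk -v stamp="$stamp" '
        # Pick out a few attributes worth watching over time
        # (an arbitrary selection, for illustration).
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Load_Cycle_Count"      ||
        $2 == "Power_On_Hours" {
            line = line " " $2 "=" ($10 + 0)
        }
        END { print stamp line }
    ' >> "$logfile"
}

# Usage (as root; the log path is hypothetical):
#   smartctl -A /dev/hda | log_smart_attrs /var/log/smart-hda.log
```

A steadily climbing Reallocated_Sector_Ct in such a log is the kind of
developing pattern that's worth acting on while the drive still works.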

Point 4.
It's been reliably reported that some HD manufacturers cause their HD
ROMs to _lie_ in their reporting of SMART (Self-Monitoring, Analysis,
and Reporting Technology) data, to make their drives artificially look
good.  (Some values reported in the SMART data rely on the manufacturers
to specify what's a normal value, e.g., operating temperature, and
there's a temptation to be Pollyanna-ish about one's own drives.)  A
Google study found that, partly as a result, scrupulous SMART monitoring
was able to predict only about half of all drive failures.

But, what you really asked is:  What brands are good?  I have
prejudices^W opinions.  So does everyone else.  We might or might not
know what we're talking about.  Our experiences might not be
representative, or they might not be recent, or they might be tied to our
purchases from particularly good or particularly sucky retail sources.
Yes, where you buy retail goods can matter a _lot_, because some
vendors pick up "seconds" inventory that has more problems, and/or 
deal with customer warranty returns by just re-boxing them and selling them
again.

All that aside, one source of predictive information that's pretty
relevant and objective is warranty duration:  Check the length and terms
of warranties.

There was a time when most PATA (parallel IDE) drive manufacturers were
dropping their warranty periods to 3 years, or even in some cases to 1
year.  Suddenly, Seagate Technology, which had been suffering some PR
problems, announced that it was putting its warranty terms back out to 5
years.  Dunno about you, but *I* find that to be a fairly convincing
display of confidence in one's product.

As always, beware of old data.  (The Seagate anecdote, supra, was
important in its day, I think.  You should seek more-current data if
making purchase decisions now, however.)

[1] I _could_ point out that the Ubuntu user Web forums are particularly
infamous in this area, but that would be Bad and Wrong.  ;->