[conspire] 3rd Master Hard Disk Error
Michael Paoli
Michael.Paoli at cal.berkeley.edu
Thu Nov 22 15:10:32 PST 2018
> Date: Sun, 18 Nov 2018 19:21:55 +0000 (UTC)
> From: "paulz at ieee.org" <paulz at ieee.org>
> To: Conspire List <conspire at linuxmafia.com>
> Subject: [conspire] 3rd Master Hard Disk Error
> Message-ID: <937443752.2841122.1542568915791 at mail.yahoo.com>
> Content-Type: text/plain; charset="utf-8"
>
> During POST and before GRUB, one of my computers gives the error message
> 3rd Master Hard Disk Error
> and says press F1 to continue.
> Well this PC has 3 hard drives. Which one is it? Doing a
> web-search I find "information" in 2 categories:
>
> * Access the BIOS to turn off the test.
> * Replace "the hard drive".
>
> At first I thought "3rd" meant a specific drive like /dev/hdc.
> Apparently computers with only 1 drive generate the same message.
>
> I have taken the precaution to make copies of important data on
> different physical drives.
Well, not a whole lot of detail (some more in subsequent messages, but
still not all that much).
Backups ... yep (whole 'nother topic)
So, first thing I think of with error diagnostics:
they're often the least exercised code in the system/software,
so, while they're not likely to be totally inaccurate,
it is fairly common that what they claim or state the issue to be
isn't quite right ... but if not spot on, it's typically something
closely, or at least somewhat, associated with whatever is actually
failing/failed or misbehaving, etc. - but even then, not always.
So, always take error diagnostics with a grain of salt and presume that
they may not be exactly correct (and are even sometimes totally off-base).
"3rd Master Hard Disk Error"
So, we (or at least I) would start to question and be skeptical of
the diagnostics (sometimes they're quite unclear, e.g. I remember one
from a BIOS years ago: "But Segment Doesn't Found").
So, yeah, diagnostics may not be highly clear in what they're
attempting to communicate.
So ... "3rd" ... 3rd ... what, drive? Or 3rd time that error's been
caught? Maybe it counts the errors but only counts/buffers up to 3,
and doesn't report 'till it's hit that threshold of 3? "3rd" ...
counting starting from 0, or 1?
Is there any hardware RAID involved? If so, the diagnostic might not
be about physical drive(s), but about some (sub-)component of the
RAID or some logical portion thereof, etc.
"Master" - sounds fairly likely to be "Master" of Master-Slave ...
where Slave may not even be present - so maybe referring to the
Master/Primary on IDE/ATA/PATA, and being the "3rd", it might count
'em:
1st: Master on 1st IDE/ATA/PATA bus
2nd: Slave on 1st IDE/ATA/PATA bus
3rd: Master on 2nd IDE/ATA/PATA bus
...
Does it count floppies first? Even if they're purely legacy/vestigial
and may not physically exist or have controllers, but might still be
in the BIOS code because nobody took it out yet?
I'd probably proceed - presuming there's no hardware RAID involved - as
follows ... well, it also depends upon priorities and objectives. If one
is more concerned about the data and less about the host and its services
remaining up, one may want to reboot to single user mode - or, if keeping
services up/running as feasible is more important than the data, then
don't do that. ;-)
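E.g., a minimal sketch of dropping to single user mode, assuming a
systemd-based distribution (on older sysvinit-style setups, telinit 1
is the rough equivalent):
# systemctl isolate rescue.target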
Try a full read of all the physical drives, e.g.:
might want to mount a bit 'o USB flash or SD or the like to have a
place to write some stuff ... or log stuff while connected over network
or via serial console ... anyway, maybe something like:
# dd if=/dev/sda ibs=512 of=/dev/null &
Can likewise do that for each drive (e.g. also /dev/sdb, /dev/sdc, ...
whatever drives you have that may possibly be suspect)
Can also redirect stderr on those, e.g. 2>sda.dd.err
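E.g., a minimal sketch kicking off a background read of each drive and
capturing any errors per drive (assuming the suspect drives are sda, sdb,
and sdc - adjust to whatever is actually present):
# for d in sda sdb sdc; do dd if=/dev/"$d" ibs=512 of=/dev/null 2>"$d".dd.err & done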
Might also look at existing log files - it may have caught and logged
relevant error(s) there.
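E.g., a rough sketch (log file names/locations vary by distribution, and
the patterns are just a starting point):
# dmesg | grep -iE 'ata[0-9]|i/o error' | less
# grep -iE 'ata[0-9]|i/o error' /var/log/syslog /var/log/messages 2>/dev/null | less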
In any case, once those dd processes have completed, did any give any
errors?
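E.g., picking up the per-drive stderr captures from the sketch above
(the *.dd.err names are just the hypothetical ones used there):
# wait
# grep -il 'error' sd?.dd.err
On a clean run each file holds only dd's records in/out summary; any I/O
error lines (and the records in/out counts, which show how far the read
got) point at the drive that's having trouble.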
If all the disks read all through without errors, then you may be in good
to excellent shape - and there would remain the question/curiosity of
what caused the diagnostic, and whether there's still an issue - or
possibly an intermittent issue - or not.
You can also examine the SMART data on the drives - there are utilities
that can do that - that can be quite useful/informative. That can
also let you know if you have a drive that's failed or is in imminent danger
of doing so ... "of course" just because everything looks fine on the
drive doesn't mean it can't or won't fail at any time anyway (redundancy,
backups!, ...). In many cases, the SMART data & utilities can also
tell you if you have a version of firmware on the drive that should be
updated - some older firmware has bugs, sometimes even
critical/important ones, where the drive firmware really ought to get updated.
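E.g., with smartctl from the smartmontools package (a sketch; /dev/sda
is just an example - repeat per suspect drive):
# smartctl -H /dev/sda
# smartctl -a /dev/sda
# smartctl -t long /dev/sda
-H reports the drive's overall health assessment, -a dumps the SMART
attributes, error log, and firmware/model information, and -t long starts
the drive's own extended self-test (check on it later with
smartctl -l selftest /dev/sda).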
If you got error(s) reading ... where, what drive, what's on it?
If the errors are solid/consistent, rather than intermittent, can
generally work to isolate - drive ... partition ... filesystem or
what have you ... what on or within filesystem or file or raw data
storage? Sometimes one may find the error is on the filesystem,
but not in an area that's in use currently. Sometimes one will find
the error is on something in use - e.g. within only and exactly one
specific file. Depending what it is, it might be semi-simple to
repair. If you get a hard read failure, non-ancient drives generally
map bad blocks out - if it's a hard read error and the drive has spare
blocks, it will generally map the bad block out when it's written. So,
overwrite the file (or that bit of it - do you know the data needed to
fix it?), and you've "fixed" the issue.
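A rough sketch of forcing that remap at the block level - the LBA here
is purely hypothetical (take the real one from the kernel's error
messages), 512-byte logical sectors are assumed, and the dd write
destroys that sector's prior contents:
# hdparm --read-sector 123456789 /dev/sdX
# dd if=/dev/zero of=/dev/sdX bs=512 seek=123456789 count=1 conv=notrunc,fsync
The hdparm read just confirms the sector still errors; the write gives
the drive its chance to remap, and a re-read (plus the SMART reallocated
sector counts) will show whether it did.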
If you have the time/luxury to be able to take drive(s) off-line
and do destructive write tests on the drive(s), one can test quite
a bit more thoroughly. Those tests can also take quite a bit of
time, and may also be overkill, depending what one's objectives are.
In many such cases, I'll just do one single overwrite and read back -
especially for larger drives - and will mostly consider that "good enough"
for uses that aren't quite critical to highly critical. Also, random access on
spinning rust (hard drives, as opposed to SSD), can give very different
(typically much more thorough and closer to real world use most of the
time) results than sequential ... but sequential is much much faster for
spinning rust (and random access is of no significant advantage to test
over sequential, for SSD).
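E.g., a sketch of that single destructive overwrite and read-back -
everything on /dev/sdX is destroyed, so be very sure it's the right,
off-line drive (badblocks -wsv /dev/sdX is a slower, more thorough
multi-pattern alternative):
# dd if=/dev/zero of=/dev/sdX bs=1M conv=fsync
# dd if=/dev/sdX ibs=1M of=/dev/null
The write pass ends with an expected "No space left on device" once it
hits the end of the drive; what matters is that neither pass reports any
I/O errors and the byte counts match the drive's size.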
Oh, and if one has an SSD / hard disk hybrid drive (combines both)
... not sure how to best test something like that - might have to
resort to drive vendor tools and/or rely upon SMART data some fair
bit more. In general, vendor diagnostic/test tools can often do
fair bit more low-level stuff on the drive - but most of the time
I don't bother with such - I figure the drive works, or works
"well enough" and is "sufficiently reliable" ... or ... it ain't
and is time for wipe+ecycle. I figure at least most of the time
if I have to resort to vendor tools to get a drive working again,
that drive is too unreliable (notwithstanding possibly updating
drive firmware - and if feasible without needing vendor specific
tools/software to do that). One noteworthy exception - drive still
under warranty? May need to use vendor tools to convince vendor the
drive is failed and they need to replace it under warranty (thus far on
my personal stuff, really only had one drive fail within warranty,
and yes I did get warranty replacement drive).
I've had multiple occasions where I've gotten hard read failures on
drives, and have been able to "repair" them in such manner (e.g.
one drive, I had that happen a grand total of twice over a period
of about 10 years, and that drive was in nearly continuous operation
most all of that time, and earlier this year likewise "fixed"
a drive that had an issue like that - and from there was then able to
remirror that drive to another drive - then replace the one that
had the error and then mirror back to it again to have good solid
RAID-1 once again and on non-flakey drives).
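E.g., with Linux software RAID (md), that replace-and-remirror dance looks
roughly like this - device names purely illustrative, and the actual
setup involved may well have differed:
# mdadm /dev/md0 --fail /dev/sdb1
# mdadm /dev/md0 --remove /dev/sdb1
(swap in the replacement drive, partition it to match)
# mdadm /dev/md0 --add /dev/sdb1
# cat /proc/mdstat
/proc/mdstat shows the resync progress until the mirror is clean again.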
If you get (very rare?) hard read errors on a drive, you might consider
it at least somewhat less reliable - sometimes it happens. One may
want to replace it ... or not. SMART data can give you a fairly good
idea of how (semi-)okay ... or definitely not ... the drive is - or if it's
likely to fail again soon or have developing progressive problems.
Anyway one of those "fix" stories I have from earlier this year is pretty
interesting - may pass that along, ... but alas, that was work goop, so
would need to redact at least some (notably identifying) bits of it (but
most is sufficiently generic it doesn't matter).