[conspire] and more, etc.: disk: one of those "fix" stories

Michael Paoli Michael.Paoli at cal.berkeley.edu
Sat Jan 18 06:56:27 PST 2020

See also the earlier:

So, the most current - again spinning rust.  Though there are some bits
that would be in common with SSD, there's a whole lot that's different
with SSD - notably what goes on with the technology and physics, wear
patterns and use and typical failure modes, write and erase block sizes,
etc.  So, not covering SSD here.

First, selected bits from my systems logs.  (I use a certain format:
entries are separated by an empty line, and each entry starts with a
line beginning with a particular ISO date/timestamp format, possibly
with additional information, followed by additional non-empty line(s)
of information.  The log and its format is mostly for humans, but it's
also very searchable, and may also contain relevant bits/excerpts
from logs or other diagnostics, etc. ... it's also fairly easy to
locate stuff at/around certain dates or ranges of time.)
So ... selected log bits:

Seagate Barracuda 7200.11 ST31500341AS 9VS02L4Q
had some error(s), notably:
197 Current_Pending_Sector  -O--C-   100   100   000    -    1
as seen from:
# smartctl -x /dev/sdc
Seeing further indications of error(s) on drive:
dd: error reading '/dev/sdc': Input/output error
1014514304+0 records in
1014514304+0 records out
519431323648 bytes (519 GB, 484 GiB) copied, 19784.2 s, 26.3 MB/s

Seagate Barracuda 7200.11 ST31500341AS 9VS02L4Q
non-recoverable errors "fixed" ... for now:
After moving (pvmove) allocated LVM PEs off of the (LUKS) PV atop
partition 7, removing the PV from the VG (vgreduce), removing
LUKS from the partition, and running
badblocks -w
on the partition,
the unrecoverable read error is gone:
# smartctl -x /dev/sdc | fgrep -e Serial -e Pending_S
Serial Number:    9VS02L4Q
197 Current_Pending_Sector  -O--C-   100   100   000    -    0
badblocks -w
completed without errors on the partition, and was
able to successfully read all (logical) sectors on the drive that failed
before (and also all remaining sectors on the drive).

So ... a wee bit more background, etc.
Drive had given me some error(s) ... it's SATA; throughout this it was
connected via the laptop's eSATA port, so "external" to the laptop ...
but "just" a carefully placed bare drive (with a separate power supply
connected to it).  No USB or other cruft between laptop and the SATA
drive - all a straight [e]SATA connection.
After glancing over log(s) and seeing some errors there from the drive,
ran smartctl -x on it.  That's when I saw the:
197 Current_Pending_Sector  -O--C-   100   100   000    -    1
That's bad.  :-(
A value (RAW_VALUE) other than 0 for Current_Pending_Sector basically
means the drive has sector(s) it wants to map out because they failed,
but it can't - at least currently.  E.g. it can't read the data to write
it to where it would remap it to - so it's currently a hard
unrecoverable read failure.
Well, what to do?  Drive is (quite) out of warranty,
but issue may just be one single sector ... or a handful of sectors.
And, if they're written, rather than read, the drive may automagically
correct itself - remapping the sector(s) upon write.
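For a quick check of just that attribute, one can awk it out of smartctl
output; a runnable sketch - parsing a captured sample line (from the
output above) rather than a live smartctl run:

```shell
# Extract the raw Current_Pending_Sector count from smartctl -x style
# output.  Using a captured sample line here; on a live system one would
# instead pipe in:  smartctl -x /dev/sdX
sample='197 Current_Pending_Sector  -O--C-   100   100   000    -    1'
pending=$(printf '%s\n' "$sample" | awk '/Current_Pending_Sector/ {print $NF}')
echo "pending sectors: $pending"
```

Anything other than 0 there is the "that's bad" case described above.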
So ... where's the data?
The drive has physical sectors of 512 bytes (newer, larger capacity
drives may have larger physical sectors - 512 byte sectors go way back,
at least well into fairly early Unix history).
So, I do a dd, with ibs=512 (and of=/dev/null), and if=/dev/sdc (where
the drive currently is ... sometimes it would move/jump to /dev/sdd ...
and back, notably if it failed, wasn't totally freed up, then was
rescanned - the OS would consider sdc still in use, and put the drive at
sdd rather than sdc ... one can also use persistent naming via other dev
configurations, or the device path used - e.g. for this drive, this:
would also be a persistent name - at least if connected via [[e]S]ATA.
Checking my host presently, I find, in fact, all of these for that same
drive:
We could probably even use:
and it would probably persist - even if the drive were connected via
other than [[e]S]ATA.
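The exact dd invocation isn't shown above, so here's a minimal sketch of
that kind of whole-device read scan - demonstrated on a small temporary
file standing in for /dev/sdc, so it's safely runnable anywhere:

```shell
# Stand-in "device": a small file, so the sketch can't touch real disks.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=512 count=4 2>/dev/null

# The scan itself: read every 512-byte sector, discarding the data.
# On the real drive this would be e.g.:  dd ibs=512 if=/dev/sdc of=/dev/null
out=$(dd ibs=512 if="$img" of=/dev/null 2>&1)
echo "$out"
rm -f "$img"
```

On a read failure, dd stops and its "records in" count tells you exactly
how many 512-byte sectors were read before the error - i.e. the failing
sector's offset.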

Anyway, the aforementioned dd failed to read the entire drive.
The failure's diagnostic and dd's info to stderr let me know where on
the drive the (at least first) read error happened (logs would also
show it, if I looked there too).  From that, I used sfdisk to look at
the partitioning, and determine where on the drive the (first) error was:
# sfdisk -uS -l /dev/sdc
Disk /dev/sdc: 1.4 TiB, 1500301910016 bytes, 2930277168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000

Device     Boot      Start        End    Sectors   Size Id Type
/dev/sdc1             2048 2930277167 2930275120   1.4T 85 Linux extended
/dev/sdc5             4096  366286182  366282087 174.7G 83 Linux
/dev/sdc6        366288231  732570317  366282087 174.7G 83 Linux
/dev/sdc7        732572366 1098854452  366282087 174.7G 83 Linux
/dev/sdc8       1098856501 1465138587  366282087 174.7G 83 Linux
/dev/sdc9       1465140636 1831422722  366282087 174.7G 83 Linux
/dev/sdc10      1831424771 2197706857  366282087 174.7G 83 Linux
/dev/sdc11      2197708906 2563990992  366282087 174.7G 83 Linux
/dev/sdc12      2563993041 2930275127  366282087 174.7G 83 Linux
From that, I could see the error was within /dev/sdc7.
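That lookup - which partition a failed LBA falls in - is simple
arithmetic against the sfdisk table above.  A sketch, using the
1014514304 records dd reported reading before it failed (so the first
failed 512-byte sector is at that offset):

```shell
# First failed 512-byte sector = number of records dd read successfully
# before the I/O error (dd counts from 0, same as sfdisk -uS offsets).
bad=1014514304
# Start/end sectors of /dev/sdc7, from the sfdisk -uS output above.
start=732572366
end=1098854452
if [ "$bad" -ge "$start" ] && [ "$bad" -le "$end" ]; then
    echo "sector $bad is within /dev/sdc7"
fi
```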
That partition I had (at the time) in use as LUKS encrypted partition.
That was then, in turn, used as a PV in a VG under
Logical Volume Manager (LVM).
So ... I looked at the Volume Group (VG) portion(s) on that Physical
Volume (PV) - not all the Physical Extents (PE)s in the PV were
allocated.  I could've used dd to see what among them I could read,
but instead I jumped to pvmove.  I set about moving all the allocated
data off the PV - and it turns out they all moved off there without
error - so no read errors among the actual allocated/stored data,
and at the same time, also moved that data to elsewhere on disk.

I then proceeded to undo my "layers" on the partition - so I could
try to "fix" the error by overwriting it.
I removed the PV from the VG (vgreduce).
I freed the partition from active LUKS encryption (cryptdisks_stop ...).
And at least for now, I commented out any program / configuration bits
that would attempt to reactivate LUKS on that partition.
I then used badblocks -w (along with some additional options) on the
partition.  That not only gave no errors, but somewhat unsurprisingly,
automagically "fixed" the error - essentially remapping upon write.
(The error still showed as present before badblocks had completed its
first full write pass).
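The tear-down and overwrite sequence above can be sketched roughly as
follows.  All device, VG, and mapping names here are hypothetical
placeholders, and badblocks -w destroys the partition's contents, so
this sketch is a dry-run by default:

```shell
# DESTRUCTIVE sketch of: pvmove -> vgreduce -> stop LUKS -> badblocks -w.
# Names are hypothetical placeholders, NOT from the original post's host.
# Nothing runs unless CONFIRM=yes is set in the environment.
PART=/dev/sdc7                   # hypothetical partition with pending sector
LUKSPV=/dev/mapper/sdc7_crypt    # hypothetical LUKS mapping (the LVM PV)
VG=vg0                           # hypothetical volume group

if [ "${CONFIRM:-no}" = yes ]; then
    pvmove "$LUKSPV"             # migrate allocated PEs off this PV
    vgreduce "$VG" "$LUKSPV"     # drop the PV from the VG
    cryptdisks_stop sdc7_crypt   # free the partition from active LUKS
    badblocks -w -s "$PART"      # write-test; writes let the drive remap
else
    ran=dryrun
    echo "dry run only - set CONFIRM=yes to actually run"
fi
```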

"Of course" that doesn't make the drive highly reliable - just "fixes"
the issue ... for now.  Drives can always fail at any time, and no
guarantees one can get any data back from the drive after failure(s).
I certainly couldn't get back the data from the sector that hard-failed
on read, but conveniently, in this case, it was only allocated, and
didn't have any "real" data on it.  And even if it had actual data, I could've
continued to isolate it - moving any other data in actual use on that
partition to elsewhere.  If the VG had spanned multiple drives, I
potentially could've even moved the data to other drives with pvmove.
But in this case, it's a stand-alone drive - which I use (almost?)
exclusively for backup purposes (have multiple such drives ... notably
for redundancy in case of drive failure(s)).  If the data couldn't have
been moved due to hard read errors within "real" data, I may have been
able to further isolate.  E.g. if it's a filesystem and it will mount,
see what can be read okay on the filesystem - where things fail on the
filesystem, one isolates to the file(s) with issue(s).  Might've been,
in such a case, as little as a single file.  Or if it were on
a directory - and only directory(/ies), then what's lost is
not file data proper, but metadata.  In the land of Unix/Linux/BSD,
at least for (more-or-less) native filesystem types, directories
themselves only contain the names of the items within the directory (the
file names), and their inode numbers.  If the damage was limited to
that, fsck would generally recover the files ... they'd end up under
/lost+found with filenames corresponding to the "found" inode numbers
for each of the files - one would then need to determine the correct
name/location to be able to put them back.
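That inode-number matching can be seen in miniature: fsck's /lost+found
entries are (for ext-family filesystems) typically named after the inode
number, and a known path can be matched to an inode the same way.  A
safely runnable toy demo, using a temporary directory:

```shell
# Toy demo: locate a file by its inode number - the same lookup one does
# when matching /lost+found entries back to their real names/locations.
d=$(mktemp -d)
echo 'hello' > "$d/afile"
ino=$(stat -c %i "$d/afile")      # the file's inode number
found=$(find "$d" -inum "$ino")   # find path(s) having that inode
echo "inode $ino -> $found"
rm -rf "$d"
```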

Anyway, some folks don't like complexity.  Well, I'd quite argue
sometimes, appropriately used / sprinkled about, it's well worth it.
E.g. in this particular case, it made it quite easy to isolate the
issue, move the data to elsewhere on disk, and fix the issue in the
partition.  (Actually, in this particular case, LVM layer alone
would allow the data to be relocated, but fixing the issue would be a
fair bit more challenging without the multiple separate partitions - as I was
able to free up the use of the partition in this case - which then made
it easier to "fix" the issue there.)

Oh, and I did also find other read errors on the drive (as
found/reported by dd used to read the drive).  There was a small
consecutive burst of 'em - immediately after the first sector with the
read error.  At that point, the SMART data - as seen by
smartctl -x
was still showing only one hard unrecoverable read error ... dd
apparently gave up sooner?  Or maybe internally the drive uses a larger
block size, but at the (SATA) presentation layer, it uses a "physical"
block size of 512 bytes (perhaps for backwards compatibility) - so
perhaps only a single true physical sector on the drive.
Let's see ... my data from dd ... how many (all were consecutive)
512 byte sectors failed? ...
if we do 512 byte sectors with our number/count starting at 0 (offset),
we have ...
first sector that failed ...
last sector that failed ...
So ...
... we may guess, perhaps correctly so, that the underlying physical
sector size is 4KiB (not atypical for modern and semi-modern high
capacity hard drives), and the SATA presentation layer for this
model of drive, still presents it as 512 byte physical.  If that's
the case, it also (re)emphasizes the importance of proper alignment -
notably for best performance and wear/lifetime.
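That guess can be sanity-checked with a little arithmetic on the first
failed sector from the dd output above.  (The exact burst length was
elided above; 8 consecutive 512-byte sectors - exactly one 4 KiB
physical sector - is an assumption here, consistent with the guess.)

```shell
# If the first failed 512-byte LBA sits on an 8-sector (4 KiB) boundary,
# that's consistent with a single underlying 4 KiB physical sector failing.
bad=1014514304                   # first failed 512-byte sector, per dd
echo "offset within 4KiB physical sector: $((bad % 8))"
echo "4KiB physical sector number:        $((bad / 8))"
```

A remainder of 0 means the failed burst started exactly on a 4 KiB
boundary - the same alignment consideration that matters for
performance and wear.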

So ... partition(s) (complexity! ;-)) or not, etc.
My rule-of-thumb goes about like this:
One or two drives, each split into at least 4 partitions, not generally
more than 8, but somewhat more (e.g. up to about 16 or even 20)
may sometimes be warranted - notably/especially on the very first
drive - and may be quite similar on the 2nd, especially if (RAID-1)
mirroring is used.
3rd & 4th drives - typically 4 to 8 partitions, though 0 (no partitions)
can be fine too.
5th and subsequent drives/LUNs - DO NOT PARTITION! :-)  Why?  Makes life
*much* simpler.  In many cases these "drives" may be LUNs, or virtual,
or one may want to upgrade/replace them with larger capacity drives.
Under Linux, if the drives are not partitioned at all :-), it's much
easier to "grow" them - notably have the OS know and take advantage of
their larger space - even if/when done live!  Very similar also applies
when replacing a disk with a larger one - do a direct image copy from the
smaller drive to the larger - once that's done, if the drive isn't
partitioned at all, it's much easier to make use of the additional space.
Whether it's directly used as filesystem, or PV in VG in LVM, or LUKS
encryption done on entire drive, much easier to grow such if the drive
isn't partitioned at all.
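For the whole-drive-as-PV case, the grow is about two commands.  A
hedged sketch - device and LV names are hypothetical, and it's a
dry-run by default since it touches live storage:

```shell
# Sketch of growing an unpartitioned drive used whole as an LVM PV,
# after the LUN/virtual disk itself has been enlarged.  Names are
# hypothetical placeholders; nothing runs unless CONFIRM=yes is set.
DISK=/dev/sdf                    # hypothetical >=5th, unpartitioned drive

if [ "${CONFIRM:-no}" = yes ]; then
    echo 1 > "/sys/block/${DISK##*/}/device/rescan"  # kernel re-reads size
    pvresize "$DISK"             # grow the PV to the new device size
    # ...then e.g.:  lvextend -r -L +10G /dev/vg0/somelv  as needed
else
    ran=dryrun
    echo "dry run only - set CONFIRM=yes to actually run"
fi
```

With a partition in the way, one would instead have to grow the
partition first (and, for a primary partition, often delete/recreate
it) before the PV could be resized - the extra hassle described above.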
And yes, along with that - pet peeve.  I'll see many (especially more
jr.) sysadmins set up such (>=5th) drives, as one single primary
partition using the entire drive ... not because that's a good idea, but
because most Linux distros will do that by default when configuring the
drive for use.  Ugh ... no, don't do that - in that case one's added a
layer of complexity with about zero advantage to it, and mostly makes
growing it go from easy peasy to a more significant pain/hassle.
Also, at >=5th drive, one is more likely, if/when there are issues with
drive, to simply pull and replace it, or if virtual or SAN, potentially
grow it.

And, now that I "fixed" the drive, how reliable?  Who knows, but it
henceforward becomes a "suspect" drive, to be less trusted ... "of
course" any drive can fail at any time, and may not be able to recover
anything once failed.

So, how 'bout that automagic fixing/remapping of hard (spinning rust)
drives?  I first encountered that on a circa 2003 laptop drive.  After
a modest number of years in operation, it developed a hard (read) error
on the drive.  It was only and exactly a single 512 byte sector - all
else on the drive read fine.  Isolated it to within a single file.
Overwrote the file with the data that the file should contain - the
drive remapped the sector upon write - "fixed" ... problem gone.  Some
years later had same issue again on same drive (different sector) ... I
likewise "fixed" it again.  Drive went on for several more years of
(nearly) continuous service ... the drive made it to around 9 years old
before becoming so seriously and generally flakey/unreliable that its
service life effectively ended (though I could still sometimes read all
the data off of the drive).

SMART data - good to look at that once in a while, etc.  It won't
necessarily tell you when your drive will likely fail soon, but
sometimes it will effectively indicate that ... sometimes will also tell
you that hey, despite the drive testing fine end-to-end with repeated
r/w testing, your drive is headed for serious trouble and likely to hard
fail on you at about any time and without further advance warning (I
have a failed drive like that ... someone gave it to me (as a failed
drive) ... tests out "perfectly fine" ... until I look at the SMART data
... then it's a very scary looking drive). ... that drive is thus far
still available if someone wants it - have a look at:

More information about the conspire mailing list