[conspire] disk: one of those "fix" stories

Michael Paoli Michael.Paoli at cal.berkeley.edu
Fri Nov 23 11:23:59 PST 2018


> From: "Michael Paoli" <Michael.Paoli at cal.berkeley.edu>
> To: conspire at linuxmafia.com
> Subject: Re: 3rd Master Hard Disk Error
> Date: Thu, 22 Nov 2018 15:10:32 -0800
>
> Anyway one of those "fix" stories I have from earlier this year is pretty
> interesting - may pass that along, ... but alas, that was work goop, so
> would need to redact at least some (notably identifying) bits of it (but
> most is sufficiently generic it doesn't matter).

So ... bit of background.

An f5 "appliance" - this particular case and "appliance", the f5 ...
actually a pair of them - one specifically addressed here - the nominal
standby), and the other the active primary.  They're Linux based devices,
but ... "appliance" - most of that Linux stuff relatively "hidden",
behind-the-scenes, but ... actually accessible.  And, "of course",
all of f5's stuff (software, etc.) layered atop that.

In this particular case, non-production ("lab"), and suffering some
neglect (priorities/resources) and failure(s), we're in a situation where
we've got a nominal pair of mirrored disks on the unit ... but one has
failed (actually quite a while earlier), and has been replaced (so
effectively no data on the replacement drive), but the other is also
giving hard failures(!).  Working with f5 support, we'd gotten to the
point where they're basically, "Yeah, you need to rebuild that from
scratch/backups - can't access/recover the data on there."

Myself, of course, knowing it's all atop Linux, I decide to dig a bit
and see if I can recover/fix it without too much difficulty, and avoid
the hassle/complexity (fair number of manual steps and time) to rebuild
and restore from backup(s), etc.

Also, Linux *based* ... but lots of f5 stuff atop that, so ... one can't
presume *too* much about it.  E.g. *reading* data is generally fine,
but changing stuff can be (potentially very) hazardous - e.g. if one
changes thing(s) at Linux layer that f5 would expect to be changed
through the f5 interfaces, chaos may ensue - as the f5
software/configurations wouldn't know about or be expecting those
changes.  So, one needs to be quite mindful of that and sufficiently
careful.  Anyway, notwithstanding that major caveat, one can learn, at
the Linux layers, much of what the "appliance" is doing, how, some of
where it is/isn't healthy, etc.  And, in some cases, potentially -
carefully - fix/change *some* things.  Also, this being the nominal
standby unit, rather than active, we've got much more flexibility to
bring things down, etc. - so that makes it quite a bit easier to deal
with.

Also, these particular f5 "appliances" - they're physical x86 based
hardware units.  So there's a physical host at that level.  They *also*
support virtualization, so there are *also* "guest" Virtual Machines
(VMs) running atop the hardware - again Linux-based, with f5 stuff atop
that - on both the host running direct on physical, and the "guest" VMs
within - which in most regards look very similar to the host
on physical - including all the f5 bits within (just doesn't have yet
another VM layer down underneath the guests - so no guests of guests).

So, again, the context - f5 - failed/failing hard drive (spinning rust),
nonrecoverable error(s), but otherwise (mostly) operational (at
Operating System (OS) level), etc.  Two hard drives, nominally mirrored,
the "good" drive is effectively without data - it's a replacement that
was earlier installed for other disk that had much earlier failed
and was nowhere near current on the mirroring anyway.  So at this
point essentially one good drive without data, and one problematic
drive (throwing hard errors).

Also, slight aside - folks will reasonably "disagree" / have differing
perspectives on complexity and its disadvantages and (sometimes)
advantages (what exactly was gained by bringing in that complexity?).
The example here has much complexity (partitions, mdraid, LVM, virtual
machines, f5 vendor layer, and also most all that complexity also within
VM guests) - and that complexity has both disadvantages (much to know /
dig through), and also advantages too (e.g. able to "drill down",
isolate & correct problem - and problem impact also quite limited in
scope due to its isolation quite far down in the layers).

Without further introduction, and slightly redacted:

From: Michael Paoli [REDACTED]
Date: Mon, Aug 6, 2018 at 11:48 AM
Subject: Re: [E] RE: Regarding Service Request #[REDACTED] | | Follow  
up from RMA [REDACTED] & [REDACTED]
To: <[REDACTED]@f5.com>
Cc: <[REDACTED]@f5.com>

Poked at it a bit over the weekend - remotely ...
was able to get the data recovered & all remirrored okay, without  
rebuild/reinstall.
We'll still probably want to go ahead and replace the disk that had  
the read failure, but at present I'm moving on to getting the  
[REDACTED] unit properly remirrored, then will circle back to  
replacing the quite suspect hardware (disk(s)) that still ought be  
replaced ... even if they may be working at the moment (notably ones  
that had hard failures earlier).

The short, and longer versions of the recovery without rebuild, in  
case you're interested/curious:
basically managed to recover the disk that was giving unrecoverable read
errors, and get it remirrored onto replacement drive - using lower level
Linux bits (not fully sanctioned f5 approach, but fully doable ... so
long as we're sufficiently careful as to not conflict with any f5
specific bits).

[REDACTED]

unrecoverable read errors seen on:
/dev/md16 /shared/vmdisks
/dev/md16             835G   14G  780G   2% /shared/vmdisks
e.g.:
md/raid1:md16: dm-2: unrecoverable I/O read error for block 12812032
sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 35690808

on /shared/vmdisks filesystem, very few files and directories,
no directory read errors,
file read error only on:
[REDACTED].img
all other files read fine
*.img files used by guests (as shown by VM PID(s) having them open as
seen via fuser)
disabled guests until no PIDs had those *.img files open

can't copy full [REDACTED].img due to read errors

/shared/vmdisks filesystem is ext3, so journaling is used (overwrite of
a file may be rewritten to different blocks rather than in place, and
probably is).  Can we read the *data* within [REDACTED].img (at least
what's actually used)?

# losetup -f [REDACTED].img
# losetup -a
/dev/loop0: [0910]:6436 ([REDACTED].img)
is image partitioned?
don't have partx ....

# sfdisk -uS -l /dev/loop0
Disk /dev/loop0: cannot get geometry

Disk /dev/loop0: 13054 cylinders, 255 heads, 63 sectors/track
Warning: The partition table looks like it was made
   for C/H/S=*/16/63 (instead of 13054/255/63).
For this listing I'll assume that geometry.
Units = sectors of 512 bytes, counting from 0

    Device Boot    Start       End   #sectors  Id  System
/dev/loop0p1   *         1    460655     460655   b  W95 FAT32
/dev/loop0p2        460656   2557295    2096640  82  Linux swap / Solaris
/dev/loop0p3       2557296 209713391  207156096  8e  Linux LVM
/dev/loop0p4             0         -          0   0  Empty
#
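Those start sectors are in 512-byte units (sfdisk -uS) - multiplying by
512 gives the byte offset that losetup -o takes.  A quick sketch of that
arithmetic (plain shell, nothing f5-specific), for the 3rd (LVM)
partition:

```shell
# Byte offset of the 3rd (LVM) partition from its start sector,
# using the 512-byte sectors sfdisk reported.
start=2557296
offset=$(expr "$start" \* 512)
echo "$offset"    # -> 1309335552
```

Same expr step as used in the session further below - shown here just to
connect the sfdisk listing to the loop-device offsets.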

Yes it's partitioned ...
What data within do we care about?
boot area +MBR (before start of earliest partition)
data within filesystems (not counting slack space) but also including
early space within filesystem (possibly boot data or similar)
don't care about slack space within
Which can/can't we read (and thus copy) and which are or may be
problematic?
Let's try straight copy, see how far we get (expecting it to fail)
# cp -p [REDACTED].img [REDACTED].img.COPY || ls -l [REDACTED].img*
cp: reading `[REDACTED].img': Input/output error
-rw-r--r-- 1 root root 107374182400 Aug  3 16:31 [REDACTED].img
-rw------- 1 root root  49137192960 Aug  3 16:51 [REDACTED].img.COPY
#
That's almost half - should have up through the first 2 partitions, and a
fair bit of the 3rd, but not complete
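Dividing the bytes copied by the sector size confirms where the copy
died (a side calculation, not from the original session):

```shell
copied=49137192960          # bytes cp managed before the I/O error
echo $(( copied / 512 ))    # -> sector 95971080 - well inside the
                            # 3rd partition (sectors 2557296..209713391)
```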
So, what we're missing in terms of actual data, is (or may at least
include) stuff within 3rd partition ... but that's LVM ... what do we
actually have in there (and how much used and not?)

let's change our existing loop to ro:
# losetup -d /dev/loop0 && losetup -r -f [REDACTED].img; losetup -a
/dev/loop0: [0910]:6436 ([REDACTED].img)
#
The losetup we have doesn't have --sizelimit, but it does have
-o (offset)

# expr 2557296 \* 512
1309335552
# losetup -d /dev/loop1
# losetup -r -f -o 1309335552 [REDACTED].img && losetup -a
/dev/loop0: [0910]:6436 ([REDACTED].img)
/dev/loop1: [0910]:6436 ([REDACTED].img), offset 1309335552
#

# pvscan -u
...
   PV /dev/loop1 with UUID wZSZUl-LrU9-kBIg-2EkM-G4wg-4nqt-J6FYVI VG  
vg-db-vda        lvm2 [98.78 GB / 51.91 GB free]
...
Here's a very interesting bit:
# ls -alsh [REDACTED].img
4.6G -rw-r--r-- 1 root root 100G Aug  3 16:31 [REDACTED].img
So ... (very) sparse file - mostly unallocated blocks.
GNU cp has an option to be highly efficient about sparse copies ...
but ... we wouldn't get a read error from a non-allocated block,
so, somewhere in the ~4.6G are read error(s)
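Sparse files are easy to demonstrate in isolation - a generic-Linux
sketch (hypothetical filename): seek far into a new file, write one
byte, and compare apparent size against blocks actually allocated:

```shell
# Create a ~1 MiB apparent-size file that allocates almost nothing:
# seek skips 1048576 bytes of output (a hole), then one byte is written.
dd if=/dev/zero of=sparse.demo bs=1 count=1 seek=1048576 2>/dev/null
ls -ls sparse.demo    # first column (allocated blocks) is tiny,
                      # vs. an apparent size of 1048577 bytes
rm sparse.demo
```

GNU cp's --sparse option can preserve such holes on copy; here the
useful consequence is that any read errors must lie within the ~4.6G
that's actually allocated.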

VG name conflicts with existing so ...

# vgimportclone -n vg-db-vda.[REDACTED] -i /dev/loop1
   WARNING: Activation disabled. No device-mapper interaction will be  
attempted.
   /tmp/snap.iWI12298/vgimport0: write failed after 0 of 4096 at 4096:  
Operation not permitted
   pv_write with new uuid failed for /tmp/snap.iWI12298/vgimport0.
   0 physical volumes changed / 1 physical volume not changed
Fatal: Unable to change PV uuid for /tmp/snap.iWI12298/vgimport0, error: 5
#

but need to be rw to import ...

# losetup -d /dev/loop1 && losetup -f -o 1309335552 [REDACTED].img &&  
losetup -a
/dev/loop0: [0910]:6436 ([REDACTED].img)
/dev/loop1: [0910]:6436 ([REDACTED].img), offset 1309335552
#

# vgimportclone -n vg-db-vda.[REDACTED] -i /dev/loop1
   WARNING: Activation disabled. No device-mapper interaction will be  
attempted.
   Physical volume "/tmp/snap.BnJ12488/vgimport0" changed
   1 physical volume changed / 0 physical volumes not changed
   WARNING: Activation disabled. No device-mapper interaction will be  
attempted.
   Volume group "vg-db-vda" successfully changed
   Volume group "vg-db-vda" successfully renamed to "vg-db-vda.[REDACTED]"
   Reading all physical volumes.  This may take a while...
   Found volume group "vg-db-sdb" using metadata type lvm2
   Found volume group "vg-db-sda" using metadata type lvm2
   Found volume group "vg-db-vda.[REDACTED]" using metadata type lvm2
   Found volume group "vg-db-cpmirror" using metadata type lvm2
# vgchange -a y "vg-db-vda.[REDACTED]"
   16 logical volume(s) in volume group "vg-db-vda.[REDACTED]" now active
# lvs | fgrep vg-db-vda.[REDACTED] | sed -e 's/[       ]*$//'
   dat.log.1                vg-db-vda.[REDACTED]  -wi-a-   7.00G
   dat.maint.1              vg-db-vda.[REDACTED]  -wi-a- 300.00M
   dat.share.1              vg-db-vda.[REDACTED]  -wi-a-  20.00G
   dat.swapvol.1            vg-db-vda.[REDACTED]  -wi-a-   1.00G
   set.1._config            vg-db-vda.[REDACTED]  -wi-a-   3.00G
   set.1._usr               vg-db-vda.[REDACTED]  -wi-a-   2.20G
   set.1._var               vg-db-vda.[REDACTED]  -wi-a-   3.00G
   set.1.root               vg-db-vda.[REDACTED]  -wi-a- 392.00M
   set.2._config            vg-db-vda.[REDACTED]  -wi-a- 512.00M
   set.2._usr               vg-db-vda.[REDACTED]  -wi-a-   1.25G
   set.2._var               vg-db-vda.[REDACTED]  -wi-a-   3.00G
   set.2.root               vg-db-vda.[REDACTED]  -wi-a- 256.00M
   set.3._config            vg-db-vda.[REDACTED]  -wi-a- 512.00M
   set.3._usr               vg-db-vda.[REDACTED]  -wi-a-   1.25G
   set.3._var               vg-db-vda.[REDACTED]  -wi-a-   3.00G
   set.3.root               vg-db-vda.[REDACTED]  -wi-a- 256.00M
# vgdisplay vg-db-vda.[REDACTED]
   --- Volume group ---
   VG Name               vg-db-vda.[REDACTED]
   System ID
   Format                lvm2
   Metadata Areas        1
   Metadata Sequence No  28
   VG Access             read/write
   VG Status             resizable
   MAX LV                0
   Cur LV                16
   Open LV               0
   Max PV                0
   Cur PV                1
   Act PV                1
   VG Size               98.78 GB
   PE Size               4.00 MB
   Total PE              25287
   Alloc PE / Size       11999 / 46.87 GB
   Free  PE / Size       13288 / 51.91 GB
   VG UUID               Pr3x0P-li8m-HUAR-K5w4-v7Y0-vQjL-YyABNk

#
Note that over half the VG is unallocated blocks
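That "over half" follows directly from the PE counts in the vgdisplay
output (13288 free of 25287 total):

```shell
# Free-PE percentage from the vgdisplay figures above:
echo $(( 13288 * 100 / 25287 ))    # -> 52 (percent free)
```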
Which can we recover?
# mkdir recover
# (for f in /dev/mapper/vg--db--vda.[REDACTED]*; do b=`basename "$f"`  
&& cat "$f" > recover/"$b"; done)
cat: /dev/mapper/vg--db--vda.[REDACTED]-set.1._config: Input/output error
All LVs read fine, except the one immediately above
# blkid /dev/mapper/vg--db--vda.[REDACTED]-set.1._config
/dev/mapper/vg--db--vda.[REDACTED]-set.1._config:  
LABEL="set.1./config" UUID="ba980429-7a35-46c5-925b-9f4eadcc4ba0"  
SEC_TYPE="ext2" TYPE="ext3"
# mkdir [REDACTED].config && mount -o ro,nosuid,nodev  
/dev/mapper/vg--db--vda.[REDACTED]-set.1._config [REDACTED].config
# df -h [REDACTED].config
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg--db--vda.[REDACTED]-set.1._config
                       3.0G   72M  2.8G   3% /shared/vmdisks/[REDACTED].config
#
Filesystem only 3% used anyway ...
# find [REDACTED].config ! -type d ! -type f ! -type l -print
Only files, directories, and sym links ... can we back them all up?
Yes, all read and backed up without error:
(cd [REDACTED].config && tar -cf - .) | gzip -9 > [REDACTED].config.tar.gz
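That find predicate (anything that's not a directory, regular file, or
symlink) is easy to sanity-check on a scratch tree - hypothetical paths,
just to show what it would flag:

```shell
mkdir -p demo/sub
touch demo/sub/file
ln -s file demo/sub/link
mkfifo demo/fifo    # a FIFO is neither -type d, f, nor l
find demo ! -type d ! -type f ! -type l -print   # prints only demo/fifo
rm -r demo
```

Device nodes, sockets, etc. would likewise be flagged - things a plain
tar backup/restore may not round-trip cleanly.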
What about metadata?
# blkid /dev/mapper/vg--db--vda.[REDACTED]-set.1._config
/dev/mapper/vg--db--vda.[REDACTED]-set.1._config:  
LABEL="set.1./config" UUID="ba980429-7a35-46c5-925b-9f4eadcc4ba0"  
SEC_TYPE="ext2" TYPE="ext3"
# umount [REDACTED].config && rmdir [REDACTED].config
# sfdisk -uS -d /dev/loop0
# partition table of /dev/loop0
unit: sectors

/dev/loop0p1 : start=        1, size=   460655, Id= b, bootable
/dev/loop0p2 : start=   460656, size=  2096640, Id=82
/dev/loop0p3 : start=  2557296, size=207156096, Id=8e
/dev/loop0p4 : start=        0, size=        0, Id= 0
# vgchange -a n "vg-db-vda.[REDACTED]"
# vgdisplay vg-db-vda.[REDACTED]
   VG UUID               Pr3x0P-li8m-HUAR-K5w4-v7Y0-vQjL-YyABNk
# vgrename Pr3x0P-li8m-HUAR-K5w4-v7Y0-vQjL-YyABNk vg-db-vda
   Volume group "vg-db-vda.[REDACTED]" successfully renamed to "vg-db-vda"
# mv [REDACTED].config.tar.gz [REDACTED].img.COPY recover/
# rm recover/vg--db--vda.[REDACTED]-set.1._config
#

# ssh -ax -l root [REDACTED] 'cd /shared/vmdisks && { df -h .;  
/bin/hostname; ls -d rec*; }'
Password:
Filesystem            Size  Used Avail Use% Mounted on
/dev/md16             835G   14G  780G   2% /shared/vmdisks
[REDACTED]
ls: rec*: No such file or directory
# (cd /shared/vmdisks/recover && tar -cf - .) | gzip -9 >  
../[REDACTED].recover.tar.gz
# scp -p ../[REDACTED].recover.tar.gz  
root@[REDACTED]:/shared/vmdisks/recover.[REDACTED]/
Password:
[REDACTED].recover.tar.gz                      100% 1718MB  63.6MB/s   00:27
# cd /shared/vmdisks
# scp -p $(find [A-Z]* -name recover -type d -prune -o \( -type f !  
-name [REDACTED].img \) -print)  
root@[REDACTED]:/shared/vmdisks/recover.[REDACTED]/
Password:
[REDACTED]_GUEST.img                                 100%  100GB   
65.8MB/s   25:56
[REDACTED]_GUEST.info                                100%   23      
0.0KB/s   00:00
[REDACTED].info                                      100%   23      
0.0KB/s   00:00
[REDACTED]_GUEST.img                                 100%  100GB   
65.4MB/s   26:06
[REDACTED]_GUEST.info                                100%   23      
0.0KB/s   00:00
#

# mkdir recover.2
# cp -p [REDACTED].img recover.2/[REDACTED].img.COPY2
# echo $?
0
#

Note that it actually copied without error that time; the drive
likely managed to successfully remap the sector it couldn't read.

# smartctl -x /dev/sda 2>&1 | sed -ne '/^ID#/,/prefailure/p'
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
   1 Raw_Read_Error_Rate     POSR-K   197   197   051    -    302
   3 Spin_Up_Time            POS--K   198   198   021    -    3100
   4 Start_Stop_Count        -O--CK   100   100   000    -    46
   5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    1
   7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
   9 Power_On_Hours          -O--CK   093   093   000    -    5519
  10 Spin_Retry_Count        -O--CK   100   253   000    -    0
  11 Calibration_Retry_Count -O--CK   100   253   000    -    0
  12 Power_Cycle_Count       -O--CK   100   100   000    -    44
192 Power-Off_Retract_Count -O--CK   200   200   000    -    36
193 Load_Cycle_Count        -O--CK   200   200   000    -    20
194 Temperature_Celsius     -O---K   124   100   000    -    26
196 Reallocated_Event_Count -O--CK   199   199   000    -    1
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   100   253   000    -    0
                             ||||||_ K auto-keep
                             |||||__ C event count
                             ||||___ R error rate
                             |||____ S speed/performance
                             ||_____ O updated online
                             |______ P prefailure warning
#

Notice Current_Pending_Sector is now down to 0 - whereas earlier it was 1,
so the bad sector has been mapped out now (either recovered and
remapped, or remapped when written to).  So seems we're good now.

# already copied all this stuff to other host, so:
# rm [REDACTED].recover.tar.gz && rm -rf recover recover.2
#

# mdadm --detail /dev/md16 | less
/dev/md16:
         Version : 0.90
   Creation Time : Tue Jun 17 22:25:58 2014
      Raid Level : raid1
      Array Size : 877072320 (836.44 GiB 898.12 GB)
   Used Dev Size : 877072320 (836.44 GiB 898.12 GB)
    Raid Devices : 2
   Total Devices : 2
Preferred Minor : 16
     Persistence : Superblock is persistent

     Update Time : Sat Aug  4 18:39:59 2018
           State : active, degraded
  Active Devices : 1
Working Devices : 2
  Failed Devices : 0
   Spare Devices : 1

            UUID : 165d54b9:8605b578:10b78e4a:3daa69b7
          Events : 0.2291861

     Number   Major   Minor   RaidDevice State
        0     253        2        0      active sync    
/dev/vg-db-sda/mdm.app.vcmp.dat.vmdisks
        1       0        0        1      removed

        2     253       22        -      spare    
/dev/vg-db-sdb/mdm.app.vcmp.dat.vmdisks
#
# mdadm --remove /dev/md16 /dev/vg-db-sdb/mdm.app.vcmp.dat.vmdisks
mdadm: hot removed /dev/vg-db-sdb/mdm.app.vcmp.dat.vmdisks
# mdadm --zero-superblock /dev/vg-db-sdb/mdm.app.vcmp.dat.vmdisks
# mdadm --add /dev/md16 /dev/vg-db-sdb/mdm.app.vcmp.dat.vmdisks
mdadm: added /dev/vg-db-sdb/mdm.app.vcmp.dat.vmdisks
#

Did the remirror of md16 - it completed fine this time, and once it
completed, all looks fine on the drives.

While that was going on (actually started a bit earlier), also did:
# dd if=/dev/zero of=.nulls obs=4096; rm .nulls
dd: writing to `.nulls': No space left on device
1719802064+0 records in
214975257+0 records out
880538652672 bytes (881 GB) copied, 7375.2 seconds, 119 MB/s
#

That was to write out most blocks on the filesystem - so any that might be
marginal would get remapped ... and the mirror completed fine, so all
read okay.
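The dd figures above are internally consistent - output records times
the 4096-byte output block size matches the byte count reported:

```shell
# 214975257 output records * obs=4096 bytes per record:
echo $(( 214975257 * 4096 ))    # -> 880538652672 bytes, as dd reported
```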

Also ran [REDACTED] through a full reboot to ensure everything came up  
and looked fine - that was after having reenabled the guests, once all  
the data was reading fine again.

Anyway, may not have shown all the steps in the preceding/above, but
showed at least key steps and data points.




