[sf-lug] What did happen to the host at the colo?: Playing "detective" with data logs and other data

Sat Nov 28 19:00:26 PST 2015

Playing "detective" with data logs and other data.

So, was curious, "vicki" - the physical host that was in the colo
... what can we determine happened from logs and other captured data?

Timestamps PST8PDT unless otherwise noted

In chronological order, showing the more "interesting" bits, I find or
had noted:

2015-08-19T08:01:49-0700 last data caught on remote backup  
/var/log/authlog.log
(I'd increased the frequency of backups to overnightly, expecting the
box to be "unplugged" (or equivalent) "any day now".)

< /var/log/kern.log.4.gz gzip -d | grep 'eth0 NIC'
Aug 19 20:01:28 vicki kernel: [35675070.636447] e1000e: eth0 NIC Link is Down
Presumably the Ethernet cable was unplugged, or more likely, switch port
to which it was connected was disabled.  In any case, at that point
host is off-line, ... though surprisingly not down yet.
Guessing it may have been someone's cron job or the like, that started
off at 8pm, and when it got to that account in the colo ... time to
disable the account, kill the switch port.

Noticed in well under 24 hours, that the sites were down:
2015-08-20T09:05:43-0700  
http://lists.balug.org/pipermail/balug-admin-balug.org/2015-August/002981.html
2015-08-20T09:11:57-0700  
http://linuxmafia.com/pipermail/sf-lug/2015q3/011372.html
and had the sites and essential SF-LUG services (notably web + DNS) all
operational again within 4 days 07h 08m 58s of link drop at the colo site:
2015-08-22T03:00:16-0700  
http://lists.balug.org/pipermail/balug-admin-balug.org/2015-August/002983.html
2015-08-24T03:10:26-0700  
http://linuxmafia.com/pipermail/sf-lug/2015q3/011381.html

Somewhat surprisingly, looks like the host remained powered up and
running until approximately sometime between/around:
Nov 12 08:35:32
and/or:
Nov 12 08:40:18
or so.  Did see continued activity in the logs (e.g. cron jobs, etc.)
and no indications of an earlier shutdown, or reboot before around then.

Jim and I unceremoniously pulled power on the box (possibly pressing the
power button first?) - we weren't expecting it to still be up, and in
the noise of the data center we might not have noticed (in a quiet room,
the box is loud, in a loud data center, it's comparatively quite quiet)
... in any case, we were a bit pressed for time (I was squeezing this in
before heading off to the office), and since the host was offline
anyway, it didn't matter much.

$ grep '^2015-11-12 0[89]:' /var/log/exim4/mainlog | tail -n 1
2015-11-12 08:35:32 End queue run: pid=1740

And rather consistent with the above, we find from the uptimed/uprecords
data, it appears to have gone down around:
Thu Nov 12 08:40:18 PST 2015

$ date -d 'Wed Jul  2 22:16:58 2014 + 497 days + 11 hours - 1 hour +  
23 minutes + 20 seconds'
Thu Nov 12 08:40:18 PST 2015

$ date; uprecords -a -m 200
Sat Nov 28 17:34:58 PST 2015
      #               Uptime | System                                   
    Boot up
----------------------------+---------------------------------------------------
      1   497 days, 11:23:20 | Linux 3.2.0-4-amd64       Wed Jul  2  
22:16:58 2014
      2     0 days, 09:59:02 | Linux 3.16.0-4-amd64      Wed Nov 25  
07:39:15 2015
      3     0 days, 08:34:13 | Linux 3.2.0-4-amd64       Mon Nov 23  
20:42:54 2015
->   4     0 days, 03:52:31 | Linux 3.16.0-4-amd64      Sat Nov 28  
13:42:28 2015
      5     0 days, 02:43:54 | Linux 3.2.0-4-amd64       Sun Nov 22  
06:53:55 2015
      6     0 days, 01:31:52 | Linux 3.2.0-4-amd64       Thu Nov 12  
19:24:43 2015
      7     0 days, 01:25:02 | Linux 3.16.0-4-amd64      Wed Nov 25  
03:31:22 2015
      8     0 days, 00:47:35 | Linux 3.16.0-4-amd64      Wed Nov 25  
05:14:10 2015
      9     0 days, 00:21:52 | Linux 3.16.0-4-amd64      Tue Nov 24  
05:17:46 2015
     10     0 days, 00:13:09 | Linux 3.16.0-4-amd64      Tue Nov 24  
05:40:16 2015
     11     0 days, 00:05:37 | Linux 3.16.0-4-amd64      Tue Nov 24  
19:28:34 2015
     12     0 days, 00:01:55 | Linux 3.2.0-4-amd64       Thu Nov 12  
20:57:24 2015
----------------------------+---------------------------------------------------
1up in     0 days, 04:41:43 | at                        Sat Nov 28  
22:16:40 2015
no1 in   497 days, 07:30:50 | at                        Sun Apr  9  
02:05:47 2017
     up   498 days, 17:00:02 | since                     Wed Jul  2  
22:16:58 2014
   down    15 days, 03:17:59 | since                     Wed Jul  2  
22:16:58 2014
    %up               97.054 | since                     Wed Jul  2  
22:16:58 2014

2015-11-12 Jim Stockford & Michael Paoli - retrieved from colo
(the above was done around 8:30am--8:55am)

May have found a bit more data by looking at the logs sooner (e.g.
before powering up again and mounting filesystems rw again), but that
wasn't a priority nor particularly important.  Mostly just peeked later
as a curiosity, to see what the physical host box "vicki" could tell us
of it's "adventures" at the colo.  There's also the two VMs that were
running on the host - some of their logs may also, likewise, have
provided a bit more detail and confirming evidence, etc.

data sources:

about the electronic equivalent of system log book:
file://vicki/var/local/log/log
Yes it's on the physical host itself, but it's also relatively regularly
backed up to multiple other locations (and heck of a lot easier to
search with grep and friends, and access remotely, rather than thumbing
through pages of physical log book to find some item).

log files on physical host:
file://vicki/var/log/

a present location of backup copy of last data from the "vicki" host
before it went offline:
file://tigger/var/tmp/balug/balug-sf-lug-v2.console.balug.org./root

[sf-lug] What *did* happen to the host at the colo?: Playing "detective" with data logs and other data

[sf-lug] What did happen to the host at the colo?: Playing "detective" with data logs and other data