[sf-lug] What *did* happen to the host at the colo?: Playing "detective" with data logs and other data
Michael Paoli
Michael.Paoli at cal.berkeley.edu
Sat Nov 28 19:00:26 PST 2015
Playing "detective" with data logs and other data.
So, was curious, "vicki" - the physical host that was in the colo
... what can we determine happened from logs and other captured data?
Timestamps PST8PDT unless otherwise noted
In chronological order, showing the more "interesting" bits, I find or
had noted:
2015-08-19T08:01:49-0700 last data caught on remote backup
/var/log/authlog.log
(I'd increased the frequency of backups to overnightly, expecting the
box to be "unplugged" (or equivalent) "any day now".)
< /var/log/kern.log.4.gz gzip -d | grep 'eth0 NIC'
Aug 19 20:01:28 vicki kernel: [35675070.636447] e1000e: eth0 NIC Link is Down
Presumably the Ethernet cable was unplugged, or more likely, switch port
to which it was connected was disabled. In any case, at that point
host is off-line, ... though surprisingly not down yet.
Guessing it may have been someone's cron job or the like, that started
off at 8pm, and when it got to that account in the colo ... time to
disable the account, kill the switch port.
Noticed in well under 24 hours, that the sites were down:
2015-08-20T09:05:43-0700
http://lists.balug.org/pipermail/balug-admin-balug.org/2015-August/002981.html
2015-08-20T09:11:57-0700
http://linuxmafia.com/pipermail/sf-lug/2015q3/011372.html
and had the sites and essential SF-LUG services (notably web + DNS) all
operational again within 4 days 07h 08m 58s of link drop at the colo site:
2015-08-22T03:00:16-0700
http://lists.balug.org/pipermail/balug-admin-balug.org/2015-August/002983.html
2015-08-24T03:10:26-0700
http://linuxmafia.com/pipermail/sf-lug/2015q3/011381.html
Somewhat surprisingly, looks like the host remained powered up and
running until approximately sometime between/around:
Nov 12 08:35:32
and/or:
Nov 12 08:40:18
or so. Did see continued activity in the logs (e.g. cron jobs, etc.)
and no indications of an earlier shutdown, or reboot before around then.
Jim and I unceremoniously pulled power on the box (possibly pressing the
power button first?) - we weren't expecting it to still be up, and in
the noise of the data center we might not have noticed (in a quiet room,
the box is loud, in a loud data center, it's comparatively quite quiet)
... in any case, we were a bit pressed for time (I was squeezing this in
before heading off to the office), and since the host was offline
anyway, it didn't matter much.
$ grep '^2015-11-12 0[89]:' /var/log/exim4/mainlog | tail -n 1
2015-11-12 08:35:32 End queue run: pid=1740
And rather consistent with the above, we find from the uptimed/uprecords
data, it appears to have gone down around:
Thu Nov 12 08:40:18 PST 2015
$ date -d 'Wed Jul 2 22:16:58 2014 + 497 days + 11 hours - 1 hour +
23 minutes + 20 seconds'
Thu Nov 12 08:40:18 PST 2015
$ date; uprecords -a -m 200
Sat Nov 28 17:34:58 PST 2015
# Uptime | System
Boot up
----------------------------+---------------------------------------------------
1 497 days, 11:23:20 | Linux 3.2.0-4-amd64 Wed Jul 2
22:16:58 2014
2 0 days, 09:59:02 | Linux 3.16.0-4-amd64 Wed Nov 25
07:39:15 2015
3 0 days, 08:34:13 | Linux 3.2.0-4-amd64 Mon Nov 23
20:42:54 2015
-> 4 0 days, 03:52:31 | Linux 3.16.0-4-amd64 Sat Nov 28
13:42:28 2015
5 0 days, 02:43:54 | Linux 3.2.0-4-amd64 Sun Nov 22
06:53:55 2015
6 0 days, 01:31:52 | Linux 3.2.0-4-amd64 Thu Nov 12
19:24:43 2015
7 0 days, 01:25:02 | Linux 3.16.0-4-amd64 Wed Nov 25
03:31:22 2015
8 0 days, 00:47:35 | Linux 3.16.0-4-amd64 Wed Nov 25
05:14:10 2015
9 0 days, 00:21:52 | Linux 3.16.0-4-amd64 Tue Nov 24
05:17:46 2015
10 0 days, 00:13:09 | Linux 3.16.0-4-amd64 Tue Nov 24
05:40:16 2015
11 0 days, 00:05:37 | Linux 3.16.0-4-amd64 Tue Nov 24
19:28:34 2015
12 0 days, 00:01:55 | Linux 3.2.0-4-amd64 Thu Nov 12
20:57:24 2015
----------------------------+---------------------------------------------------
1up in 0 days, 04:41:43 | at Sat Nov 28
22:16:40 2015
no1 in 497 days, 07:30:50 | at Sun Apr 9
02:05:47 2017
up 498 days, 17:00:02 | since Wed Jul 2
22:16:58 2014
down 15 days, 03:17:59 | since Wed Jul 2
22:16:58 2014
%up 97.054 | since Wed Jul 2
22:16:58 2014
2015-11-12 Jim Stockford & Michael Paoli - retrieved from colo
(the above was done around 8:30am--8:55am)
May have found a bit more data by looking at the logs sooner (e.g.
before powering up again and mounting filesystems rw again), but that
wasn't a priority nor particularly important. Mostly just peeked later
as a curiosity, to see what the physical host box "vicki" could tell us
of it's "adventures" at the colo. There's also the two VMs that were
running on the host - some of their logs may also, likewise, have
provided a bit more detail and confirming evidence, etc.
data sources:
about the electronic equivalent of system log book:
file://vicki/var/local/log/log
Yes it's on the physical host itself, but it's also relatively regularly
backed up to multiple other locations (and heck of a lot easier to
search with grep and friends, and access remotely, rather than thumbing
through pages of physical log book to find some item).
log files on physical host:
file://vicki/var/log/
a present location of backup copy of last data from the "vicki" host
before it went offline:
file://tigger/var/tmp/balug/balug-sf-lug-v2.console.balug.org./root
More information about the sf-lug
mailing list