[sf-lug] High(er) availability for SF-LUG site(s) (& some BALUG stuff too).

Michael Paoli Michael.Paoli at cal.berkeley.edu
Mon Dec 28 00:32:01 PST 2015


Yes, not *quite* a push-button operation ... but almost.

Have it down to a simple copy-paste operation to do the live migrations
(and yes, if I get sufficiently tired of repeating that it may migrate
to a script or two, which would handle that, plus the prerequisite
operations and checks - namely that the target host is up and that the
relevant ssh key(s) are active in ssh-agent).
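
Such pre-flight checks would be nothing fancy - roughly along the lines
of this sketch (using the same target Intranet address and ssh-agent
socket path as in the commands further below):

#!/bin/sh
# sketch only: is the target physical host up, and are key(s) loaded in
# the ssh-agent?
target=192.168.55.2
ping -c 2 -w 5 "$target" > /dev/null 2>&1 ||
    { echo "target $target not reachable" 1>&2; exit 1; }
SSH_AUTH_SOCK=/home/m/michael/.SSH_AUTH_SOCK ssh-add -l > /dev/null 2>&1 ||
    { echo "no key(s) active in ssh-agent" 1>&2; exit 1; }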

And, as we can see here, ... the host that runs the SF-LUG website has
been up much longer than both of the two physical hosts which it runs on
(actively running on only and exactly one of the two physical hosts at
any given time).  (It was eventually time to reboot one of the physical
hosts to pick up some newer kernel changes, and it was also desired to
validate a bit of GRUB reconfiguration - the other physical host
currently generally spends most of its time down, rather than up ...
loud fan(s) and all that).

$ date -Iseconds; ssh -ax sf-lug.org. 'hostname; uptime'; hostname; uptime; ssh -ax vicki.sf-lug.org. 'hostname; uptime'
2015-12-28T00:20:43-0800
balug-sf-lug-v2.balug.org
  00:39:44 up 30 days,  7:18,  7 users,  load average: 0.00, 0.01, 0.05
tigger
  00:20:44 up  1:14, 21 users,  load average: 0.15, 0.20, 0.36
vicki
  00:20:44 up 25 min,  1 user,  load average: 0.08, 0.03, 0.05
$
balug-sf-lug-v2.balug.org is the canonical name of the (virtual machine)
host which the SF-LUG website is presently on.  "tigger" and "vicki" are
the physical hosts it spends its time running atop these days - most of
the time on tigger, but it's generally moved to vicki when tigger won't
be available at the location and IP address that SF-LUG uses.
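
And a quick way to see which of the two physical hosts the VM is
actually running on at any given moment - a rough sketch, using the same
root ssh / agent setup as in the migration commands below:
# virsh list --name
# SSH_AUTH_SOCK=/home/m/michael/.SSH_AUTH_SOCK ssh -ax -l root 192.168.55.2 \
    'virsh list --name'
(whichever of the two lists "balug" is where it's currently running).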

And the typical live migrations to "vicki" and back from it,
respectively, generally look like this:

# SSH_AUTH_SOCK=/home/m/michael/.SSH_AUTH_SOCK time virsh migrate \
    --live --persistent --copy-storage-all --verbose balug \
    qemu+ssh://192.168.55.2/system &&
    SSH_AUTH_SOCK=/home/m/michael/.SSH_AUTH_SOCK ssh -ax -l root \
    192.168.55.2 'virsh autostart balug' &&
    virsh autostart --disable balug

# SSH_AUTH_SOCK=/home/m/michael/.SSH_AUTH_SOCK ssh -Ax -l root 192.168.55.2 \
    'time virsh migrate --live --persistent --copy-storage-all --verbose balug qemu+ssh://tigger.mpaoli.net./system' &&
    virsh autostart balug &&
    SSH_AUTH_SOCK=/home/m/michael/.SSH_AUTH_SOCK ssh -ax -l root 192.168.55.2 \
    'virsh autostart --disable balug && cd / && echo '\''shutdown -h -P +05'\'' | batch; sleep 5'
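
If/when that does finally get scripted, it'd presumably come out roughly
like the following pair of sketches (script names hypothetical; these
are just the same commands as above, wrapped with a little error
handling):

#!/bin/sh
# migrate-to-vicki (sketch): push the balug VM over to "vicki",
# then flip autostart so the VM comes up on (only) that host.
set -e
export SSH_AUTH_SOCK=/home/m/michael/.SSH_AUTH_SOCK
target=192.168.55.2
time virsh migrate --live --persistent --copy-storage-all --verbose balug \
    "qemu+ssh://$target/system"
ssh -ax -l root "$target" 'virsh autostart balug'
virsh autostart --disable balug

#!/bin/sh
# migrate-home (sketch): pull the balug VM back to the laptop,
# flip autostart back, and queue a power-off of "vicki".
set -e
export SSH_AUTH_SOCK=/home/m/michael/.SSH_AUTH_SOCK
source=192.168.55.2
ssh -Ax -l root "$source" 'time virsh migrate --live --persistent --copy-storage-all --verbose balug qemu+ssh://tigger.mpaoli.net./system'
virsh autostart balug
ssh -ax -l root "$source" \
    'virsh autostart --disable balug && cd / && echo '\''shutdown -h -P +05'\'' | batch; sleep 5'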


> From: "Michael Paoli" <Michael.Paoli at cal.berkeley.edu>
> Subject: Re: High(er) availability for SF-LUG site(s) (& some BALUG  
> stuff too).
> Date: Mon, 07 Dec 2015 06:54:48 -0800

>> From: "Bobbie Sellers" <bliss-sf4ever at dslextreme.com>
>> Subject: [sf-lug] SF-LUG meeting of Sunday 6 December 2015
>> Date: Sun, 6 Dec 2015 16:04:17 -0800
>
>>    Michael P.  quickly followed and showed us how he could run
>> the SF-LUG web site on either the notebook he brought with
>> him or on the blade server he has at home.
>
> And yes, did this "exercise" again yesterday.  Live migration of the VM.
> Didn't quite save all the output (some of it already beyond available
> buffer to scroll back in some windows/screens), but between buffer and
> history, etc., went pretty much about like this:
>
> Both NIC cards in "vicki" (the physical hardware box - 1U unit that came
> from the colo) support Wake-on-LAN, so use of handy utility to send some
> of those packets and ... that box powers itself up (no need for me to
> get up or reach to touch a button on it).
> $ wakeonlan 00:30:48:91:97:90
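>
> (Whether Wake-on-LAN is supported and enabled on a NIC can generally be
> checked - and enabled - with ethtool, roughly like so, the interface
> name being whatever it happens to be on that host:
> # ethtool eth0 | grep -i wake-on
> # ethtool -s eth0 wol g
> the latter enabling wake on "magic packet".)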
>
> Then, for simplicity, let's call the physical hosts, source (in this
> case my laptop), and target (in this case "vicki").  "vicki" also has an
> Intranet RFC-1918 address (192.168.55.2), as does my laptop
> (192.168.55.1 - at least most of the time).  With some appropriate ssh
> key(s) in use (at least saves needing to type root password(s)), I start
> the migration of the balug (yes, balug VM - the SF-LUG stuff is hosted
> on that VM) - I also included use of time, to see how long it took.
> With qemu-kvm, we're also able to use --copy-storage-all - which will
> copy over the disk/drive storage used by the VM - so we don't even need
> the VM to have physical storage in common between the source and target
> host.  For security, we use the qemu+ssh method, and use our Intranet
> RFC-1918 IP address.  And why the security concern?  When doing a live
> migration, the contents of the virtual machine's RAM, CPU registers and
> state, all have to be copied over, and in this case, likewise also all
> of its disk/drive storage - so we want to do that in a sufficiently
> secure manner (e.g. most certainly not in the clear across The Internet!).
> # SSH_AUTH_SOCK=/home/m/michael/.SSH_AUTH_SOCK time virsh migrate  
> --live --persistent --copy-storage-all --verbose balug  
> qemu+ssh://192.168.55.2/system && virsh autostart --disable balug
> Migration: [100 %]
> 0.18user 0.12system 6:33.51elapsed 0%CPU (0avgtext+0avgdata 9380maxresident)k
> 32inputs+0outputs (0major+975minor)pagefaults 0swaps
> Domain balug unmarked as autostarted
>
> #
> Also, after the migration successfully completes (the shell's && - if
> the prior pipeline exits with zero (normal successful completion)), we
> then proceed to conditionally execute disabling autostart for that VM on
> the source host.  With libvirt/virsh, if a VM is set to autostart, then
> when the physical host is (re)booted, the VM will also be started up.  At any
> given time, we want this VM to be configured to autostart on only and
> exactly one of the two physical hosts, so, e.g. if either or both
> physical hosts get rebooted, the VM does come up, but/and only on one of
> the two hosts (would conflict if it started on both, and would also not
> be desired for it to come up on neither).  I did also check after
> migration that it was set to autostart on the target - not sure if it
> was or not, but if it wasn't, I set it to do so.
>
> Also, 6:33.51elapsed
> seems kind'a long, right?  Well, but the virtual machine continues to
> run quite continuously that whole time ... with exception of the order
> of around some milliseconds or less.  As it's doing the migration, since
> it's still running, stuff in (virtual) RAM of the VM continues to
> change, as may its storage data too.  All that is tracked, and the bits
> that change are additionally copied, but eventually there's only a bit
> left to copy, and it's pretty active and continues to change.  At that
> point the VM is suspended on source, those last bits are copied over
> (taking around some milliseconds or less), and then the VM is started on
> the target (and gratuitous ARPs sent), and the migration is completed
> and the VM is then running on the target.  Also, since the migration
> includes copy of the storage, that substantially adds to the total
> elapsed real time, but doesn't significantly extend the very short time
> between when VM is suspended on source and started on target - that's so
> short and fast that for most things it doesn't matter at all.
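>
> (A quick way to double-check the autostart setting on a given physical
> host, by the way:
> # virsh dominfo balug | grep -i autostart
> which shows whether the domain is marked to autostart there.)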
>
> And then after coming home, I essentially reverse the process.  Here
> balug-sf-lug-v2.console.balug.org. is a DNS name (canonical) for the
> "vicki" physical host.
> $ SSH_AUTH_SOCK=/home/m/michael/.SSH_AUTH_SOCK ssh -Ax -l root  
> balug-sf-lug-v2.console.balug.org. 'umask 077 && time virsh migrate  
> --live --persistent --copy-storage-all --verbose balug  
> qemu+ssh://tigger.mpaoli.net./system'
> Migration: [100 %]
>
> real    22m7.181s
> user    0m0.268s
> sys     0m0.248s
> One may notice it's an even longer elapsed time.  That's mostly because
> of the laptop storage I have configured for use by the VM - it's very
> space efficient (deduplication plus aggressive compression plus extra
> strong levels of integrity checking on deduplication (not only must the
> hashes match, but a full compare of the data blocks is also done before
> they're considered duplicates and deduplication on the blocks done)) -
> this gives very space efficient storage, at a cost of performance -
> especially write performance - but that's fine and the tradeoff I want
> in this case anyway, as most of the time there isn't all that much write
> activity for that storage, and space is more of a consideration anyway.
> But when doing the migration back to laptop, the write activity is
> necessarily high, and hence it takes a fair bit longer.
>
> And after that migration back to the laptop completes, I disable
> autostart from what was the source ("vicki"), and enable autostart on
> the laptop.
> $ SSH_AUTH_SOCK=/home/m/michael/.SSH_AUTH_SOCK ssh -Ax -l root  
> balug-sf-lug-v2.console.balug.org. 'umask 077 && virsh autostart  
> --disable balug'
> Domain balug unmarked as autostarted
>
> # virsh autostart balug
> Domain balug marked as autostarted
>
> #
>
> And with the VM no longer running on "vicki", I shut vicki down ... and
> with that away goes the annoying relatively loud fan noise.
> $ SSH_AUTH_SOCK=/home/m/michael/.SSH_AUTH_SOCK ssh -Ax -l root  
> balug-sf-lug-v2.console.balug.org.
> root at vicki:~# echo 'shutdown -h -P +02' | batch
> warning: commands will be executed using /bin/sh
> job 13 at Sun Dec  6 21:20:00 2015
> root at vicki:~#
> Broadcast message from root at vicki (Sun 2015-12-06 21:20:08 PST):
>
> The system is going down for power-off at Sun 2015-12-06 21:22:08 PST!
>
>
> root at vicki:~# exit
> logout
> Connection to balug-sf-lug-v2.console.balug.org. closed.
> $
> Most of the time when I do a shutdown, unless I'm in some extreme rush,
> I give it 2 minutes.  Why?  Mostly just in case someone(/something) logs
> in between when I check (I didn't show that above) if anyone else is
> still logged in, and when I start the shutdown - that way they at least
> get a "two minute warning".  Most of the time there's not such urgency I
> need to shut down the host faster than that.  But if I know that no one /
> nothing else could even possibly log in, or I need or want to shutdown
> faster anyway or am not concerned about any other logins, well, then I
> may, and certainly sometimes do, shutdown much faster ("now").  Oh, and
> I fed it into batch(1) in the example above - so I could logout, without
> that possibly impacting the running shutdown.
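>
> (That check for other logins is nothing exotic - e.g. a quick:
> # who -u
> or
> # w
> just before kicking off the shutdown.)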
>
> So ... with all that, most of the time I don't have to put up with the
> fan noise from "vicki", but when I shutdown my laptop or take it out
> with me, I arrange that the SF-LUG site is still up and running -
> migrating it to "vicki" beforehand, and back to the laptop again after
> the laptop is back up and running at home.
>
> ... not quite down to single command scripts ... yet.  But when I get
> sufficiently tired of copy/paste of those few little bits, it'll evolve
> into a pair of scripts.
>
>> From: "Michael Paoli" <Michael.Paoli at cal.berkeley.edu>
>> Subject: High(er) availability for SF-LUG site(s) (& some BALUG stuff too).
>> Date: Thu, 26 Nov 2015 11:30:47 -0800
>
>> High(er) availability for SF-LUG site(s) (& some BALUG stuff too).
>>
>> This covers the web sites & DNS master (list is hosted separately by
>> Rick Moen), and this also covers a fair bit of BALUG stuff too (everything
>> *except* BALUG's [www.]balug.org and list stuff - e.g. wiki, archives,
>> test/beta/staging sites, etc.)
>>
>> Anyway, more-or-less per earlier plan, I did get so far yesterday, as
>> doing the first *live* migration of that host.  And also *without*
>> shared storage (which does also work perfectly fine - just takes a wee
>> bit longer for the actual overall migration, but still when the actual
>> final switch itself happens for the guest VM itself, it's exceedingly fast
>> (on the order of 10s of milliseconds or so? - I haven't actually timed
>> that final bit quite yet)).  So, ... that VM host is no longer "stuck"
>> just on my personal laptop :-) ... which means it can very much remain
>> up and Internet accessible - even when my laptop isn't (or, e.g. travels
>> away from home).
>>
>> $ wakeonlan 00:30:48:91:97:90
>> ...
>> Not 100% anticipated, but not a huge surprise, and easy enough to
>> address - some of the last kinks to be worked out were in allowing the
>> live migration to be successful.  CPU type and flags/capability:
>> error: unsupported configuration: guest and host CPU are not  
>> compatible: Host CPU does not provide required features: popcnt,  
>> sse4.2, sse4.1
>> Ah, so the laptop CPU is a wee bit more modern than that on "vicki" - and by
>> more-or-less default, guest CPU was configured to take advantage of
>> many/most of those host CPU capabilities.  Easy enough to deal with that
>> - bring the VM guest down, reconfigure the virtual CPU to disable those
>> 3 capabilities, bring VM guest back up again, and repeat the attempt -
>> made it fine past that error.  Next glitch was a bit more puzzling:
>> error: internal error: unable to execute QEMU command 'migrate':  
>> State blocked by non-migratable device '0000:00:05.0/ich9_ahci
>> Wee bit 'o search and ... QEMU can't live migrate SATA (at least not
>> yet safely in version I'm using, and at least by default migration of
>> such is disabled for safety reasons).  Bring host down, take virtual
>> hard drive off of SATA, turn it into SCSI and attach it to SCSI ... and
>> ... same error?  Checked configuration again - nothing attached to
>> (virtual) SATA bus/controller, but the SATA bus/controller still there,
>> ... next step, remove those, and repeat ... and ... success, all went
>> fine, no errors:
>> # virsh migrate --live --persistent --copy-storage-all --verbose balug \
>>> qemu+ssh://192.168.55.2/system && virsh autostart --disable balug
>> ... and all went fine and dandy.  And then, live migrating back:
>> # virsh migrate --live --undefinesource --copy-storage-all --verbose \
>>> balug qemu+ssh://tigger.mpaoli.net./system
>> And that went perfectly fine too, not so much as a glitch to notice on
>> the guest VM itself (though the storage replication took a while, so
>> it's not a speedy move from perspective of the physical hosts) ... TCP
>> connections between guest and Internet, etc., all maintained perfectly
>> fine across the live migrations.
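>>
>> (For reference, those guest reconfigurations - disabling the three CPU
>> features, and moving the virtual hard drive off of SATA - are just
>> edits to the domain XML, e.g. via virsh edit balug, roughly along these
>> lines (fragments only; the surrounding elements are whatever the
>> guest's definition already has):
>>   <!-- within the guest's existing <cpu> element: -->
>>   <feature policy='disable' name='popcnt'/>
>>   <feature policy='disable' name='sse4.2'/>
>>   <feature policy='disable' name='sse4.1'/>
>>   <!-- virtual hard drive's <target> switched from the SATA bus to SCSI: -->
>>   <target dev='sda' bus='scsi'/>
>>   <!-- and the now-unused <controller type='sata' .../> element removed -->
>> )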
>>
>> Wee bit more stuff to do / work on ... e.g. (at least theoretically),
>> o Turn it into a (nearly) push-button operation (run one relatively
>>  simple script - or pair of scripts - partially drafted, but yet to
>>  polish those off.).
>> o Investigate/test --copy-storage-inc (see the command sketch just after
>>  this list) - if suitable and safe, that may significantly speed the
>>  disk data copy portion of the migration (some
>>  of the storage I have set up is highly optimized for physical storage
>>  space reduction, but consequently has very low write performance
>>  characteristics - which is mostly quite fine, but slows migration
>>  especially back to laptop; read performance, however, is more than
>>  sufficient.  E.g. on physical host (laptop) we have:
>>  # ls -hnos balug-sda
>>  4.8G -rw------- 1 114 16G Nov 26 18:41 balug-sda
>>  #
>>  Quite efficient (deduplication + compression) space utilization - but
>>  at cost of write performance (and some CPU burn, particularly on
>>  heavy writing, and some more suck of RAM too) - but that happens to
>>  be the trade-off I want the majority of the time for that storage -
>>  so that's highly acceptable (laptop SSD is "only" about 150 GiB ...
>>  and I've a whole lot of other stuff on it too - I'm fine with LUG VM
>>  taking ~5 GiB of physical storage ... but not gonna give it 16 GiB!).
>>  Where it resides, it also does deduplication across some ISOs that
>>  quite correspond to the installed operating system (and also other
>>  data), so that also aids in reducing total physical storage space
>>  consumed.
>>  ).
>> o I'll also carefully review, and likely adjust/tweak other bits of the
>>  migration options and handling of the VMs after migration - mostly
>>  notably bits regarding undefine or not, and autostart or not - and
>>  where.  And of course test it all out more fully.  :-)
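>>
>> (Per the --copy-storage-inc item above: if it does pan out, presumably
>> it would just swap that flag into the migration command, e.g.:
>> # virsh migrate --live --persistent --copy-storage-inc --verbose balug \
>>> qemu+ssh://192.168.55.2/system
>> - but that remains untested here as yet.)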
>>
>>
>>> From: "Michael Paoli" <Michael.Paoli at cal.berkeley.edu>
>>> Subject: Re: sf-lug site & hardware
>>> Date: Tue, 24 Nov 2015 06:32:02 -0800
>>
>>> Just an FYI update.
>>>
>>> So, my (overly optimistic) theoretical timeline - was hoping to have
>>> the sf-lug site relocated onto the higher availability hardware
>>> (notably not on VM on my laptop) by around 2015-11-15 or so.  Have
>>> adjusted the target timeline a bit, after some considerations (and also
>>> being relatively busy with other stuff too).  Anyway, one thing I
>>> didn't fully take into account earlier - fan noise.  That system that
>>> was in the colo - 1U unit, is comparatively noisy (I've gotten a bit
>>> spoiled mostly not listening to fan noise of such volume - even though
>>> it uses a fan and airflow design that mostly avoids tiny 1U high-RPM
>>> fan(s) - it's still noisier than most typical desktop systems - but
>>> less noisy than many typical 1U servers).  So, ... I adjusted my
>>> (theoretical) plans a bit.  With wakeonlan, qemu-kvm live migration,
>>> and wee bit of infrastructure (which I was mostly planning to do
>>> anyway), and small bit of scripting, I could arrange to have the VM
>>> running on the noisier (but higher availability) hardware, mostly only
>>> when it wouldn't be running on my laptop at home.  And with live
>>> migration, the migration would be effectively "invisible" to the guest
>>> VM itself, its state, connections to it and sessions on it, etc.
>>> Anyway, fair bit closer to having that plan fully implemented.  Current
>>> target timeline for completion, by 2015-11-29, or at least not later
>>> than 2015-12-13.  May be fair bit sooner.  I'll update once it's in
>>> place and fully operational (did get a fair chunk of related
>>> infrastructure completed yesterday and today).
>>>
>>> references/excerpts:
>>> https://en.wikipedia.org/wiki/Wake-on-LAN
>>> https://en.wikipedia.org/wiki/Live_migration
>>>
>>>> From: "Michael Paoli" <Michael.Paoli at cal.berkeley.edu>
>>>> Subject: Re: sf-lug site & hardware
>>>> Date: Thu, 12 Nov 2015 14:01:41 -0800
>>>
>>>> FYI, this morning Jim Stockford and I did retrieve the physical server
>>>> host from the colo, upon which, up until some months back, the sf-lug
>>>> web site was running.  So, that improves the hardware resource
>>>> situation.  I'm guesstimating I'll have the sf-lug website again running
>>>> on a VM atop this hardware by sometime this weekend or so - that should
>>>> improve the availability a fair bit (notably the sf-lug website then
>>>> won't go down when my personal laptop goes down, offline, or out the
>>>> door from home).
>>>>
>>>> Thanks Jim!
>>>>
>>>>
>>>>> From: "Michael Paoli" <Michael.Paoli at cal.berkeley.edu>
>>>>> Subject: Re: Have you guys thought about  
>>>>> http://www.freelists.org/ (hosted ...)
>>>>> Date: Wed, 11 Nov 2015 18:26:25 -0800
>>>>
>>>>> would be down or that it wasn't (relatively) high availability (at least
>>>>> compared to virtual machine running on my personal laptop - which does
>>>>> have the sf-lug site go out when my laptop goes out ... hopefully that
>>>>> situation will be improved in near future ... waiting on some resources
>>>>> to be able to do that.)
>>>>>
>>>>> references/excerpts:
>>>>> http://linuxmafia.com/pipermail/sf-lug/2015q4/011454.html
>>>>> http://linuxmafia.com/pipermail/sf-lug/2015q4/011441.html
>>>>>
>>>>>> From: Shane Tzen <shane at faultymonk.org>
>>>>>> Date: Wed, 11 Nov 2015 15:56:14 -0800
>>>>>> Subject: Re: [sf-lug] updated/upgraded: SF-LUG - operating  
>>>>>> system presently hosting
>>>>>> To: Michael Paoli <Michael.Paoli at cal.berkeley.edu>
>>>>>> Cc: SF-LUG <sf-lug at linuxmafia.com>
>>>>>>
>>>>>> Have you guys thought about http://www.freelists.org/about.html ?
>>>>>>
>>>>>> Looks like various LUGs are hosted -
>>>>>> http://www.freelists.org/cat/Linux_and_UNIX
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 30, 2015 at 3:41 AM, Michael Paoli <
>>>>>> Michael.Paoli at cal.berkeley.edu> wrote:
>>>>>>
>>>>>>> It's been updated/upgraded:
>>>>>>> from: Debian GNU/Linux 7.9 (wheezy)
>>>>>>> to: Debian GNU/Linux 8.2 (jessie)
>>>>>>>
>>>>>>> http://lists.balug.org/pipermail/balug-admin-balug.org/2015-October/002989.html
>>>>>>>
>>>>>>> Still definitely *not* high availability though (alas, still sits atop
>>>>>>> a virtual machine on my *laptop*!).
>>>>>>>
>>>>>>> Hopefully in not too horribly distant future (like *real soon*), the
>>>>>>> physical box the site was earlier running upon will be successfully
>>>>>>> retrieved - once that happens, some high(er) availability options
>>>>>>> become possible.
>>>>>>>
>>>>>>> Let me know if you notice anything awry (notwithstanding the less than
>>>>>>> high availability).
>>>>>>>
>>>>>>> From: "Michael Paoli" <Michael.Paoli at cal.berkeley.edu>
>>>>>>>> Subject: It's alive*!: Re: SF-LUG - DNS, web site, ..., etc.
>>>>>>>> Date: Mon, 24 Aug 2015 03:10:26 -0700
>>>>>>>>
>>>>>>>
>>>>>>> Anyway, have taken the liberty ...
>>>>>>>> it's alive* ...
>>>>>>>> the [www.]sf-lug.{org,com}
>>>>>>>> websites are available again.




