[conspire] VMs, qemu-kvm, ...

Michael Paoli Michael.Paoli at cal.berkeley.edu
Thu Dec 20 20:49:34 PST 2018


[And taking this bit on-list, because ... why not?  :-)]

> From: "Rick Moen" <rick at linuxmafia.com>
> Date: Wed, 19 Dec 2018 14:45:11 -0800

> And yes, sure, need to get serious about deploying the CompuLab Intense
> PC.  There's a hard nut to crack before migration:  I need to design a
> SSD-appropriate and security-hardened host OS.  Then I need to set up
> kvm/qemu on top of that -- something I've never done before.  These
> steps need to be solid -- because, unlike the further steps, they cannot
> be redesigned after deployment.

Well, for the most part ...

SSD-appropriate: not a whole lot of changes on that front, other than
mostly much better performance, much less power/heat, and no noise.
And, heck, some parts mostly get a lot simpler - don't have to worry all
that much about where/how things are placed on the drive - certainly
none of that how many cylinders away to seek over between ... all that
goes away.  Modern SSDs are pretty darn good at wear leveling, so most
of the potential "hot spot" wear issues go away.  About the only thing
one may want to pay attention to - physical device block size and
alignment - most notably for writes.  Drive storage is relatively
inexpensive (per unit storage) these days, so don't worry *too* much
about it.  Most of the relevant modern OS tools will pretty much
automagically do the right thing (or reasonable approximations thereof),
if you don't go out of your way to tell them otherwise.  Most notably,
where boundaries start for partitions, etc. on drive - let the OS tools
pick that, and you'll generally be physical write block aligned.
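E.g., a quick sanity check of that (device/partition numbers here are
just examples):

    # is partition 1 aligned to the drive's reported optimal boundary?
    parted /dev/sda align-check optimal 1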

SSD/[{micro,mini}]SD/USB flash/etc. - I wish there were - is anyone aware
of - any tools for/on Linux that give the actual physical write block
size for various flash devices?  Most notably I want to know what's the
*smallest* block size that can be efficiently overwritten - for flash,
"overwrite" requires an erase/write cycle - for all the write block(s)
impacted ... and if the write is less than the block size or not block
aligned, that's inefficient, and increases wear on the flash.
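About the closest thing readily available, as far as I'm aware, is what
the device itself advertises via sysfs/lsblk - typically just a nominal
512 or 4096, and not the real NAND page/erase-block size (device name
below is just an example):

    cat /sys/block/sda/queue/logical_block_size
    cat /sys/block/sda/queue/physical_block_size
    cat /sys/block/sda/queue/discard_granularity
    cat /sys/block/sda/queue/optimal_io_size
    # or, all in one go:
    lsblk -o NAME,LOG-SEC,PHY-SEC,DISC-GRAN,OPT-IO /dev/sda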

Yes, you'll want to spend some time planning the layout and
configuration, implementing, etc.
But don't feel that you'll be - or at least *try* not to be - *too*
locked in.

Not that I'd willy-nilly re-layout and redo my whole base OS and then
put the qemu-kvm atop that again, but ... if I had to - or wanted to,
it wouldn't be all that difficult.  Easier if it's very similar OS, as
I'd know what packages I want, how I want them configured, etc.
But ... all the rest of that?  Like partitioning and how things are
built/layered atop that ... I could significantly - even radically
redo that ... and re-lay down more-or-less what I've already got atop
that, and it wouldn't be all that much different functionally,
but the lower level layout/"plumbing" could be quite a bit different,
... e.g. different sized drives, different hardware (at least within
Debian's amd64 architecture), different partitioning and filesystems,
having or not having and/or changing encryption (LUKS), RAID (mdraid),
LVM, filesystem types (ext2/ext3/ext4/zfs/...) - I could quite
substantially change most all of that pretty easily, and then, atop that,
again lay down more-or-less what I already have.  I've also laid out a
sufficiently flexible architecture that I can also significantly change
things incrementally, and mostly non-disruptively.  Last year I picked
up a used (new to me) Dell laptop.  Turns out it was in many ways better
than the existing laptop I was using (which notably had experienced
numerous hardware failures and was very seriously failing yet again (GPU,
non-replaceable on mainboard, failed yet again outside of warranty)).  So,
taking a bit closer look at the Dell, it looked like a perfectly good
place to migrate to - similar hardware specs to what I had been using,
but in many ways a fair bit better.  So, ... I effectively did a
body transplant - took my perfectly good SSD out of the cr*p
laptop I'd been using, put it in the Dell in place of the Dell's
spinning rust, booted and ran with it - perfectly fine, no changes
needed.  But ... I *wanted* to make changes ... the Dell was more
capable - it had *two* internal drive slots.  I could've also used the
spinning rust that came with the Dell, but instead I opted to add a 2nd
SSD.  I don't recall the exact ordering of the steps, but my older SSD
is "only" 150G, the newer SSD I purchased ... 1T.  So, how'd I get to
where I wanted?  Approximately:
o put new SSD (1T) in primary drive slot (nominally sda)
o put old SSD (150G) in secondary drive slot (nominally sdb)
o booted from recovery media, and nothin' mounted, etc. from either
   of the SSDs (fully ro only), image copy from the 150G to the 1T (or
   maybe what I did was add basic boot stuff to the 1T, but have/leave
   it configured to do all but /boot filesystem from the 150G).
o once I had that done, I could - slowly - make most all my other
   changes live!
o essentially what I did was set up the first 150G of both to be RAID-1
   mirrored.  I'd keep all the more important data here.
o everything beyond that first 150G on the 1T was set up for "less
   important" data - not actually protected RAID-1 data.
o but I went even further than that - that not-(exactly)-RAID-1
   (unprotected) space on the 1T (all but the first 150G), I set up
   slightly funkily under mdraid - I set it up as RAID-1 under mdraid
   ... *but* configured with only *one* device on that RAID-1 (rather
   than a 2-device array missing a member, which would normally show as
   "degraded") - notably so it wouldn't continually nag me about a
   missing device.  Why in the heck do that?  Future coolness.  Say
   1/3/6/12/24 months from now I want to dump my 150G and replace it with
   1T (or larger) SSD.  Then I can fully RAID-1 mirror that (up to) 1T
   very conveniently and easily - just tell mdraid that the RAID-1
   nominally has 2 devices (which would be the normal case anyway), add
   the missing device, mdraid mirrors it, and then I'm to a nominal
   mirrored RAID-1 (see the sketch after this list).  No complications
   or outages of having to figure out
   how to "slide" RAID-1 infrastructure under storage that has no RAID
   protection at all.  The only bit I have to be a little careful about,
   is when I use/allocate storage, is it actual real protected RAID-1,
   or is it "fake" RAID-1, where it's effectively degraded (zero
   redundancy).  Since I've got LVM involved too, I could've done it one
   of (at least) 2 different ways.  Separate volume groups - I could have
   different volume groups for different kinds of storage.  That would
   make it more difficult to accidentally allocate/use storage from the
   "wrong type" of storage.  However, that would make it much more
   difficult to move allocated storage from one type to the other.
   Instead, I went a different route - I (at least for all that) mostly use
   one volume group.  Key advantage to that - I can dynamically move
   things between different types of storage (notably (real) RAID-1 and
   unprotected (no redundancy)).  I suppose if I wanted to make it harder
   for me to accidentally allocate storage of "wrong type" under that
   scheme, I could also create PV groups - one for real RAID-1, and one
   for unprotected - but I've not bothered.  I do also have a program I've
   written that very quickly and easily lets me see what storage I've
   got where - so it's easy to catch if I've put stuff I want protected
   where it's not - or vice versa.  But I probably ought to (also) set up
   the PV groups - that would make it even easier, as I wouldn't need to
   check/look to see which PV(s) have space - I could just name the PV
   group when allocating the storage, and LVM would automagically find
   storage of the desired type, without my need to check individual PVs.
   (Sounds like such a good idea, maybe I'll get to it ... soon?)
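A minimal sketch of that future "add the mirror" step, plus the live
move of an LV between the two kinds of storage within the one VG -
device/VG/LV names here are purely examples:

    # turn the 1-device RAID-1 into a real 2-device mirror:
    mdadm /dev/md2 --add /dev/sdb3
    mdadm --grow /dev/md2 --raid-devices=2
    # and, with one VG spanning both kinds of storage, shift a given
    # LV's extents live from the unprotected PV to the mirrored PV:
    pvmove -n somelv /dev/md2 /dev/md0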

So, sure, you (Rick) and I, may have significantly differing opinions of
how to ideally lay out a smallish to mediumish sized Linux server
(notably disk layout).  And I certainly wouldn't advocate such
complexity (or anything close to it!) for a newbie.  But ... :-) for
seasoned pro Linux sysadmins :-) ... I'd recommend something that offers
a good deal of flexibility - including, for the most part, the ability to
very substantially change much/most of that, and most of it with zero
downtime.  So, yes, sure, I've got a complex set of layers, ... but it
gives me *lots* of flexibility (and quite a bit of security too).  It's
basically (at least currently):

/boot - is mdraid RAID-1; that and all else on the disks is set up so the
OS can boot from either drive just fine.
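(For the boot-from-either-drive bit, on a BIOS/grub-pc Debian setup
that's roughly just - device names being examples:)

    # install the boot loader onto both members of the /boot mirror:
    grub-install /dev/sda
    grub-install /dev/sdb
    # or persist that choice via:  dpkg-reconfigure grub-pc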

Everything else is sliced into a fair number of partitions.  Why?
Flexibility.  If I ever want to give some partition(s) over to something
else for whatever reason, that's pretty darn easy to do.  Just move
(generally live) any data on there to elsewhere, deallocate anything(s)
using it, if relevant change partition type, and it's set for whatever
else.  If I need bigger partition(s), I could effectively join adjacent
partition(s) - though I'd likely have to do one reboot in there somewhere
to prevent things (OS/kernel/...) from getting confused (I think in theory
there are ways to muck about with that on a running kernel - but seems
quite hazardous to me ... maybe if some day I play around with it
enough to figure out how to reliably do it without breaking things ...
in the meantime, won't be doing that on hosts where I care about their
stability/data).

Atop partitions - LUKS encryption (excepting those used for mdraid
RAID-1 for /boot - /boot can't be encrypted).  Really no reason *not* to
encrypt most all drive data these days - it's negligible overhead on
modern CPUs.  And if you need unattended (re)boots, you can make
keys/passphrases available to the boot(/initramfs) process, so it can
decrypt and use ... but you could put those keys on something separate
from the drives - e.g. [micro]SD, USBflash - something small, cheap,
securely backed up elsewhere, and that you're more than willing to
totally destroy if called for for security reasons (one can't really
securely wipe flash - but there are ways to destroy it).  If one wanted,
one could put all of /boot on such a device too - then the drives
would all be encrypted (less the partition table, any drive slack/unused
space outside of the partitions, and some LUKS header data).  With them
all encrypted like that, essentially nothing is ever written in the
clear on them.
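A rough sketch of that keys-on-a-separate-small-device approach for
unattended (re)boots - paths/devices here are just examples, and the
exact initramfs handling of keyfiles varies a bit by release:

    # generate a keyfile on the small removable device:
    dd if=/dev/urandom of=/media/keys/root.key bs=64 count=1
    chmod 0400 /media/keys/root.key
    # add it as an additional LUKS key slot:
    cryptsetup luksAddKey /dev/sda5 /media/keys/root.key
    # reference it in /etc/crypttab, e.g.:
    #   root_crypt  UUID=<luks-uuid>  /media/keys/root.key  luks
    # and rebuild the initramfs so it can use the key at boot:
    update-initramfs -u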

So, atop LUKS, I've got mdraid - real RAID-1, and unprotected (fake
unprotected single device RAID-1).  And atop that I've got LVM.
LVM is mostly used for LVs (Logical Volumes) which are mostly used for
filesystems.  But some serve other purposes.  E.g. swap.  Yep, and set
up that way, I can dynamically add - or remove - swap.  And suspend to
disk / hibernate actually even works like that - even with the
encryption and all (pleasantly to my surprise - at least on Debian,
anyway - though I mostly don't use that, due to some other (notably
video) glitches I've not bothered to sort out ... yet).
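E.g., adding (and later removing) swap on the fly that way is just -
VG/LV names hypothetical:

    lvcreate -L 4G -n swap1 vg0
    mkswap /dev/vg0/swap1
    swapon /dev/vg0/swap1
    # ... and to take it back out later:
    swapoff /dev/vg0/swap1
    lvremove /dev/vg0/swap1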

> The last step is to migrate to a VM.  This could even be a _literal_
> copy of the current running system for starters.  The point is that then
> I'd have the breathing room to construct at leisure the next production
> system in a second VM, and flip which of the two is production each flag
> day.

Yep.  Some VM infrastructure/software even provides tools for that.
But on, e.g. Debian, with qemu-kvm, it's almost dead simple.
When you do your VMs, use raw disk image format.  Then you just need a
whole image of the drive(s) of the existing physical host, configure your
VM to use those images as the drive(s), and you're set.  They can
even be sparse files - and if OS, etc., is "new enough" to support
trim, discard, and similar, it can even use and continue to use
sparse files efficiently - releasing blocks that are no longer needed
all the way through from VM down to the physical layers.
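A sketch of that - names hypothetical, and the discard pass-through
depends a bit on qemu/libvirt versions and the virtual disk bus:

    # a sparse raw image - no blocks actually allocated up front:
    qemu-img create -f raw guest0.img 100G
    # in the libvirt domain XML, let guest discards punch holes in it:
    #   <driver name='qemu' type='raw' discard='unmap'/>
    # then, inside the guest, trim periodically (or mount with discard):
    fstrim -av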

If you have a bunch of similar data for, e.g., VMs, ISOs, etc.,
one can use filesystem(s) that do deduplication - you'll save
quite a bit of drive space ... but at a cost of CPU, RAM, and (virtual
drive) I/O performance.  (when I was at "only" 150G physical, I was
aggressively doing that - virtual drive write performance heavily sucked,
but *lots* of physical storage space was saved - wouldn't have otherwise
been feasible to be doing nearly as much with VMs as I was in the
limited amount of physical space I had for them).
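(E.g., with ZFS - dataset name hypothetical, and note the dedup table
itself eats a fair bit of RAM:)

    zfs set dedup=on tank/virt
    zpool list    # the DEDUP column shows the ratio achieved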

So, ... taking that image - if one has the infrastructure under it,
it could be safely done live - e.g. remount filesystems ro, or unmount
where feasible.  Any that need remain rw, do snapshot - e.g. with LVM.
That doesn't give a "clean" filesystem, but gives a "clean enough"
filesystem - it's as if someone pulled the plug on it.  Should be fully
recoverable to consistent state.  If you read a live rw filesystem
end-to-end, you've got zero guarantees that you end up with something
that's even recoverable.  If one doesn't have snapshot capability or
the like, for any rw filesystems that otherwise remain, schedule a bit of
an outage.  Take the host down, boot from recovery or similar media.
Don't rw mount any of the host's nominal filesystems.  Do a full image
copy of the drives to suitable location/media.  Then you have the full
drive(s)'s image data - all set to go - don't even have to do any
recovery of any of the filesystems - they should be fully clean.
Can always start VM with virtual disk(s) exactly as they were on
the PV that preceded it.
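A sketch of the snapshot variant, for a filesystem that has to stay rw
- VG/LV names and sizes are just examples:

    # freeze a point-in-time image of the live LV:
    lvcreate -s -L 8G -n rootsnap /dev/vg0/root
    # copy it out (conv=sparse keeps the output file sparse):
    dd if=/dev/vg0/rootsnap of=/backup/root.img bs=1M conv=sparse
    lvremove /dev/vg0/rootsnap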

Could also get there incrementally ... e.g. rsync.  rsync as needed
from physical to image for virtual (notably filesystem data) - note that
one will have to cover partitioning and bits beyond data within the
filesystems by other means (such as an initial image copy).  Can then
periodically - and much more quickly - refresh the target with rsync ...
to guarantee a fully consistent refresh of the target, get all the source
filesystems to ro - booting from recovery media if necessary to do so,
... but that rsync update/refresh should (at least generally) be much
faster than full disk image copying.
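The refresh itself is roughly just (paths hypothetical - and mind the
trailing slashes):

    rsync -aHAXx --numeric-ids --delete /mnt/source-fs/ /mnt/target-fs/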

Anyway, not too hard to go from physical to virtual ... or even vice
versa.  Heck, I've got a USB flash drive that's the storage for
a host ... is it physical, or is it virtual?  Well, that depends.
I can boot direct from it on hardware - then it's physical.  I can also
boot my VM with that same storage as the drive image - then it's virtual.

Not too many years back, did some software repair of linuxmafia.com
host via similar means ... got disk images (or at least relevant
portions thereof).  Used that as image for VM ... repeatedly, restarting
from same earlier point (copy of that same data) as needed, to work out
a "recipe" to fully fix the software (etc.) state on linuxmafia.com.
Once the "recipe" had been worked out on the virtual, did same to the
physical, and linuxmafia.com was then back up and operational again.

> In addition to supporting subsequent quick cutovers between the
> production VM and the beta VM, this would also get me off the scarily
> ancient PIII hardware, finally.

Yep, one of the great advantages of VMs ... move/copy 'em around, run
'em on non-ancient hardware, emulate most older hardware if/as needed,
etc.

CompuLab Intense PC.  Only one?  With two, or at least two suitable
physical hosts, can migrate VM from one physical host to another.
Can even do live migrations!
So, yep, ... could, with multiple suitable physical hosts (e.g. 2),
move VM from physical host A to physical host B, totally redo physical
host A (or do hardware maintenance, or replace/upgrade hardware ...
whatever) ... then migrate the VM back, ... then, if desired, do likewise
to physical host B.  Heck, I've got a VM (balug) that's gone from
physical shared host on desktop tower in a colo, to a Xen VM (maybe
first on that same desktop tower, then) on a 1U host in another colo,
to a qemu-kvm VM, to same on my laptop(!), to bouncing semi-regularly
between that and same on 1U host in my residence, to VM on my now
current laptop (where it also still goes back and forth between that and
the 1U host in my residence semi-regularly).  Oh, and why all the bouncing
back and forth?  That 1U sucker is *loud* (also sucks more power) ...
but that doesn't bother me when I'm not home to hear it, so, when my
laptop needs to go out with me or take an outage, the VM goes to the 1U,
when my laptop is home and up per nominal, the VM goes back to the
laptop and the 1U is taken down.
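With libvirt/qemu-kvm that's roughly (guest/host names hypothetical):

    # storage shared or already present on the target host:
    virsh migrate --live guest0 qemu+ssh://otherhost/system
    # or have libvirt copy the disk image(s) over as part of the move:
    virsh migrate --live --copy-storage-all guest0 qemu+ssh://otherhost/system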

Anyway ... I've got a fair amount of qemu-kvm experience.  Even have
some handy programs I wrote (or at least one in particular) - I find it
*highly* handy for creating VMs ... from *whatever* (build from ISO
image, build around disk image, PXE boot from network and just run live,
or do install from that, etc.).  At least on Debian, there are some
qemu-kvm (and related) packages that are very handy/useful to have.
Let's see, on Debian stable, I currently have ... (probably not
everything that's relevant, and likely some bits that may not be
relevant ... and probably many/most brought in as dependencies ... also
have a fair bit of QEMU to also support virtual foreign architectures):
Name                     Description
ipxe-qemu                PXE boot firmware - ROM images for qemu
libvirt-bin              programs for the libvirt library
libvirt-clients          Programs for the libvirt library
libvirt-daemon           Virtualization daemon
libvirt-daemon-system    Libvirt daemon configuration files
libvirt-glib-1.0-0:amd64 libvirt GLib and GObject mapping library
libvirt0                 library for interfacing with different virtualization systems
python-libvirt           libvirt Python bindings
qemu                     fast processor emulator
qemu-efi                 UEFI firmware for 64-bit ARM virtual machines
qemu-kvm                 QEMU Full virtualization on x86 hardware
qemu-slof                Slimline Open Firmware -- QEMU PowerPC version
qemu-system              QEMU full system emulation binaries
qemu-system-arm          QEMU full system emulation binaries (arm)
qemu-system-common       QEMU full system emulation binaries (common files)
qemu-system-mips         QEMU full system emulation binaries (mips)
qemu-system-misc         QEMU full system emulation binaries (miscellaneous)
qemu-system-ppc          QEMU full system emulation binaries (ppc)
qemu-system-sparc        QEMU full system emulation binaries (sparc)
qemu-system-x86          QEMU full system emulation binaries (x86)
qemu-user                QEMU user mode emulation binaries
qemu-user-binfmt         QEMU user mode binfmt registration for qemu-user
qemu-utils               QEMU utilities
virt-manager             desktop application for managing virtual machines
virt-viewer              Displaying the graphical console of a virtual machine
virtinst                 Programs to create and clone virtual machines
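And, e.g., wrapping a VM around an existing raw disk image with those
tools is roughly just - names/paths hypothetical:

    virt-install --name guest0 --memory 2048 --vcpus 2 \
      --disk path=/var/lib/libvirt/images/guest0.img,format=raw \
      --import --graphics none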
