[conspire] VMs, qemu-kvm, ...

Rick Moen rick at linuxmafia.com
Fri Dec 21 21:58:49 PST 2018


Quoting Michael Paoli (Michael.Paoli at cal.berkeley.edu):

> Well, for the most part ...
> 
> SSD-appropriate not a whole lot of changes on that, other than mostly
> much better performance and much less power/heat, and no noise.

Well, no, there's more to it than that (some of which, to give credit
where due, you did get around to further down).  I cannot cover the
subject comprehensively here, and really need to instead do an offline
outline of site design, but here are some points:


1.  One needs to make sure TRIM is taken care of, e.g., doing periodic
fstrim operations.  This is vital over the longer term for SSD
performance, and simply didn't exist for hard drives.  (Early on,
continuous TRIM using the 'discard' mount option was recommended, but no
longer is, for multiple reasons including problems with some vendors'
SSDs, Samsung's among them.)
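
For concreteness, a minimal sketch of what I mean by periodic fstrim
(schedule and paths are merely illustrative, not a recommendation):

    # On hosts where util-linux's fstrim.timer is shipped and systemd
    # is in use:
    systemctl enable --now fstrim.timer

    # Or, e.g., a tiny /etc/cron.weekly/fstrim script:
    #!/bin/sh
    # Trim every mounted filesystem that advertises discard support
    exec /sbin/fstrim --all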

2.  It's profoundly and pervasively important for good system design
that the hard drive concept of 'seeking' doesn't exist for SSDs.
(Rotational latency doesn't exist either, but complete absence of
seeking has a more-profound effect.)  SSDs are random-access:  Every
sector takes exactly the same amount of time to reach as every other
sector.  No physical movement of parts must occur to reach a storage
location, there being no moving parts to impose that complication.

Because of the above, all the tricks I used for decades to order
filesystems in a way to minimise average seek distance and seek time 
are obsolete on SSDs.  So, one's optimal partition map will look
entirely different.  

3.  As I'm pretty sure I mentioned, aligning partition boundaries to
the SSD's erase block size is vital to performance and SSD longevity.
(I hear that all modern partitioning tools assist with this.)
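
E.g., a quick sanity check with parted (device name is a placeholder):

    # Create partitions on 1MiB boundaries, which covers the common
    # 128kB/256kB erase-block sizes:
    parted -a optimal /dev/sdX mkpart primary ext4 1MiB 20GiB
    # Verify the alignment of partition 1:
    parted /dev/sdX align-check optimal 1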

4.  I'll want ext4 (since I'm not up for btrfs or ZFS).  Some filesystems
can get noatime (vs. default relatime), or chattr +A on specific files
and directories.  And I'll need to take a renewed look at the other
mount options (and mkfs options such as no ext4 journal).
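
Roughly what I have in mind, with device names and paths purely
illustrative:

    # /etc/fstab entry mounted noatime:
    /dev/sdX2  /srv  ext4  noatime,errors=remount-ro  0  2
    # ...or keep relatime and exempt specific trees instead:
    chattr -R +A /srv/cache
    # ext4 without a journal (the mkfs option mentioned above):
    mkfs.ext4 -O ^has_journal /dev/sdX3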

5.  Greater use of tmpfs and no use of swap are called for.  And maybe a
persistent RAMdisk with periodic sync, e.g., maybe for the logfiles.
(The better solution in the long term for logs is a remote loghost, but
I'm not there yet.)  Need to consider what in addition to /tmp and 
/var/tmp to mount as tmpfs.
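
Sketch of the sort of thing I mean (sizes are placeholders, not
recommendations):

    # /etc/fstab lines for tmpfs mounts:
    tmpfs  /tmp      tmpfs  nosuid,nodev,noatime,size=512m  0  0
    tmpfs  /var/tmp  tmpfs  nosuid,nodev,noatime,size=256m  0  0
    # ...and simply no swap entry at all (swapoff -a to disable any
    # existing swap before removing its fstab line).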

6.  Switch to the 'deadline' low-latency I/O scheduler.
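
E.g. (sdX is a placeholder; on newer blk-mq kernels the equivalent
scheduler is called mq-deadline):

    # Per-device, at runtime:
    echo deadline > /sys/block/sdX/queue/scheduler
    # Check available/current schedulers:
    cat /sys/block/sdX/queue/scheduler
    # Persistently, via the kernel command line on older kernels:
    #   elevator=deadline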


I _think_ that's the full set of SSD-specific Linux concerns, but the
point is that there's significant value in reviewing this matter and
getting it right.  For decades, I heard people confidently tell me I was
being silly in being careful and methodical about choosing specific
filesystems suitable for purpose, mountpoints, and filesystem
arrangements to minimise hard disk wear and improve performance -- but I
notice I got better results, including their hard drives suffering early
failure while mine did not.



> SSD/[{micro,mini}]SD/USB flash/etc. - I wish there was - is anyone aware
> of, any tools for/on Linux that give the actual physical write block
> size for various flash devices?  Most notably I want to know what's the
> *smallest* block size that can be efficiently overwritten - for flash,
> "overwrite" requires an erase/write cycle - for all the write block(s)
> impacted ... and if the write is less than the block size or not block
> aligned, that's inefficient, and increases wear on the flash.

This is a troubling question, because the clear trend in commodity x86
hardware has always been towards hardware that conceals hardware
essentials and presents a convenient abstraction to OSes, for the simple
reason that 85% of it will run Redmondware.

What I read is that 'erase blocks' are 128kB or 256kB, depending on the
SSD generation -- and that appears to be the literal answer to your
question.  An erase block is typically 64 consecutive 'pages'.  Page
size again depends on SSD generation: 2kB for the very earliest
(2007-2008), then 4kB around 2009, then 16kB starting around 2011.  But
how does one probe this from software?  No idea.
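
The kernel does expose what the device chooses to *report*, though in
my understanding those figures usually reflect the drive's emulated
sector geometry rather than the real NAND page/erase-block sizes:

    # Device-reported values only; take with a large grain of salt:
    cat /sys/block/sdX/queue/physical_block_size
    cat /sys/block/sdX/queue/optimal_io_size
    cat /sys/block/sdX/queue/discard_granularity
    lsblk --discard /dev/sdX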


> Yes, you'll want to spend some time planning the layout, and
> configuration, and implementing, etc.
> But don't feel that you'll be - or *try* not to be *too* locked in.

I'd really rather not run a significant risk of needing to blow away the
fundamentals of the host-OS environment in order to re-do it.  My basic 
idea is to have the host environment be very sparse, very stable (e.g.,
locally compiled, hardened kernel with all essential code compiled in
statically, classic static /dev with no udev, thanks), very carefully
controlled, etc.  _And conceptually simple_, which helps the stability
and keeps things easy to understand, which is the key to security and
reliability.

> Since I've got LVM involved too, I could've done it one of (at least)
> 2 different ways.

Above is a key to why I've consistently chosen to eschew LVM/LVM2:
It's yet another abstraction layer that doesn't justify its added
complexity by solving a problem I need to solve.  I've not encountered
the need to grow/shrink filesystems.  Instead, I just plan well enough.
The one time I needed to alter filesystems in production was during the
summer of rolling blackouts (2001), when ext2 ruled the roost and housed
my data.  Getting tired of my Debian server booting after a blackout to
a single-user fsck prompt while I was at work, I built XFS utils and a
kernel supporting it, scheduled downtime on a weekend, and, on an
offline quiescent system, moved data around (one filesystem at a time) 
using rsync and some normally unallocated HD space until all filesystems
were on XFS, and then rebooted to full operating mode.

I really don't need an extra FS abstraction layer, because I see little
value in 'dynamic reallocation' vs. just a couple of hours of downtime
and simple use of rsync.  Therefore, I don't want it.
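
By 'simple use of rsync' I mean nothing fancier than (paths purely
illustrative, source and destination filesystems quiesced):

    rsync -aHAXS --numeric-ids /mnt/oldfs/ /mnt/newfs/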

> So, sure, you (Rick) and I, may have significantly differing opinions of
> how to ideally lay out a smallish to mediumish sized Linux server
> (notably disk layout).  And I certainly wouldn't advocate such
> complexity (or anything close to it!) for a newbie.  But ... :-) for
> seasoned pro Linux sysadmins :-) ... I'd recommend something that offers
> quite a deal of flexibility - including, for the most part, ability to
> very substantially change much/most of that, and most of it with zero
> downtime.  So, yes, sure, I've got a complex set of layers, ... but it
> gives me *lots* of flexibility (and quite a bit of security too).  It's
> basically (at least currently):

Booting to a maintenance session (live distro or otherwise) and just
rebuilding what needs rebuilding (after safeguarding files) is more than
enough of that flexibility for me, and doesn't come at the high cost of
IMO totally unjustifiable system complexity.  So, absolutely no for this
use-case.



> /boot - is mdraid RAID-1, that and all else on the disks is set so OS can
> boot from either just fine.
> 
> Everything else is sliced into a fair number of partitions.  Why?
> Flexibility.  

I have my own notions of proper server partitioning, and am working on
revising them for SSD-only operation.

I've been pondering how to do the guest-OS storage.  To my
disappointment, I think I'll not be able to use device passthrough, as
it seems intended for passing through whole LUNs.  Pity,
that, as otherwise I could have md0 on the host house
the entire small, tight host, and then the next several mdN devices pass
through to the first guest, and an equal number pass through to the
second guest.  However, I'm not _sure_ it's impractical to pass through
/dev/sdXN.  This page demonstrates doing it with Red Hat's graphical
virt-manager thing:
https://plus.google.com/103618577142405188604/posts/bUCTbUjfQaj
If it can be done with virt-manager, which is just a bloatware front-end
to libvirt, then it can be done with other libvirt-compatible toolsets
such as virtinst.  
https://www.cyberciti.biz/faq/install-kvm-server-debian-linux-9-headless-server/

I will know more about this after some experimentation.
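
For what it's worth, the non-graphical equivalent appears to be just
handing virt-install a block device as the disk path -- a sketch only,
with all names below being placeholders:

    # virt-install is from the virtinst package; /dev/sdX5 stands in
    # for whatever partition gets passed through to the guest:
    virt-install --name guest1 --memory 4096 --vcpus 2 \
        --disk path=/dev/sdX5,format=raw,bus=virtio \
        --cdrom /var/tmp/debian-installer.iso \
        --os-variant debian9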


> Atop partitions - LUKS encryption (excepting those used for mdraid
> RAID-1 for /boot - /boot can't be encrypted).  Really no reason *not* to
> encrypt most all drive data these days - it's negligible overhead on
> modern CPUs.

I disagree.  Simplicity.

Complexity must justify itself as needed for solving a worthwhile
problem that I've decided to address.  That is not the case for my
server use-case.


> So, atop LUKS, I've got mdraid - real RAID-1, and unprotected (fake
> unprotected single device RAID-1).  And atop that I've got LVM.

I heartily concur with the value proposition of md-driver RAID1 for my
server use cases.  I do not perceive a compelling value proposition for
LVM/LVM2 in those same use cases.  The abilities it adds are ones I feel
I can accomplish in other ways without LVM's significant increase in
system complexity.
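
(For comparison's sake, the md side is essentially a one-liner -- device
names being placeholders:

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX2 /dev/sdY2

-- which is part of why its complexity cost strikes me as acceptable.)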


> Yep.  Some VM infrastructure/software even provides tools for that.
> But on, e.g. Debian, with qemu-kvm, it's almost dead simple.
> When you do your VMs, use raw disk image format.

Yes, this is obviously the path of least resistance.  However, I'll look
into passthrough first.
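
If I do end up going the raw-image route, the creation step is simply
(path and size being placeholders):

    qemu-img create -f raw /var/lib/libvirt/images/guest1.img 20G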

 
> CompuLab Intense PC.  Only one?

Only one fully server-appropriate set of hardware -- because I'm not
like the billionaire industrialist S. R. Hadden in _Contact_ whose motto
was 'Why build one, when you can build two at twice the price?'.
However, I also now own two other Zotac mini-ITX bitty-boxes.  They
don't top out with as much RAM, and -- more damningly -- are woefully
deficient in ability to support md-driver RAID1 on anything besides USB.

IIRC, one of those boxes does have an M.2 socket in addition to a 2.5"
drive bay, so that actually isn't hopeless.  But the Zotac units need
testing, in any event.

> Anyway ... I've got a fair amount of qemu-kvm experience.

Cool.  I might yelp for help, at some points.





