[conspire] non-aviation examples & also: Re: 737 MAX story keeps getting more fractally bad

Michael Paoli Michael.Paoli at cal.berkeley.edu
Sun Jul 14 22:35:17 PDT 2019


> From: "Rick Moen" <rick at linuxmafia.com>
> Subject: Re: [conspire] 737 MAX story keeps getting more fractally bad
> Date: Fri, 5 Jul 2019 12:26:44 -0700

> Quoting Michael Paoli (Michael.Paoli at cal.berkeley.edu):
>
>> So, ... how do we "fix" it?
>
> Short answer:  Fix the FAA.

> full story because of an absurd amount of compartmentalising of
> information.  See:
> https://www.radicalcompliance.com/2019/06/02/another-lesson-from-boeing-silos/

Yep, I've seen many problems result from excessive siloing.
Mostly significant losses in overall efficiency and general dysfunction,
but silos - especially taken to excess and not otherwise
counter-balanced or worked through/around where needed/appropriate - can
lead to all kinds of problems.

Silos - a semi-random example.  Once upon a time, at $work, I ran into
just such a situation.  There was a fairly complex technical problem - most
notably in that it touched/used so many different components - and so
many different teams, one for each.  I was brought onto the issue to ("help")
solve it.  It became apparent exceedingly quickly that the
greatest problem (or obstacle to solving the problem) wasn't at all
technical.  Essentially there were about half a dozen highly siloed
groups.  The ball would be tossed to each of them in turn, to
check/test/examine the area they were responsible for and to fix
any problems there.  And each such team would take the ball, go,
"Not our problem, we checked all our stuff, it's fine", and shoot the
ball back out with great force, to have it land in (be dropped into) the
silo of another team.  This went on continually, making no real progress.
Along with that, the teams weren't sharing any information.
Their detail on what they did, tested, found, suspected, etc. was
about as terse as noted above - no specific details.
Well, in short order, I explained in reasonably appropriate terms what
needed to be done to solve the problem - most notably, have these
teams meet (that much had been happening), and have them openly
share and communicate information - most notably anything and everything
they knew, or even thought or suspected, about the issue, what they
did/didn't test/review, and how, what data they actually got, etc.
In that environment, several things then happened in rather quick order,
the first two almost immediately:
o the manager responsible for the product / business component highly
   praised me and my communications on the matter, saying, if not
   literally, "That's exactly what needs to be done!"
o I almost instantly got my *ss handed to me and was kicked out the door:
   hardened silo environment - much resistance; anything so much as
   insinuating that anything anywhere within the organization was less
   than 100% perfect was typically a quick ticket out the door.
o the problem got solved, in fairly quick order (<2 weeks; it had been
   going on for many, many months, if not year(s) - it was an intermittent,
   but rather frequently recurrent, very nasty and impactful problem).
And the first two points above both resulted directly from the exact
same communication I'd sent.

Another silo example from $work:
I was asked to investigate/resolve a problem.
In very quick order I find a huge piling-on of redundant processes.
I track down the responsible crontab entry and code.
I examine the code - it has a slight bug: the crontab job kicks it off
hourly, and the code is *intended* to exit at/after an hour, but the check
is flawed, so it can never be satisfied and the code never exits.
I examine and trace back *when* the code was last changed, and when/how
that happened ... I trace it back to the responsible user who
placed the code there (or changed it), and when, and from
what host/IP they'd done so.  That user is from (silo, OMG)
another group.  I politely inform them, essentially,
"Hey, seems we have an issue with <such-and-such> code.
It would appear you placed it there - or modified it <date/time>:
<evidence>.
Not sure who wrote/authored/modified the code, but looks like there's
slight bug in it causing issue.  We have here on
<lines>
looks like that's intended to (exit via check of)
but as it's written that will never do that as:
<how/why the apparently intended fails>
I'm also guessing that this code and bug may not be unique to
exactly and only this one host where I found it.
Could you please follow-up on the issue and see that it gets corrected?
Let me know if you have any questions or wish any assistance.
Thanks.
"
Well, silo(s) - I about got my *ss handed to me on that one, for so much
as suggesting that any code anywhere within the organization might
possibly contain a bug, let alone code originating from some other group.
I was also essentially told "never ever ever do anything with any other
group's code, just have them deal with it" ... of course I didn't even
know the origin of the code until I'd tracked down whence it came -
but by then, "of course", I was highly aware of the nature of the bug in
the code too.  But heaven forbid I find anything wrong in anybody else's
code, or so much as mention or even hint at it.  That $work, and the
silos within it, was much more concerned with appearances and posturing
than with reality or actual quality (or safety, etc.).  (Yes, they also
engaged in much safety theater after majorly screwing up on safety.)
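
As an aside on that cron-job example: the actual code isn't shown here, but
a minimal hypothetical sketch (in Python, with names and details made up
purely for illustration) of that general failure shape - an "exit after an
hour" check written so it can never fire, so each hourly cron invocation
just piles on yet another copy - might look something like:

#!/usr/bin/env python3
# Hypothetical illustration only -- not the actual code from the anecdote.
# Cron kicks this off hourly; it's *intended* to exit after an hour, but
# the exit check is written so it can never be satisfied.
import time

RUN_LIMIT = 60 * 60  # intended: stop after one hour


def do_some_work():
    time.sleep(1)  # stand-in for the job's real work


def main():
    start = time.time()
    while True:
        do_some_work()
        # BUG: 'start' is refreshed on every pass, so the measured elapsed
        # time is always (approximately) zero and the break never happens.
        start = time.time()
        if time.time() - start >= RUN_LIMIT:
            break


if __name__ == "__main__":
    main()

With the refresh of 'start' removed from inside the loop, the check fires
after an hour as intended and the hourly invocations stop stacking up.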

> of that made even a tiny bit of sense from an engineering perspective,
> only from a money-savings corporate-finance one.
>
> But those were not the actions (or inactions) of public servants.  If

I guess I've worked in a fairly diverse range of environments, including
some rather to quite heavily regulated ones.  But maybe not ones where the
$work was also so highly motivated to weaken/destroy the regulation over
it, and/or where there was such a significant (and, in the core product,
quite direct) public safety component.

Regardless, I often run into cases where I'm pushing quite hard, from an
engineering perspective, to "do things right", and, well, what I see
actually done or being done - or even the results - is often
substantially to *way* less than something well engineered.
:-/

Some $work examples that jump to mind:

A significant security incident (could've been *much* worse, but was
quite bad enough).  I ended up being tasked with a fairly large chunk
of the investigation, notably (approximately): what went wrong, where, and
how do we prevent such?
The grossly oversimplified version goes about like this:
there was a code bug, it got exploited; it was a known bug with a patch
available, and it should've been patched (much!) earlier.
My analysis, while also including the above, was more like this (and was
fairly concisely covered in the executive summary of my write-up):
o poor/fragile architecture
o general and numerous failures to follow the principle of least privilege
o excessive trust relationships, e.g.:
   o compromise of the application ID on any one host gives an easy path to
     compromising that same ID across the majority of hosts (>>100), and
     reaches all such hosts within no more than one additional hop via
     trust relationships.
o lack of isolation, e.g.:
   o one application ID used for almost everything, e.g.:
     o runs almost all application processes
     o owns almost all application data
     o owns almost all application binaries
     o owns almost all application configuration files
     o has access to huge volumes of privileged data
     o has access to lots of authentication data - essentially any bit
       anywhere used by the application
     o its authentication credentials are rarely changed and
       insufficiently tracked
     o the application ID has access way in excess of a regular
       unprivileged user login account - and even where components of the
       application are doing operations that need only much more limited
       access, those components have access grossly in excess of what's
       needed
   o about zero use of chroot, BSD jails, cgroups, containers, etc.
   o umask values generally/commonly excessively lax
   o ownerships/permissions generally excessively lax (e.g., in normal
     operations, application ID(s) wouldn't/shouldn't need rw access to
     most or all application binaries and configurations, nor the ability
     not only to create data to be logged, but also to
     remove/truncate/alter most any and all such logged data, etc.)
o lack of validation - untrusted user(/client) input should never be
   passed unsanitized to privileged IDs and/or to any processing/programs
   that may contain bugs or potentially be compromised/exploited.  There
   were multiple opportunities, between the raw user input and the code
   that processed it, to check/sanitize/validate that input - yet there
   were zero validation checks along the way: a privileged application ID
   was used to directly process unchecked/unsanitized/unvalidated raw
   user/client input.  (A rough sketch of the idea follows after this
   list.)
o lack of sufficient logging/monitoring to, e.g., detect breach,
   determine extent of breach, etc.
o lack of testing for various vulnerabilities
o lack of code review (especially review for security) - most developers
   write their own code, write their own tests for it, and it rolls on
   through from development, through that self-same testing, and into
   production - any one developer can put darn near anything into
   production without so much as a review of their code by anyone else.
o (much etc. - the above are just a few key points that jumped to mind
   from memory)
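
Regarding that validation point: here's a rough, minimal sketch (Python;
the helper command, its path, and the whitelist pattern are assumptions
purely for illustration, not from the actual incident) of checking
untrusted client input at the trust boundary, before anything running
under a privileged ID ever touches it:

#!/usr/bin/env python3
# Rough sketch only -- not the actual application.  Validate untrusted
# client input *before* it reaches anything running under a privileged ID.
import re
import subprocess

# Whitelist: short identifiers of letters, digits, '-' and '_' only.
VALID_ID = re.compile(r"[A-Za-z0-9_-]{1,32}")


def handle_request(raw_client_input: str) -> str:
    # Reject anything unexpected at the boundary, rather than passing raw
    # input onward and hoping downstream code copes.
    if not VALID_ID.fullmatch(raw_client_input):
        raise ValueError("rejected malformed client input")
    # Even after validation, avoid the shell: pass an argument vector so
    # no shell metacharacters can ever be interpreted.
    result = subprocess.run(
        ["/usr/local/bin/lookup-record", raw_client_input],  # hypothetical helper
        capture_output=True, text=True, check=True,
    )
    return result.stdout

And, per the least-privilege points, the component that handles raw client
input shouldn't be the one holding the application's credentials in the
first place.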

Security(/engineering) theater, e.g. at some $work: rather than
getting the resources/prioritization to deal with security work that would
really make a difference (like, or similar to, the example above), priority
is placed on the results of some scan tool/vendor, without any intelligence
applied to evaluate the scan results for actual potential risk/impact.
Thus, e.g., lots of resources get burned on security theatre - stuff
that doesn't matter - to make someone's reports look good/clean;
meanwhile, actual existing risks that are clearly and repeatedly pointed
out go ignored - 'cause they fail to show up on the scan reports.

On multiple occasions, and at multiple ${work}s[!], generally in production:
I write - or code-review and substantially rewrite - code, making it
substantially more robust/secure - or at least much more so than it was -
squashing numerous hazards, typically including security issues.  And ...
what I do ends up quite squashed, e.g. with responses like/approximating
(not literal quotes):
"That's too complicated, we don't need all those checks - those things
would never happen - I'm stripping all those out.  Nobody would ever
accidentally run the program a second time, these dozens of qualified
sysadmins we have running this code on many thousands of hosts, they'll
only run it exactly once on each host, per the documented procedure,
they won't make that mistake, and nobody would ever run it on the wrong
host or operating system - no need for the code to check any of that."
- and stuff like that of course for production & critical infrastructure
(at least to $work).
"Naw, I like my original version much better.  We'll implement that.  We
don't need those checks."
"We don't have to worry about race conditions - those are unlikely."
"We don't have to be (that) secure with the temporary files, nobody
would ever guess that or read our code." (or trace our running stuff,
etc.)
"We don't need to encrypt that.  Our network is secure."
"Only (>>100,000) employees/contractors/vendors have access to our
network."
"We have a firewall."  (Otherwise known as:
"Hard crunchy outside, big soft chewy middle.")

> Again, the one bit of essential background reading in all of this, IMO,
> is the IEEE Spectrum article, which please see:
> https://spectrum.ieee.org/aerospace/aviation/how-the-boeing-737-max-disaster-looks-to-a-software-developer

Yes, excellent - read it earlier.



