[conspire] Feynman's wisdom (was: 737 MAX story keeps getting more fractally bad)
Rick Moen
rick at linuxmafia.com
Sun Jul 14 21:47:01 PDT 2019
I wrote:
> There's something more fundamental that might be getting lost here,
> though: A dynamically unstable passenger jet is just not OK.
[...]
> Attempting to kludge away that gross design defect by making
> software counteract the craft's power-pitch coupling problem is, um,
> nuts. And thus, it needs to be said: The software fix Boeing is
> currently trying to sell will ignore the big underlying problem, that
> a dynamically unstable passenger jet is just not something that should
> ever be approved and taken to market. _That_ is a red flag.
In chatting about this matter with my friend Duncan MacKinnon today,
I was reminded about Nobel physicist Dr. Richard P. Feynman's famous,
timeless independent report[1] on the space shuttle Challenger disaster
-- a profoundly insightful piece whose many fans include top software
security expert and BSD notable Marcus J. Ranum:
Feynman's "Observations on the reliability of the shuttle" should be
required reading for anyone who works in an engineering field, because
it is the very juice squeezed from pearls of wisdom. Clearly and
apparently effortlessly, Feynman weaves together an explanation of how
testing, safety, process control, and engineering discipline fit
together. Think about that for a second. Feynman writes a reference
masterpiece in eleven pages.
When I read Feynman's "Observations" I cannot help but think about
computer security and how virtually everything Feynman has to say about
how _not_ to do engineering, and how to achieve safety, is right there
in black and white. In fact, every time I re-read "Observations" I lose
my temper, because I am confronted with a clear, conclusive argument
that, in computing, we continue to do things that are really, really
stupid. It is ironic that in "Observations" the only kudos Feynman has
to give are to the software development practices NASA used for the
shuttle. Compare his description of that process with how 100% of
commercial code is written today - you may find it instructive.
When the Columbia exploded on re-entry (things moving 12,000mph don't
"break up", they explode) I could hear the ghost of Richard Feynman
yelling "dang it!" from wherever physicists go when they die. In
"Observations", Feynman is careful to point out, in his discussion of
"safety margin", that, if hot gasses are not _designed_ to jet past
and damage an O-ring, then there is no such thing as a "safe" flight
in which that happens _to any degree_. You could take "Observations"
and staple a post-it note to the front of it reading, "...and if the
design doesn't say that heat resistant tiles are _supposed_ to fall
off, then there is no 'safety margin' that justifies flying the
shuttle if they do."
https://www.ranum.com/editorials/must-read/
For those who haven't yet read Feynman's insightful short paper, a
cached copy at Ranum's site is here:
https://www.ranum.com/security/computer_security/editorials/dumb/feynman.html
Here's part of what Ranum is talking about, quoted from Feynman's
eleven pages:
The phenomenon of accepting for flight seals that had shown erosion
and blow-by in previous flights, is very clear. The Challenger flight is
an excellent example. There are several references [in its
certification record and Flight Readiness Reviews] to flights that had
gone before. The acceptance and success of these flights is taken as
evidence of safety. But erosion and blow-by are not what the design
expected. They are warnings that something is wrong. The equipment is
not operating as expected, and therefore there is a danger that it can
operate with even wider deviations in this unexpected and not thoroughly
understood way. The fact that this danger did not lead to a catastrophe
before is no guarantee that it will not the next time, unless it is
completely understood. When playing Russian roulette, the fact that the
first shot got off safely is little comfort for the next. The origin and
consequences of the erosion and blow-by were not understood. They did
not occur equally on all flights and all joints; sometimes more, and
sometimes less. Why not sometime, when whatever conditions determined it
were right, still more leading to catastrophe?
In spite of these variations from case to case, officials behaved as if
they understood it, giving apparently logical arguments to each other
often depending on the "success" of previous flights. For example, in
determining if flight 51-L was safe to fly in the face of ring erosion
in flight 51-C, it was noted that the erosion depth was only one-third
of the radius. It had been noted in an experiment cutting the ring, that
cutting it as deep as one radius was necessary before the ring failed.
Instead of being very concerned that variations of poorly understood
conditions might reasonably create a deeper erosion this time, it was
asserted that there was "a safety factor of three." This is a strange use of
the engineer's term "safety factor." If a bridge is built to withstand
a certain load without the beams permanently deforming, cracking, or
breaking, it may be designed for the materials used to actually stand up
under three times the load. This "safety factor" is to allow for
uncertain excesses of load, or unknown extra loads, or weaknesses in the
material that might have unexpected flaws, etc. If now the expected load
comes onto the new bridge, and a crack appears in a beam, this is a
failure of the design. There was no safety factor at all; even though
the bridge did not actually collapse because the crack went only
one-third of the way through the beam. The O-rings of the Solid Rocket
Boosters were not designed to erode. Erosion was a clue that something
was wrong. Erosion was not something from which safety can be inferred.
As Ranum says, Feynman's point here is so profound, yet so artlessly
stated, one really must slow down and read it carefully, perhaps a few
times, before his full meaning can sink in: Something failing partially
that's _not supposed to be able to fail that way at all_ isn't 'safe'.
To apply that insight to the 737 MAX, designing and certifying for
service a passenger aircraft that's dynamically unstable is profoundly
wrong, something the IEEE Spectrum author, Gregory Travis, said
'violated that most ancient of aviation canons'. Papering over that
fundamental hardware flaw with software and calling the result OK
was a gross abandonment of aviation fundamentals, revealing that the
bad airframe design is actually the lesser problem: The fact that such
a thing was proposed, that its nature and flaws were partially concealed
(from pilots, the FAA, and others), and that it was approved is even
more worrisome.
Feynman's quietly understated overall point is that NASA's management
oversight was grossly incompetent and killed the Challenger crew, even
though it meticulously followed (wrong) procedures. Travis echoes
Feynman explicitly (although not naming him) in his IEEE Spectrum article
about the 737 MAX disaster:
I cannot get the parallels between the 737 Max and the space shuttle
Challenger out of my head. The Challenger accident, another textbook
case study in normal failure, came about not because people didn’t
follow the rules but because they did. In the Challenger case, the rules
said that they had to have prelaunch conferences to ascertain flight
readiness. It didn’t say that a significant input to those conferences
couldn’t be the political considerations of delaying a launch. The
inputs were weighed, the process was followed, and a majority consensus
was to launch. And seven people died.
In the 737 Max case, the rules were also followed. The rules said you
couldn’t have a large pitch-up on power change and that an employee of
the manufacturer, a DER, could sign off on whatever you came up with to
prevent a pitch change on power change. The rules didn’t say that the
DER couldn’t take the business considerations into the decision-making
process. And 346 people are dead.
And yes, I've quoted that passage in this thread before, but forgot that
it echoes the _exact same_ point Richard Feynman famously made 33 years
ago. Apparently, we're still re-learning that lesson.
[1] Feynman was a prestige appointee to the 1986 Rogers Commission,
during whose proceedings he found that the Washington committee process
was designed to sweep the reservations of independent-minded critics
under the carpet, and he was pressured not to rock the boat. The
committee was all set to ignore his findings, so Feynman nonchalantly
mentioned that, in that case, he'd just publish his own findings as a
separate report and remove his name from whatever junk the rest of the
committee chose to publish -- the findings of one
of the world's most famous scientists being difficult to ignore.
Chairman and Washington functionary William P. Rogers, recognising an
immovable object when he saw it and not having an irresistible force
handy, let Feynman publish his 'Personal Observations on the Reliability
of the Shuttle' as Appendix F of the Rogers Commission report to
President Reagan, and it's the only part that history has taken
particularly seriously.
Ranum includes, as a footnote to his 'Two Great Articles' piece
(https://www.ranum.com/editorials/must-read/#1), some further comments
Feynman made about the way the committee tried and failed to hustle him,
from a 2005 omnibus book that bundled his prior popular books _Surely
You're Joking, Mr. Feynman!_ and _What Do You Care What Other People
Think?_ with some new material:
https://www.amazon.com/gp/product/0393061329/sr=8-4/qid=1147370262
It turned out that Feynman might not have been as effective a critic
without some carefully low-key help from fellow commission member
General Donald J. Kutyna (USAF), who pointed him in the direction of the
now-infamous O-ring problems. Kutyna in turn said he'd been clued in to
the O-ring problem by fellow commission member and astronaut (and fellow
PhD physicist) Dr. Sally Ride.
https://en.wikipedia.org/wiki/Donald_J._Kutyna#Los_Angeles_Air_Force_Base_and_Space_Shuttle_program
Neither Kutyna nor Ride felt comfortable rocking the boat to stop the
commission from being hustled into a dumb bureaucratic waste of time,
but both were smart enough to trust Feynman to be the necessary bull in
the china shop.
And I should mention another little-appreciated hero: Roger Boisjoly, an
engineer at Morton Thiokol, which manufactured the shuttle's solid
rocket boosters. Months before the Challenger disaster, Boisjoly went to
Morton Thiokol management with strenuous objections to the Challenger
launch, correctly predicting based on earlier flight data that failure
of the O-rings might cause the craft to fail catastrophically if it were
launched in cold weather -- as it was during the disastrous flight
several months later, in January 1986.
Management overruled Boisjoly and his team's dire warnings. To quote
the 2012 NY Times obituary: 'Jerry Mason, Thiokol’s general manager,
told his fellow executives to take off their engineering hats and put on
management hats. They told NASA it was a go.'
Hey, just like Boeing.
His price for being correct was that his colleagues and managers at
Morton Thiokol shunned him and he was transferred away from space work.
The only public figure who ever supported him was Dr. Sally Ride, who
hugged him after his appearance at the Rogers Commission, as a public
show of support.
https://www.nytimes.com/2012/02/04/us/roger-boisjoly-73-dies-warned-of-shuttle-danger.html
Dr. Ride died the same year as Boisjoly (both of cancer) -- two great
Americans, sadly missed. And only one of them had been able to marry
their life partner: Dr. Ride's partner of 27 years was female, and this
was before marriage equality. (Ride was the first known LGBT astronaut.)