[conspire] Feynman's wisdom (was: 737 MAX story keeps getting more fractally bad)
Rick Moen
rick at linuxmafia.com
Sun Jul 14 21:47:01 PDT 2019
I wrote:
> There's something more fundamental that might be getting lost here,
> though: A dynamically unstable passenger jet is just not OK.
[...]
> Attempting to kludge away that gross design defect by making
> software counteract the craft's power-pitch coupling problem is, um,
> nuts. And thus, it needs to be said: The software fix Boeing is
> currently trying to sell will ignore the big underlying problem, that
> a dynamically unstable passenger jet is just not something that should
> ever be approved and taken to market. _That_ is a red flag.
In chatting about this matter with my friend Duncan MacKinnon today,
I was reminded about Nobel physicist Dr. Richard P. Feynman's famous,
timeless independent report[1] on the space shuttle Challenger disaster
-- a profoundly insightful piece whose many fans include top software
security expert and BSD notable Marcus J. Ranum:
Feynman's "Observations on the reliability of the shuttle" should be
required reading for anyone who works in an engineering field, because
it is the very juice squeezed from pearls of wisdom. Clearly and
apparently effortlessly, Feynman weaves together an explanation of how
testing, safety, process control, and engineering discipline fit
together. Think about that for a second. Feynman writes a reference
masterpiece in eleven pages.
When I read Feynman's "Observations" I cannot help but think about
computer security and how virtually everything Feynman has to say about
how _not_ to do engineering, and how to achieve safety, is right there
in black and white. In fact, every time I re-read "Observations" I lose
my temper, because I am confronted with a clear, conclusive argument
that, in computing, we continue to do things that are really, really
stupid. It is ironic that in "Observations" the only kudos Feynman has
to give are to the software development practices NASA used for the
shuttle. Compare his description of that process with how 100% of
commercial code is written today - you may find it instructive.
When the Columbia exploded on re-entry (things moving 12,000mph don't
"break up", they explode) I could hear the ghost of Richard Feynman
yelling "dang it!" from wherever physicists go when they die. In
"Observations", Feynman is careful to point out, in his discussion of
"safety margin", that, if hot gasses are not _designed_ to jet past
and damage an O-ring, then there is no such thing as a "safe" flight
in which that happens _to any degree_. You could take "Observations"
and staple a post-it note to the front of it reading, "...and if the
design doesn't say that heat resistant tiles are _supposed_ to fall
off, then there is no 'safety margin' that justifies flying the
shuttle if they do."
https://www.ranum.com/editorials/must-read/
For those who haven't yet read Feynman's insightful short paper, a
cached copy at Ranum's site is here:
https://www.ranum.com/security/computer_security/editorials/dumb/feynman.html
Here's part of what Ranum is talking about, quoted from Feynman's
eleven pages:
The phenomenon of accepting for flight seals that had shown erosion
and blow-by in previous flights, is very clear. The Challenger flight is
an excellent example. There are several references [in its
certification record and Flight Readiness Reviews] to flights that had
gone before. The acceptance and success of these flights is taken as
evidence of safety. But erosion and blow-by are not what the design
expected. They are warnings that something is wrong. The equipment is
not operating as expected, and therefore there is a danger that it can
operate with even wider deviations in this unexpected and not thoroughly
understood way. The fact that this danger did not lead to a catastrophe
before is no guarantee that it will not the next time, unless it is
completely understood. When playing Russian roulette, the fact that the
first shot got off safely is little comfort for the next. The origin and
consequences of the erosion and blow-by were not understood. They did
not occur equally on all flights and all joints; sometimes more, and
sometimes less. Why not sometime, when whatever conditions determined it
were right, still more leading to catastrophe?
In spite of these variations from case to case, officials behaved as if
they understood it, giving apparently logical arguments to each other
often depending on the "success" of previous flights. For example, in
determining if flight 51-L was safe to fly in the face of ring erosion
in flight 51-C, it was noted that the erosion depth was only one-third
of the radius. It had been noted in an experiment cutting the ring, that
cutting it as deep as one radius was necessary before the ring failed.
Instead of being very concerned that variations of poorly understood
conditions might reasonably create a deeper erosion this time, it was
asserted that there was "a safety factor of three." This is a strange use of
the engineer's term "safety factor." If a bridge is built to withstand
a certain load without the beams permanently deforming, cracking, or
breaking, it may be designed for the materials used to actually stand up
under three times the load. This "safety factor" is to allow for
uncertain excesses of load, or unknown extra loads, or weaknesses in the
material that might have unexpected flaws, etc. If now the expected load
comes onto the new bridge, and a crack appears in a beam, this is a
failure of the design. There was no safety factor at all; even though
the bridge did not actually collapse because the crack went only
one-third of the way through the beam. The O-rings of the Solid Rocket
Boosters were not designed to erode. Erosion was a clue that something
was wrong. Erosion was not something from which safety can be inferred.
As Ranum says, Feynman's point here is so profound, yet so artlessly
stated, one really must slow down and read it carefully, perhaps a few
times, before his full meaning can sink in: Something failing partially
that's _not supposed to be able to fail that way at all_ isn't 'safe'.
To apply that insight to the 737 MAX, designing and certifying for
service a passenger aircraft that's dynamically unstable is profoundly
wrong, something the IEEE Spectrum author, Gregory Travis, said
'violated that most ancient of aviation canons'. Papering over that
fundamental hardware flaw with software and calling the result OK
was a gross abandonment of aviation fundamentals, revealing that the
bad airframe design is actually the lesser problem: The fact that such
a thing was proposed, that its nature and flaws were partially concealed
(from pilots, the FAA, and others), and that it was approved is even
more worrisome.
Feynman's quietly understated overall point is that NASA's management
oversight was grossly incompetent and killed the Challenger crew, even
though it meticulously followed (wrong) procedures. Travis echoes
Feynman explicitly (although not naming him) in his IEEE Spectrum article
about the 737 MAX disaster:
I cannot get the parallels between the 737 Max and the space shuttle
Challenger out of my head. The Challenger accident, another textbook
case study in normal failure, came about not because people didn’t
follow the rules but because they did. In the Challenger case, the rules
said that they had to have prelaunch conferences to ascertain flight
readiness. It didn’t say that a significant input to those conferences
couldn’t be the political considerations of delaying a launch. The
inputs were weighed, the process was followed, and a majority consensus
was to launch. And seven people died.
In the 737 Max case, the rules were also followed. The rules said you
couldn’t have a large pitch-up on power change and that an employee of
the manufacturer, a DER, could sign off on whatever you came up with to
prevent a pitch change on power change. The rules didn’t say that the
DER couldn’t take the business considerations into the decision-making
process. And 346 people are dead.
And yes, I've quoted that passage in this thread before, but forgot that
it echoes the _exact same_ point Richard Feynman famously made 33 years
ago. Apparently, we're still re-learning that lesson.
[1] Feynman was a prestige appointee to the 1986 Rogers Commission,
during whose proceedings he found that the Washington committee process
was designed to sweep the reservations of independent-minded critics
under the carpet, and he was pressured not to rock the boat. The
committee was all set to ignore his findings, so Feynman nonchalantly
mentioned that, in that case, he'd just publish his own findings as a
separate report and remove his name from whatever junk the rest of the
committee chose to publish -- the findings of one
of the world's most famous scientists being difficult to ignore.
Chairman and Washington functionary William P. Rogers, recognising an
immovable object when he saw it and not having an irresistible force
handy, let Feynman publish his 'Personal Observations on the Reliability
of the Shuttle' as Appendix F of the Rogers Commission report to
President Reagan, and it's the only part that history has taken
particularly seriously.
Ranum includes, as a footnote to his 'Two Great Articles' piece
(https://www.ranum.com/editorials/must-read/#1), some further comments
Feynman made about the way the committee tried and failed to hustle him,
from a 2005 omnibus book that bundled his prior popular books _Surely
You're Joking, Mr. Feynman!_ and _What Do You Care What Other People
Think?_ with some new material:
https://www.amazon.com/gp/product/0393061329/sr=8-4/qid=1147370262
It turned out that Feynman might not have been as effective a critic
without some carefully low-key help from fellow commission member
General Donald J. Kutyna (USAF), who pointed him in the direction of the
now-infamous O-ring problems. Kutyna in turn said he'd been clued in to
the O-ring problem by fellow commission member and astronaut (and fellow
PhD physicist) Dr. Sally Ride.
https://en.wikipedia.org/wiki/Donald_J._Kutyna#Los_Angeles_Air_Force_Base_and_Space_Shuttle_program
Neither Kutyna nor Ride felt comfortable rocking the boat to stop the
commission from being hustled into a dumb bureaucratic waste of time,
but both were smart enough to trust Feynman to be the necessary bull in
the china shop.
And I should mention another little-appreciated hero: Roger Boisjoly, an
engineer at Morton Thiokol, which manufactured the shuttle's solid
rocket boosters. Months before the Challenger disaster, Boisjoly went to
Morton Thiokol management with strenuous objections to the Challenger
launch, correctly predicting based on earlier flight data that failure
of the O-rings might cause the craft to fail catastrophically if it were
launched in cold weather -- as it was during the disastrous flight
several months later, in January 1986.
Management overruled Boisjoly and his team's dire warnings. To quote
the 2012 NY Times obituary: 'Jerry Mason, Thiokol’s general manager,
told his fellow executives to take off their engineering hats and put on
management hats. They told NASA it was a go.'
Hey, just like Boeing.
His price for being correct was that his colleagues and managers at
Morton Thiokol shunned him and he was transferred away from space work.
The only public figure who ever supported him was Dr. Sally Ride, who
hugged him after his appearance at the Rogers Commission, as a public
show of support.
https://www.nytimes.com/2012/02/04/us/roger-boisjoly-73-dies-warned-of-shuttle-danger.html
Dr. Ride died the same year as Boisjoly (both of cancer) -- two great
Americans, sadly missed. And only one of them had been able to marry
their life partner: Dr. Ride's partner of 27 years was female, and this
was before marriage equality. (Ride was the first known LGBT astronaut.)