[conspire] Y2038 reminder

Rick Moen rick at linuxmafia.com
Wed Jan 22 17:15:43 PST 2020


A Twitter thread is reproduced below, but first some context:  

Unix timestamps are stored as a signed integer, for which the beginning
of epoch is...

:r! date -u -d @0

Thu Jan  1 00:00:00 UTC 1970

(Presumably, this was a revision from the initial handling of dates in
primordial Unix, since Thompson and Ritchie made Unix first run in August
1969 on a DEC PDP-7 -- which, weirdly, was an 18-bit machine, BTW.[1])  The
standard date and time functions for C on Unix are part of libc, and
report time as a value of type time_t: the number of seconds since
beginning of epoch.  So, you might ask, when's the end of epoch?
Answer:  It
depends -- on how big an integer may be.  The Unix world got very
accustomed to _32-bit_ computing starting with the DEC VAX (mid-'70s),
Motorola 68k series (a few years after), Intel IA-32, SPARC, MIPS, PPC,
HP PA-RISC, etc.  
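
Here's a minimal C sketch, using nothing but standard libc calls, that
shows time_t in action -- the current count, its human-readable form,
and how wide the type is on your machine:

#include <stdio.h>
#include <time.h>

int main(void)
{
    time_t now = time(NULL);         /* seconds since 1970-01-01 00:00:00 UTC */
    struct tm *utc = gmtime(&now);   /* broken-down UTC form of that count */

    printf("time_t now:   %lld\n", (long long)now);
    printf("which is:     %s", asctime(utc));
    printf("time_t width: %zu bytes\n", sizeof(time_t));
    return 0;
}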

On 32-bit, an _unsigned_ integer may be as large as 2^32 - 1 =
4,294,967,295, but a _signed_ integer can be only as large as 2^(32-1) - 1 =
2,147,483,647.  So, on 32-bit Unix, the _end_ of epoch is:

:r! date -u -d @2147483647
Tue Jan 19 03:14:07 UTC 2038

So, on 32-bit Unix (including 32-bit Linux), dates roll over early on
the morning of January 19, 2038, erroneously restarting the clock at
2^31 seconds _before_ beginning of epoch, which is...

:r! date -u -d @-2147483648
Fri Dec 13 20:45:52 UTC 1901
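
You don't have to wait until 2038 to watch the wraparound.  Here's a
small C sketch that simulates a 32-bit time_t; the cast through
uint32_t is the two's-complement wrap that real 32-bit code performs
silently.  (This assumes your libc's gmtime() accepts negative times,
as glibc's does.)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static void show(int32_t secs)
{
    time_t t = secs;   /* widen to this system's (presumably 64-bit) time_t */
    printf("%11" PRId32 " -> %s", secs, asctime(gmtime(&t)));
}

int main(void)
{
    int32_t last = INT32_MAX;                         /* 2038-01-19 03:14:07 UTC */
    int32_t wrapped = (int32_t)((uint32_t)last + 1);  /* two's-complement wrap */

    show(last);
    show(wrapped);                                    /* 1901-12-13 20:45:52 UTC */
    return 0;
}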

'Oh, but 18 years in the future is a long ways off' is a common
reaction -- which is true until you try to do (with a 32-bit-compiled
Unix program) _anything involving dates past end of epoch_ -- like, say,
a mortgage amortisation table for a 20- or 30-year mortgage.  Oops.
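
That failure mode is easy to demonstrate: standard mktime() returns
(time_t)-1 when handed a calendar date it cannot represent, which is
exactly what happens for post-2038 dates wherever time_t is still 32
bits.  A sketch (on a 64-bit time_t build it succeeds instead, which is
the point):

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* Final payment of a 30-year mortgage opened around 2020. */
    struct tm due = {0};
    due.tm_year  = 2050 - 1900;  /* struct tm counts years from 1900 */
    due.tm_mon   = 0;            /* January (months count from 0)    */
    due.tm_mday  = 19;
    due.tm_isdst = -1;           /* let mktime() work out DST        */

    time_t t = mktime(&due);     /* local time; -1 if unrepresentable */
    if (t == (time_t)-1)
        printf("2050-01-19 is not representable in this time_t\n");
    else
        printf("2050-01-19 is %lld seconds after epoch\n", (long long)t);
    return 0;
}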


Recompile and run on 64-bit Unix for a much more spacious time_t, you
say.  Good idea.  Talk about kicking the problem down a long, long
road.  Here's the revised end-of-epoch calculation:

2^(64-1) - 1 = 9,223,372,036,854,775,807 seconds since beginning of
epoch.

Assuming 365.2425 days per year on average, there are 365.2425 × 24 × 60
× 60 = 31,556,952 seconds/year.  (A bit more in leap years, a bit less
in non-leap years, but we're about to average over a rather large number
of future years, so it evens out.)

9,223,372,036,854,775,807 total seconds since the 1970 AD epoch divided by
31,556,952 average seconds/year = 292,277,024,626.927714913658 years,
or the year 292,277,026,596 AD -- kicking the problem 292 billion years
down the road.  For comparison, the sun will run out of hydrogen around
4-5 billion years from now.
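
Same arithmetic as a one-off C check, using the figures above:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const int64_t max_secs      = INT64_MAX;  /* 9,223,372,036,854,775,807 */
    const int64_t secs_per_year = 31556952;   /* 365.2425 days x 86,400 s  */

    int64_t years = max_secs / secs_per_year;      /* years past 1970 */
    printf("%lld years past 1970 = year %lld AD\n",
           (long long)years, (long long)(years + 1970));
    return 0;
}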


But, problem:  There's still a lot of old code, and 32-bit CPUs, out
there.  https://en.wikipedia.org/wiki/Year_2038_problem




https://twitter.com/stderrdk/status/1219235747754057733

  John Feminella @jxxf 1:30 PM · Jan 19, 2020

  As of today, we have about eighteen years to go until the Y2038 problem
  occurs.

  But the Y2038 problem will be giving us headaches long, long before 2038
  arrives.

  I'd like to tell you a story about this.




  One of my clients is responsible for several of the world's top 100
  pension funds.

  They had a nightly batch job that computed the required contributions,
  made from projections 20 years into the future.

  It crashed on January 19, 2018 — 20 years before Y2038.




  No one knew what was wrong at first.

  This batch job had never, ever crashed before, as far as anyone
  remembered or had logs for.

  The person who originally wrote it had been dead for at least 15 years,
  and in any case hadn't been employed by the firm for decades.


  The program was not that big, maybe a few hundred lines.

  But it was fairly impenetrable — written in a style that favored
  computational efficiency over human readability.

  And of course, there were zero tests.




  As luck would have it, a change in the orchestration of the scripts that
  ran in this environment had been pushed the day before.

  This was believed to be the culprit. Engineering rolled things back to
  the previous release.

  Unfortunately, this made the problem worse.




  You see, the program's purpose was to compute certain contribution rates
  for certain kinds of pension funds.

  It did this by writing out a big CSV file. The results of this CSV file
  were inputs to other programs.

  Those ran at various times each day.




  Another program, the benefits distributor, was supposed to alert people
  when contributions weren't enough for projections.

  It hadn't run yet when the initial problem occurred. But it did now.




  Noticing that there was no output from the first program since it had
  crashed, it treated this case as "all contributions are 0".

  This, of course, was not what it should do.

  But no one knew it behaved this way since, again, the first program had
  never crashed.




  This immediately caused a massive cascade of alert emails to the
  internal pension fund managers.

  They promptly started flipping out, because one reason contributions
  might show up as insufficient is if projections think the economy is
  about to tank.




  The firm had recently moved to the cloud and I had been retained to
  architect the transition and make the migration go smoothly.

  They'd completed the work months before. I got an unexpected text from
  the CIO:

     CIO:  sorry to bother you, we have a huge problem
     CIO:  s1X.  can you fly in this afternoon?




  S1X is their word for "worse than severity 1 because it's cascading
  *other* unrelated parts of the business". 

  There had only been one other S1X in twelve months.




  I got onsite late that night. We eventually diagnosed the issue by
  firing up an environment and isolating the script so that only it was
  running.

  The problem immediately became more obvious; there was a helpful error
  message that pointed to the problematic part.




  We were able to resolve the issue by hotpatching the script.

  But by then, substantive damage had already been done because
  contributions hadn't been processed that day.

  It cost about $1.7M to manually catch up over the next two weeks.




  The moral of the story is that Y2038 isn't "coming".

  It's *already here*. Fix your stuff.




  Postscript: there's lots more that I think would be interesting to say
  on this matter that won't fit in a tweet.

  If you're looking for speakers at your next conference on this topic,
  I'd be glad to expound further. I don't want to be cleaning up more
  Y2038 messes!


He doesn't quite say what 'hotpatching the script' means, but I'm
guessing it's some indelicate kludge that in effect moves the beginning
of epoch to a later date without really solving the problem -- an
expedient choice to get back into production quickly, since the
alternative is rewriting and re-testing.  (He says the catch-up work
cost them $1.7M, as it is.)



[1] https://en.wikipedia.org/wiki/Unix_time#History says primordial
'Unix time had a 32-bit integer incrementing at a rate of 60 Hz, which
was the rate of the system clock on the hardware of the early Unix
systems', and the first edition of the Unix Programmer's Manual defines
the Unix time as 'the time since 00:00:00, 1 January 1971, measured in
sixtieths of a second'.  (At 60 Hz, even a 32-bit counter wraps after
2^32 / 60 = about 71.6 million seconds, i.e., roughly 2.27 years.)
This was then rejiggered a couple of times, to yield the current setup.


