[conspire] It's running in the Clown! Some call it Cloud.

Michael Paoli Michael.Paoli at cal.berkeley.edu
Sat Apr 23 18:27:48 PDT 2022


It's running in the Clown!  Some call it Cloud.

So, there was of course the recent Atlassian "fun",
for certain definitions of "fun".
About 400 customers, and some of them large/huge customers.
Well, Atlassian (Jira, Confluence, ...), if you were running
it in the Clown, Atlassian had a booboo.  And ... it took 'em
up to about 2 weeks to get those customers operational again:
https://www.atlassian.com/engineering/april-2022-outage-update
https://www.reddit.com/r/sysadmin/comments/u14qqq/atlassian_just_gave_us_an_estimate_on_our_support/
Google news: Atlassian outage (jira OR confluence) (>=2022-04-05):
https://www.google.com/search?q=atlassian+outage+%28jira+OR+confluence%29&tbs=cdr%3A1%2Ccd_min%3A4%2F5%2F2022&tbm=nws

So, ... in the Clown ... one o' those BIG providers.
And of course it's always production.
https, TLS/SSL, certs, load balancers, ... oh what fun we can have.
So, yesterday updated cert ... no biggie, right?
It sits on load balancer.  Lots of virtual hosts (50) behind it,
hey it only handles (hundreds of) millions of simultaneous customers.
The TLS terminates on the load balancer, and that's where the cert is.
Got an older one that expires 3am Monday US/Pacific.  Replaced it
Friday ... all is good, right, ... right?  Doesn't expire for another
year.

But some of us bother to check these things ... not really trusting
Clowns all that much ... even if they're really big ones.
And ... it sort'a kind'a works ... almost.  Load balancer, ... lots of
IPs, lots of traffic.  Well, 3 IPs still serve up the old cert.  That
shouldn't be ... even >24 hrs. after updating, still the case.  Oh,
sure, most of the IPs are fine.  I find at least 26 IPs on that
load balancer ... and all but 3 fine.  Go through all the Clown
configuration bits, status, etc.  Everything claims to be fine ...
but it's not.  DNS gives various IPs ... 3 of which are still
problematic and serving up old cert.  Been dealing with big
Clown provider for many hours now ... they still can't figure out
why it doesn't work.  As far as anyone can tell, all our
Clown configs are fine.  They keep putting more eyes on it and
escalating ... but it's still broken.
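
For anyone wanting to do the same sort of check, here's a rough sketch
of the idea, not the actual tooling used above: resolve the name, connect
to each IP individually (sending the hostname as SNI), grab whatever
cert that IP serves, and compare its fingerprint to the one you expect.
The function names, the hostname and the fingerprint are all made up
for illustration.

```python
import hashlib
import socket
import ssl


def resolve_ips(hostname, port=443):
    """Return the sorted set of addresses DNS hands out right now."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def cert_fingerprint(ip, hostname, port=443, timeout=5.0):
    """Connect to one specific IP, send hostname as SNI, and return the
    SHA-256 hex digest of the DER-encoded leaf cert it serves."""
    ctx = ssl.create_default_context()
    # Verification is off on purpose: the goal is to see whatever cert
    # is served (even an expired one) and compare it ourselves.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=hostname) as tls:
            der = tls.getpeercert(binary_form=True)
    return hashlib.sha256(der).hexdigest()


def stale_ips(fingerprints, expected):
    """Given {ip: fingerprint}, return IPs not serving the expected cert."""
    return sorted(ip for ip, fp in fingerprints.items() if fp != expected)


def check(hostname, expected, port=443):
    """Probe every IP currently in DNS; return the ones that look wrong."""
    fingerprints = {}
    for ip in resolve_ips(hostname, port):
        try:
            fingerprints[ip] = cert_fingerprint(ip, hostname, port)
        except OSError as exc:
            # Unreachable or handshake failure: record it so the IP
            # still shows up as not matching.
            fingerprints[ip] = "error: %s" % exc
    return stale_ips(fingerprints, expected)
```

Something like check("www.example.com", "<sha256 of the new cert>")
would then list the IPs still serving something else.  Since DNS gives
various IPs, a single lookup may not show all of them, so one would
presumably want to repeat the lookup over time and accumulate the set.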

Now, ... if this weren't in the Clown, I could dig in, as necessary
and appropriate to dig and isolate exactly what 'n where the problem
is, and fix it - or even work around it.  But, as it's in the
Clown, the Clown provider has the luxury of "Hey, at least it's
not my problem ... you're still paying us, right?  And our
billionaire that owns us keeps getting richer.".

Yup, ... still not working correctly ... many hours into it and
they're doing more escalation/transfer of the case.

We might have a work-around we could use ... but if we do that, that
may also squash Clown provider's chance to actually figure out
exactly why it's broken and how to come up with a permanent fix
so this kind of breakage doesn't happen again.  Meantime,
hurry-up-and-wait for Clown provider to fix it.

Yes, there are reasons some of us bother to test Clown stuff,
or more generally any 3rd party, etc.
Just because they say it's working and works and all is fine,
doesn't necessarily mean that's actually the case.



