[conspire] A Beautiful Unicode Rant
Rick Moen
rick at linuxmafia.com
Fri Jun 10 16:02:26 PDT 2011
Quoting Nick Moffitt (nick at zork.net):
> From that post comes this lovely bit of Rickbait:
> > Code that assumes that ASCII is good enough for writing English
> > properly is stupid, shortsighted, illiterate, broken, evil, and wrong.
> > Off with their heads! If that seems too extreme, we can compromise:
> > henceforth they may type only with their big toe from one foot (the
> > rest still be ducktaped).
I sincerely hope you don't think I disagree with Tom Christiansen on
_that_. 127 characters and a null was all we had on our 1970s Teletype
ASR33 terminals, but everyone knew it was pretty grim. Thus the need
for filling it out by using the eighth bit. European languages finally
got a proper standard solution with ISO 8859-1 and variants (8859-5 for
Cyrillic, 8859-2 for Slavic tongues in the Latin alphabet, etc.).
As far as I was concerned, that meant the problem was _fixed_, for
speakers of European languages. Mucking about further means imposing a
solution in the absence of a problem. Thus, I see nothing wrong with
UTF-8 qua UTF-8; I merely don't have any particular use for it, and get
a bit surly about relentless advocacy for it as supposedly a huge
improvement, when in fact it offers no advantage. (I don't _need_
multibyte just for English/French/etc. text. One byte works great.
It's not broken. I don't need it fixed. Thanks but no thanks. Please
back away slowly with your Tengwar epic poems and go help someone else.)
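To put a number on the one-byte point, here is a minimal Python sketch. The sample string is an assumption chosen purely for illustration; any Western European text would behave the same way.

```python
# Illustrative sample text (hypothetical); any Western European string would do.
text = "déjà vu, señor"

latin1 = text.encode("iso-8859-1")   # one byte per character
utf8 = text.encode("utf-8")          # each accented letter takes two bytes

print(len(text), len(latin1), len(utf8))

# Pure ASCII text is byte-for-byte identical in both encodings.
print("plain English".encode("iso-8859-1") == "plain English".encode("utf-8"))
```

The difference only shows up on the accented letters, which is the whole of the cost (and the whole of the benefit) for text in these languages.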
And that advocacy was rooted in rah-rah for Unicode, which is _grossly_
inappropriate as a solution for European languages, and creates huge
headaches that cannot remotely be justified in languages that don't need
it. Tom Christiansen ably described some of the intractable code problems
from 'Unicode by default' efforts. I particularly liked:
All your Perl code involving a-z or A-Z and such MUST BE CHANGED[...]
Code that assumes there are only two cases is broken. There's also
titlecase.
Code that assumes you can remove diacritics to get at base ASCII
letters is evil, still, broken, brain-damaged, wrong, and justification
for capital punishment.
Code that assumes dash, hyphens, and minuses are the same thing as
each other, or that there is only one of each, is broken and wrong.
Code that assumes that characters which look alike _are_ alike is broken.
Code that assumes that characters which do _not_ look alike are _not_
alike is broken.
Code that tries to reduce Unicode to ASCII is not merely wrong, its
perpetrator should never be allowed to work in programming again.
Period. I'm not even positive they should even be allowed to see
again, since it obviously hasn't done them much good so far.
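Several of those points can be checked directly. What follows is a rough Python sketch rather than anything from Tom's page; the specific characters are my own picks, and it leans only on the standard unicodedata module.

```python
import unicodedata

# Three cases, not two: U+01C6 has distinct uppercase and titlecase forms.
dz = "\u01c6"                      # LATIN SMALL LETTER DZ WITH CARON
print(dz.upper(), dz.title())      # U+01C4 versus U+01C5

# Hyphen-minus, hyphen, en dash and minus sign are four separate characters.
dashes = ["\u002d", "\u2010", "\u2013", "\u2212"]
print([unicodedata.name(c) for c in dashes])

# Look alike, compare unequal: precomposed e-acute versus e + combining acute.
composed, decomposed = "\u00e9", "e\u0301"
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)

# Do not look alike, yet fold together: final sigma versus medial sigma.
print("\u03c2".casefold() == "\u03c3".casefold())
```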
Wow, casefolding breaks, alphabetisation becomes a UN committee
exercise, semantically different text becomes common that's visually
indistinguishable, and that's _just the start of the headaches_.
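For the casefolding and alphabetisation headaches specifically, a small sketch in Python (the sample words are assumptions, picked only because they show the effect):

```python
# Case mapping can change length: German sharp s uppercases to two letters.
word = "Straße"
print(word.upper(), len(word), len(word.upper()))

# lower() and casefold() disagree, so "case-insensitive" needs a definition.
print(word.lower() == "strasse", word.casefold() == "strasse")

# Sorting by raw code point files an accented initial after 'z'.
print(sorted(["zebra", "école", "apple"]))
```

Getting dictionary order back takes locale-aware collation, and which order counts as correct depends on the language doing the sorting.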
I remember a decade ago, Eric Raymond attempted to convince me that
Unicode was going to improve things greatly, and I believe the
particular example was domain FQDNs. I said just one thing: 'Hey,
that'll sure expose some new and glorious bugs and security meltdowns.'
To his credit, Eric sat and pondered that for a couple of minutes and
said 'Um, yes, I believe you're right.'
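The sort of bug I had in mind can be sketched in a few lines of Python. The domain below is hypothetical, and the example assumes Python's built-in "idna" codec, which implements the older IDNA 2003 rules.

```python
# Hypothetical look-alike domain: the third letter is Cyrillic U+0430, not Latin 'a'.
spoof = "ex\u0430mple.com"
real = "example.com"

print(spoof == real)            # the strings differ, though most fonts render them alike
print(spoof.encode("idna"))     # ASCII-compatible form, carrying the xn-- prefix
print(real.encode("idna"))      # a plain ASCII name passes through unchanged
```

Two names that a human cannot tell apart resolve to entirely different ASCII hostnames on the wire.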
> It seems largely to be addressing people who are confused about unicode
> vs its various encodings. It also seems to be pro-utf8 but full of gar
> about anyone who's used anything else to encode unicode.
Although it's perfectly fine by me if Tom Christiansen likes UTF-8 (it
takes no cash out of my pocket), I can't see even a single thing on the
cited page that indicates that he does. He simply says that his 'Boilerplate
for Unicode-Aware Code' uses that encoding -- which makes perfect sense,
because that's pretty much what it's actually _for_ -- for encoding all of
Unicode into railroad trains of multiple 8-bit bytes.
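For anyone who hasn't looked inside the railroad trains, a minimal Python sketch follows. The four sample characters are assumptions, chosen to cover each train length.

```python
# Illustrative samples: ASCII 'A', e-acute, euro sign, musical G clef.
for ch in ["A", "\u00e9", "\u20ac", "\U0001d11e"]:
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {bits}")
```

The lead byte's high bits announce how long the train is, and every following byte begins with the bits 10.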
And if I needed to encode Unicode into multiple 8-bit bytes, that's what
I'd use, too. Lacking that need, I also don't need the overengineering
required (solely) for it.