[conspire] A Beautiful Unicode Rant

Rick Moen rick at linuxmafia.com
Fri Jun 10 17:09:06 PDT 2011


I wrote:

> I see nothing wrong with
> UTF-8 qua UTF-8; I merely don't have any particular use for it, and get
> a bit surly about relentless advocacy for it as supposedly a huge
> improvement, when in fact it offers no advantage.
[...]
> And if I needed to encode Unicode into multiple 8-bit bytes, that's what
> I'd use, too.  Lacking that need, I also don't need the overengineering
> required (solely) for it.

It's vexing when you find out that the relentless advocacy has invaded
your _toolsets_.  I recently found one such, and its fix, which I am
providing here pro bono publico:


Ed Cherlin posted earlier the (rather uselessly vague) observation that
some (totally unspecified) Web pages among the vast number on my Web
server fail (totally unspecified) validation tests.

In addition to spot-checking (some more) with the online W3C HTML and
CSS validators, I checked a few of my locally generated Web pages' HEAD
sections manually.  Hmm, look here:

<head>
  <meta name="generator" content=
  "HTML Tidy for Linux/x86 (vers 1st March 2004), see www.w3.org">
  <title>Kudzu and the Marriage Amendment</title>
<link href="template_css.css" rel="stylesheet" type="text/css">
<meta http-equiv="Content-type" content="text/html;charset=UTF-8">
</head>

Eh, what?  Sonofabitch, it's HTML Tidy doing that 'charset=UTF-8' thing
_by default_.  I tend to run all of my locally-made pages through
Tidy.  Deirdre found for me a reference in the online docs for the
software:

 http://tidy.sourceforge.net/docs/quickref.html#char-encoding

That's a bit vague, but it turns out that /etc/tidy.conf or ~/.tidyrc
(or whatever demented alternative the software's compiled to use) can
contain a line of the form

  char-encoding: [foo]

So:  

   char-encoding: latin0

'latin0' is a somewhat inaccurate alias for ISO 8859-15, the variant of
8859-1 (latin1) as amended to include the Euro symbol.  (You can put
'latin1' there, if you prefer plain Latin1 encoding.)

Anyway, I found and fixed a few new-ish validation problems that were
obviously _introduced_ by HTML Tidy's insistence on UTF-8 for everyone
by default -- by substituting 'charset=ISO-8859-1' and revalidating.
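
For anyone with more than a handful of pages to touch up, the
substitution could be scripted.  What follows is only a minimal sketch
in Python, not what was actually run here: the names CHARSET_RE and
declare_latin1 are made up for illustration, and it assumes the
declaration looks like the Tidy-generated one shown above.

```python
import re

# Matches the charset value inside a Content-Type meta declaration, e.g.
#   content="text/html;charset=UTF-8"
# The pattern is an assumption based on the Tidy output quoted earlier.
CHARSET_RE = re.compile(
    r'(content\s*=\s*["\']text/html;\s*charset=)utf-?8',
    re.IGNORECASE,
)


def declare_latin1(html: str, charset: str = "ISO-8859-1") -> str:
    """Return html with a declared UTF-8 charset replaced by `charset`.

    Only the label in the meta tag is changed; the page's bytes are
    left alone, so the content must really be in the new encoding.
    """
    return CHARSET_RE.sub(lambda m: m.group(1) + charset, html)


if __name__ == "__main__":
    page = '<meta http-equiv="Content-type" content="text/html;charset=UTF-8">'
    print(declare_latin1(page))
```

Note that this relabels the page and nothing more.  If a file really
does contain multi-byte UTF-8 sequences, changing the label will make
it display wrongly, so revalidate afterwards.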


Going forward, I may install and use Debian package wdg-html-validator,
which is the Web Design Group's (WDG) HTML validator, with CGI and
command-line interfaces, to recurse through linuxmafia.com's pages and
log whatever other validation problems exist.
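
In the meantime, a cruder check is possible without any validator at
all: walk the document tree and list the pages that still declare
UTF-8.  The following is a rough Python sketch under stated
assumptions, not a substitute for real validation.  The function name
find_utf8_declarations, the "*.html" filename pattern, and the 4096-byte
cutoff for finding the HEAD section are all arbitrary choices made for
illustration.

```python
import re
from pathlib import Path

# A meta tag whose charset is declared as UTF-8 (or "utf8").
# Works on bytes, so files need not be decoded first.
META_UTF8_RE = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?utf-?8', re.IGNORECASE)


def find_utf8_declarations(root):
    """Return a sorted list of *.html paths under root declaring UTF-8."""
    hits = []
    for path in sorted(Path(root).rglob("*.html")):
        # The declaration lives in the HEAD section, so reading the
        # first few kilobytes is assumed to be enough.
        head = path.read_bytes()[:4096]
        if META_UTF8_RE.search(head):
            hits.append(path)
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_utf8.py /path/to/docroot
    for hit in find_utf8_declarations(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(hit)
```

This only finds the one declaration discussed above; any other
validation problems still need a proper validator run.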




