[sf-lug] .signature files collection
Rick Moen
rick at linuxmafia.com
Fri Jan 23 01:58:36 PST 2015
Quoting Aaron Borden (adborden at live.com):
> Two thoughts regarding your encoding woes. First, text files don't carry any
> encoding information.
Quite so.
> As Daniel pointed out, utf will usually have a byte-order mark which is
> different for utf-8, utf-16 LE, utf-16 BE, etc.
However, the information I cited suggested that the byte-order mark
header was actively undesirable for specifically UTF-8.
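
To make the byte-order mark point concrete, here is a minimal Python sketch that reports which encoding, if any, a leading BOM implies. The function name sniff_bom is hypothetical and used only for illustration. A file without a BOM yields None, which is the normal case for UTF-8 text written by most Unix tools.

```python
import codecs

# Known BOM signatures. UTF-32 LE comes before UTF-16 LE because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A return value of None says nothing about the actual encoding; it only means the file carries no BOM.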
> In email and on the web, Content-Types are used to communicate which
> character set (encoding) the media should be interpreted as. For
> example, HTTP sends a Content-Type header for your file:
>
> $ curl -v 'http://linuxmafia.com/pub/humour/sigs-rickmoen-old' > /dev/null
> * Hostname was NOT found in DNS cache
> *   Trying ***...
> * Connected to linuxmafia.com (***) port 80 (#0)
> > GET /pub/humour/sigs-rickmoen-old HTTP/1.1
> > User-Agent: curl/7.39.0
> > Host: linuxmafia.com
> > Accept: */*
> >
> < HTTP/1.1 200 OK
> < Date: Fri, 23 Jan 2015 06:24:34 GMT
> < Server: Apache
> < Last-Modified: Fri, 23 Jan 2015 02:27:09 GMT
> < ETag: "8bc96-18b7f-50d4886819940"
> < Accept-Ranges: bytes
> < Content-Length: 101247
> < Connection: close
> < Content-Type: text/plain
> <
> { [data not shown]
> * Closing connection 0
>
> Notice the "Content-Type: text/plain"? Well, HTTP allows you to specify
> a charset there as well:
>
> Content-Type: text/plain; charset=utf-8
Yes, this is about what I figured it would end up being.
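
For illustration, the difference between the two Content-Type values quoted above can be checked with a short Python sketch using the standard library's email header parser. The helper name declared_charset is hypothetical; this only shows how a client might read the declared charset, not how any particular browser behaves.

```python
from email.message import Message


def declared_charset(content_type: str):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None when
    # the header carries no charset parameter.
    return msg.get_content_charset()
```

With the header the server currently sends ("text/plain"), this returns None, so the client is left to guess. With "text/plain; charset=utf-8" it returns "utf-8".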
Here's the thing: The Web's framework for declaring charsets and
encoding is a bit one-size-fits-all. Any candidate solution in that
area has to meet the Hippocratic criterion of 'First, do no harm.'
FWIW, my site includes a large number of US-ASCII plaintext files in
addition to HTML of various descriptions, and other things.
So far, I'm not even hearing candidate solutions. (This is not a
complaint, just an observation.) To recap, I adopted the RTF measure
because it made the problem go away without doing any harm. So far,
it's meeting spec, and I'm hearing nothing better.
If I had gobs of spare time, I'd be enthusiastically researching the
matter and playing around with settings. Alas, real life means I have
pressing concerns, which is why I instead adopted a measure that made
the problem go away without doing any harm.
I don't have time for new hobbies right now.
> I'm not familiar with Apache, but a quick Google search makes me
> wonder if you have a default charset configured[1].
No. Out of caution. Please note comment lines:
$ cat /etc/apache2/conf.d/charset
# Read the documentation before enabling AddDefaultCharset.
# In general, it is only a good idea if you know that all your files
# have this encoding. It will override any encoding given in the files
# in meta http-equiv or xml encoding tags.
#AddDefaultCharset UTF-8
$
> Without it, your browser will have to guess at what the encoding is and
> without the BOM, it will probably get this wrong as Daniel pointed out.
With it, files will get claimed to be UTF-8 even though they aren't.
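
One way to gauge that risk before touching the setting would be to audit what the files actually contain. The following Python sketch (the function name classify is hypothetical) labels a byte string by attempting strict decodes. It assumes only three outcomes matter: pure US-ASCII, valid UTF-8, or something else.

```python
def classify(data: bytes) -> str:
    """Label bytes as 'ascii', 'utf-8', or 'other' by attempting strict decodes."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: possibly Latin-1, Windows-1252, or binary data.
        return "other"
```

Every valid US-ASCII file is also valid UTF-8, so files in the 'ascii' group would not be mislabelled by a UTF-8 default. Files in the 'other' group would be.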
> Second, I wonder if at some point you mixed two encodings in the same
> file, or the file was converted with the wrong encoding.
{shrug} You are welcome to look.
The file started out being US-ASCII plaintext, and then I realised the
need to migrate to UTF-8 and edited with vim in UTF-8 mode.
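
Anyone who does want to look could start from something like this Python sketch, which lists the byte offsets at which strict UTF-8 decoding fails. The function name invalid_utf8_offsets is hypothetical. An empty result means the bytes are valid UTF-8 throughout; it cannot detect text that was converted from the wrong source encoding but still happens to be valid UTF-8.

```python
def invalid_utf8_offsets(data: bytes):
    """Return the byte offsets at which strict UTF-8 decoding fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice being decoded.
            offsets.append(pos + err.start)
            pos += err.end  # resume just past the offending bytes
    return offsets
```

Applied to the contents of a file read in binary mode, any offsets returned point at bytes worth inspecting in a hex viewer.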
> Sorry for the wordy email, hope the list finds this useful.
Thank you for the thoughts.