[sf-lug] .signature files collection

Rick Moen rick at linuxmafia.com
Fri Jan 23 01:58:36 PST 2015


Quoting Aaron Borden (adborden at live.com):

> Two thoughts regarding your encoding woes. First, text files don't
> carry any encoding information.

Quite so.

> As Daniel pointed out, utf will usually have a byte-order mark which
> is different for utf-8, utf-16 LE, utf-16 BE, etc.

However, the information I cited suggested that the byte-order mark
header was actively undesirable specifically for UTF-8.
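
For anyone who wants to check a particular file, here is a minimal
Python sketch of BOM sniffing.  The function name sniff_bom and the
table layout are illustrative only; the BOM constants themselves come
from Python's standard codecs module.

```python
import codecs

# Known byte-order marks.  The UTF-32 entries come first because the
# UTF-32 LE mark begins with the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A file with no BOM simply yields None, which says nothing either way
about what encoding it is in.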

> In email and on the web, Content-Types are used to communicate which
> character set (encoding) the media should be interpreted as.  For
> example, HTTP sends a Content-Type header for your file:
>
> $ curl -v 'http://linuxmafia.com/pub/humour/sigs-rickmoen-old' > /dev/null
> * Hostname was NOT found in DNS cache
> *   Trying ***...
> * Connected to linuxmafia.com (***) port 80 (#0)
> > GET /pub/humour/sigs-rickmoen-old HTTP/1.1
> > User-Agent: curl/7.39.0
> > Host: linuxmafia.com
> > Accept: */*
> >
> < HTTP/1.1 200 OK
> < Date: Fri, 23 Jan 2015 06:24:34 GMT
> < Server: Apache
> < Last-Modified: Fri, 23 Jan 2015 02:27:09 GMT
> < ETag: "8bc96-18b7f-50d4886819940"
> < Accept-Ranges: bytes
> < Content-Length: 101247
> < Connection: close
> < Content-Type: text/plain
> <
> { [data not shown]
> * Closing connection 0
>
> Notice the "Content-Type: text/plain"?  Well, HTTP allows you to
> specify a charset there as well:
>
> Content-Type: text/plain; charset=utf-8

Yes, this is about what I figured it would end up being.
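
As a rough illustration of what a client sees in each case, here is a
small Python sketch that pulls the charset parameter out of a
Content-Type value.  The helper name charset_of is made up for this
example; it leans on the standard library's email.message.Message
parser rather than on anything a browser actually does.

```python
from email.message import Message


def charset_of(content_type: str):
    """Return the lowercased charset parameter, or None if none is given."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# The header the server sends today carries no charset at all.
print(charset_of("text/plain"))
# The variant with an explicit charset parameter.
print(charset_of("text/plain; charset=utf-8"))
```

When the result is None, the client is left to guess.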

Here's the thing:  The Web's framework for declaring charsets and
encoding is a bit one-size-fits-all.  Any candidate solution in that
area has to meet the Hippocratic criterion of 'First, do no harm.'
FWIW, my site includes a large number of US-ASCII plaintext files in
addition to HTML of various descriptions, and other things.

So far, I'm not even hearing candidate solutions.  (This is not a
complaint, just an observation.)  To recap, I adopted the RTF measure
because it made the problem go away without doing any harm.  So far,
it's meeting spec, and I'm hearing nothing better.

If I had gobs of spare time, I'd be enthusiastically researching the
matter and playing around with settings.  Alas, real life means I have
pressing concerns, which is why I instead adopted a measure that made
the problem go away without doing any harm.

I don't have time for new hobbies right now.
 

> I'm not familiar with Apache, but a quick Google search makes me
> wonder if you have a default charset configured[1].

No.  Out of caution.  Please note comment lines:

$ cat /etc/apache2/conf.d/charset 
# Read the documentation before enabling AddDefaultCharset.
# In general, it is only a good idea if you know that all your files
# have this encoding. It will override any encoding given in the files
# in meta http-equiv or xml encoding tags.

#AddDefaultCharset UTF-8
$ 


> Without it, your browser will have to guess at what the encoding is
> and without the BOM, it will probably get this wrong as Daniel pointed
> out.

With it, files will get claimed to be UTF-8 even though they aren't.
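
One way to find out which files such a blanket setting would mislabel
is to try decoding them.  What follows is only a sketch under that
assumption: the function name classify and the three labels are
invented for illustration, and it makes no attempt to identify what a
non-UTF-8 file actually is.

```python
def classify(data: bytes) -> str:
    """Label raw bytes as 'ascii', 'utf-8', or 'other'."""
    try:
        data.decode("ascii")
        return "ascii"  # 7-bit bytes only; these also decode as UTF-8
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: a UTF-8 label on this file would be wrong.
        return "other"
```

Feeding it the bytes of each served file (for example, the result of
open(path, "rb").read()) would show which ones land in the 'other'
bucket.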

> Second, I wonder if at some point you mixed two encodings in the same
> file, or the file was converted with the wrong encoding.

{shrug}  You are welcome to look.

The file started out being US-ASCII plaintext, and then I realised the
need to migrate to UTF-8 and edited with vim in UTF-8 mode.
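
For anyone who does want to look, a quick Python sketch along these
lines would list the lines of a file that fail to decode as UTF-8,
which is where leftovers from some other encoding would show up.  The
function name bad_utf8_lines is hypothetical, and nothing here has been
run against the file in question.

```python
def bad_utf8_lines(data: bytes):
    """Return 1-based numbers of lines whose bytes are not valid UTF-8."""
    bad = []
    # Splitting on the newline byte is safe: 0x0A never occurs inside
    # a multi-byte UTF-8 sequence.
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad
```

An empty list means the whole file decodes cleanly as UTF-8; it does
not prove that every character is the one the author intended.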

> Sorry for the wordy email, hope the list finds this useful.

Thank you for the thoughts.



