[sf-lug] .signature files collection
Aaron Borden
adborden at live.com
Fri Jan 23 01:03:43 PST 2015
Thanks for sharing the signatures, Rick.
Two thoughts regarding your encoding woes. First, text files don't carry any encoding information. As Daniel pointed out, the utf encodings can start with a byte-order mark, which is different for utf-8, utf-16 LE, utf-16 BE, etc. (though it is optional, and utf-8 files often omit it). All the other 8-bit encodings (and ascii) don't have any indicator. In email and on the web, Content-Types are used to communicate which character set (encoding) the media should be interpreted as.
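To make the byte-order-mark point concrete, here is a minimal Python 3 sketch that checks whether some bytes start with a known BOM. The function name sniff_bom and the sample inputs are made up for illustration; the BOM constants come from the standard codecs module.

```python
import codecs

# Longer BOMs go first: the utf-32 LE BOM begins with the utf-16 LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(raw):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF8 + b"hello"))  # utf-8
print(sniff_bom(b"T\xe1 m'\xe1rthach"))       # None (latin1 bytes carry no marker)
```

A None result says nothing about which 8-bit encoding was used, which is exactly why the Content-Type has to carry that information.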
For example, HTTP sends a Content-Type header for your file:
$ curl -v 'http://linuxmafia.com/pub/humour/sigs-rickmoen-old' > /dev/null
* Hostname was NOT found in DNS cache
* Trying ***...
* Connected to linuxmafia.com (***) port 80 (#0)
> GET /pub/humour/sigs-rickmoen-old HTTP/1.1
> User-Agent: curl/7.39.0
> Host: linuxmafia.com
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Fri, 23 Jan 2015 06:24:34 GMT
< Server: Apache
< Last-Modified: Fri, 23 Jan 2015 02:27:09 GMT
< ETag: "8bc96-18b7f-50d4886819940"
< Accept-Ranges: bytes
< Content-Length: 101247
< Connection: close
< Content-Type: text/plain
<
{ [data not shown]
* Closing connection 0
Notice the "Content-Type: text/plain"? Well, HTTP allows you to specify a charset there as well:
Content-Type: text/plain; charset=utf-8
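As a rough illustration of how a client can read that parameter, here is a small Python 3 sketch that pulls the charset out of a Content-Type value using the standard library's email.message.Message. The helper name charset_of is hypothetical, and real browsers do considerably more than this.

```python
from email.message import Message

def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

print(charset_of("text/plain"))                 # None
print(charset_of("text/plain; charset=utf-8"))  # utf-8
```

The first call corresponds to what the server sends today: no charset, so the client is left to guess.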
I'm not familiar with Apache, but a quick Google search makes me wonder if you have a default charset configured [1]. Without it, your browser will have to guess at what the encoding is and without the BOM, it will probably get this wrong as Daniel pointed out.
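Going by the directive documented at [1], the server-side setting would be a single line in the Apache configuration (or in a .htaccess file, if overrides are permitted). I have not tested this against your setup, so treat it as a sketch:

```apache
AddDefaultCharset utf-8
```

According to that documentation, the directive only adds a charset to text/plain and text/html responses. It only declares an encoding; it does not change the bytes in the file, so the file still has to actually be utf-8.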
Second, I wonder if at some point you mixed two encodings in the same file, or the file was converted with the wrong encoding. I've seen this kind of thing before with MySQL, usually taking a utf-8 encoded string but telling MySQL it's some other 8-bit encoding and then reading it back as utf-8 (basically encoding it improperly). Here's some python to demonstrate:
# From the hexdump of the md5 86e7250690286c96aa7580d0c7f03857 file
>>> data = '\x54\xc3\x83\xc2\xa1\x20\x6d\x27\xc3\x83\xc2\xa1\x72\x74\x68' + \
... '\x61\x63\x68\x20\x66\x6f\x6c\x75\x61\x69\x6e\x65\x61\x63\x68\x20' + \
... '\x6c\xc3\x83\xc2\xa1\x6e\x20\x64\x27\x65\x61\x73\x63\x61\x6e\x6e\x61'
>>> data
"T\xc3\x83\xc2\xa1 m'\xc3\x83\xc2\xa1rthach foluaineach l\xc3\x83\xc2\xa1n d'eascanna"
>>> print data
TÃ¡ m'Ã¡rthach foluaineach lÃ¡n d'eascanna
This is my utf-8 terminal interpreting the byte data as text.
There are four bytes after the "T" to represent your "á", which is usually two bytes in utf-8. From my experience with encodings, this usually means your data was already encoded in an 8-bit encoding and then it was double encoded into utf-8. I would guess that the encoding is latin1 (iso-8859-1), which is also a very common encoding in the US. So let's try to reverse it:
>>> data.decode('utf-8').encode('latin1')
"T\xc3\xa1 m'\xc3\xa1rthach foluaineach l\xc3\xa1n d'eascanna"
Two high bytes, that's better. If we print it in my utf-8 terminal:
>>> print data.decode('utf-8').encode('latin1')
Tá m'árthach foluaineach lán d'eascanna
Note that even though the last python action was to encode to latin1, these bytes are actually the correct utf-8 text. We know this because my utf-8 terminal displays it as the intended text.
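For anyone on Python 3, where bytes and text are separate types, the same round trip can be sketched as follows. This assumes the mistake really was "utf-8 bytes misread as latin1, then encoded to utf-8 again", which is my guess rather than something I can confirm; the variable names are made up for illustration.

```python
original = "Tá m'árthach foluaineach lán d'eascanna"

# The mistake: correct utf-8 bytes are misread as latin1 text,
# and that misread text is then encoded to utf-8 a second time.
utf8_bytes = original.encode("utf-8")
double_encoded = utf8_bytes.decode("latin1").encode("utf-8")
print(double_encoded[:5])  # b'T\xc3\x83\xc2\xa1'

# The repair: undo the two steps in reverse order.
repaired_bytes = double_encoded.decode("utf-8").encode("latin1")
print(repaired_bytes == utf8_bytes)                 # True
print(repaired_bytes.decode("utf-8") == original)   # True
```

The repair only works if every character in the damaged text fits in latin1; if the encode("latin1") step raises an error, the guess about the intermediate encoding was wrong.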
I've attached a text file with the text encoded three ways: first with the old latin1 + utf-8 double encoding, then in utf-8, and finally in latin1. If you open it in Firefox and use View > Character Encoding, you'll see different lines look correct when you choose Unicode (UTF-8) and Western (latin1/iso-8859-1). The original text will never look correct, because encoding something twice (latin1 + utf-8) is not a valid encoding.
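The same experiment can be approximated without a browser. This Python 3 sketch builds the three byte sequences in memory and checks which ones read correctly under each decoding. It is a stand-in for the attachment rather than a copy of it, and the names variants and reads_correctly are hypothetical.

```python
text = "Tá m'árthach foluaineach lán d'eascanna"

# Three byte representations, in the order described for the attachment.
variants = {
    "double": text.encode("utf-8").decode("latin1").encode("utf-8"),
    "utf-8": text.encode("utf-8"),
    "latin1": text.encode("latin1"),
}

def reads_correctly(raw, encoding):
    """True if decoding raw with the given encoding yields the intended text."""
    try:
        return raw.decode(encoding) == text
    except UnicodeDecodeError:
        return False

# Columns: variant, correct when read as utf-8, correct when read as latin1.
for name, raw in variants.items():
    print(name, reads_correctly(raw, "utf-8"), reads_correctly(raw, "latin1"))
# double False False
# utf-8 True False
# latin1 False True
```

The "double" row is the one that matches the original file: neither single choice of encoding recovers the text.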
Sorry for the wordy email, hope the list finds this useful.
-Aaron
[1] http://httpd.apache.org/docs/2.2/en/mod/core.html#adddefaultcharset
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: example_encoding.txt
URL: <http://linuxmafia.com/pipermail/sf-lug/attachments/20150123/18f347cf/attachment.txt>