To users of the Doc Reader program.
PilotDOC available at http://www.concentric.net/~rbram/doc.shtml

Over the next few days, you will see the introduction of compression to
the Doc Reader. This compression usually yields data files which are
53%-58% of their original size. The compression is optimized for
readable text; it is indifferent to upper and lower case.

I wrote the code that is going to be included in the DocReader, and I
wanted to address some issues with you, the user. From recent usenet
and list traffic, I know some of you are expecting files that are
15%-20% of their original size. I was not able to achieve these
figures, nor do I believe it is possible. The rest of this note is a
discussion of why.

As is often seen in comp.compression, discussions of this sort can turn
inflammatory. This note is intended to set the ground rules for the
expected inferno.
------------------------------------------------------

Pilot Environment
-----------------
In the Pilot, we only have a tiny bit of working memory. There is lots
of bulk memory, but it is usually protected from writes. As a rule, 4k
is always available, and 10k-12k usually is. The Pilot microprocessor,
while decent, is not ultra powerful. We needed to develop a coding
scheme that runs quickly, without deep recursion. All told, we defined
this environment:

  compressor:    no memory, processor, or speed limits
  decompressor:  less than .1s using a 2 MIPS processor
                 requires less than 5k of working memory
                 recurses less than 10 calls deep
                 builds to less than 200 bytes of code
                 handles 4k blocks of data, each independent
                 handles accented chars & tabs

This memory limitation precludes the use of standard adaptive Huffman
coding. These codings usually require 40k to 128k of memory to hold
the tree data.

Other Compressors
-----------------
The industry standard PKZIP compresses by first scanning the data for
repetitive patterns (sliding window), merging these down to special
codes, and then running adaptive Huffman compression on the total.
When the text base is large, both of these techniques benefit. Large
text files do, in fact, shrink to 35% of their original size. Smaller
files, of 4k or so, usually shrink to 50% of their original size.

The sliding window mechanism does not require much memory to decode,
so I use it. The adaptive Huffman mechanism does use lots, so I have
chosen a trivial subset of Huffman (one char; see below).

There have been compressors written since PKZIP which slightly improve
on the compression. Almost all of them use the same techniques. A
couple of recent ones used a sort-and-Huffman technique, which is also
memory intensive.

This Compression
----------------
This compression is best described by defining the decode process.
Given a compressed stream, read a byte. The byte will lie in one of the
following zones:

  00        represents itself
  01..08    type A command; read the next n bytes
  09..7F    represents itself
  80..BF    type B command; read one more byte
  C0..FF    type C command; represents "space + char"

The codes 00 and 09..7F simply represent themselves; copy the byte to
the decode buffer.

The type A codes indicate that the next n bytes are a string of raw
data. The value of n is 1 to 8, exactly the value of the type A code.
This code is used to wrap accented and high characters.

The type B codes indicate a sliding window sequence. Read the next byte
after the command code, and join the two to form a 16-bit number. Throw
away the top two bits; they should be 10 anyway. Take the next 11 bits
and form a number m from 0 to 2047 (m==0 should never occur). Take the
last 3 bits and form a number n from 0 to 7. Reach back into the data
that you have decoded already, by m bytes. Copy (n+3) bytes from the
old data address to the new data address.

The type C codes indicate a special sequence of "space plus char". The
code that you read as the type C code represents a space character,
followed by the ASCII char formed by the lower 7 bits of the code.
[This is a single-byte code.]
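To make the decode rules concrete, here is a minimal C sketch of the
decode loop described above, plus a small hand-packed test. The
function name doc_decode_block, the bounds checks, and the sample data
are my own illustration; they are not taken from the PilotDOC sources.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Minimal sketch of the decode loop described above. Decodes one
   independent block of compressed data into out[], returning the
   number of decoded bytes. */
size_t doc_decode_block(const uint8_t *in, size_t in_len,
                        uint8_t *out, size_t out_cap)
{
    size_t i = 0, o = 0;

    while (i < in_len && o < out_cap) {
        uint8_t c = in[i++];

        if (c == 0x00 || (c >= 0x09 && c <= 0x7F)) {
            out[o++] = c;                       /* the byte represents itself */
        } else if (c <= 0x08) {
            while (c-- && i < in_len && o < out_cap)
                out[o++] = in[i++];             /* type A: c raw bytes follow */
        } else if (c <= 0xBF) {
            /* type B: 16-bit code 10mmmmmmmmmmmnnn
               m = distance back into decoded data, n+3 = bytes to copy */
            if (i >= in_len)
                break;                          /* truncated stream */
            uint16_t pair = (uint16_t)((c << 8) | in[i++]);
            size_t m = (pair >> 3) & 0x07FF;
            size_t n = (pair & 0x0007) + 3;
            while (n-- && o < out_cap) {
                out[o] = out[o - m];            /* overlapping copy is fine */
                o++;
            }
        } else {
            out[o++] = ' ';                     /* type C: space ... */
            if (o < out_cap)
                out[o++] = (uint8_t)(c & 0x7F); /* ... plus low 7 bits */
        }
    }
    return o;
}

int main(void)
{
    /* Hand-packed test: six literals "hello ", then type B code 0x8037
       (m = 6, n = 7, i.e. copy 10 bytes starting 6 bytes back), then a
       final literal 'o'. Nine compressed bytes decode to 17 bytes. */
    const uint8_t packed[] = { 'h', 'e', 'l', 'l', 'o', ' ', 0x80, 0x37, 'o' };
    uint8_t text[32];
    size_t n = doc_decode_block(packed, sizeof packed, text, sizeof text);
    printf("%.*s (%u bytes)\n", (int)n, text, (unsigned)n);
    return 0;
}

Running this prints "hello hello hello (17 bytes)". Note that the
type B copy deliberately overlaps the output it is producing, which is
what lets a two-byte code expand to ten bytes of text.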
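Going the other way is not described in this note, since the compressor
runs off the Pilot with no resource limits. Purely as an illustration,
here is one possible greedy encoder that produces a stream the sketch
above accepts. It is not the PilotDOC compressor; the function name and
the brute-force match search are my own, and only show how the four
code types might be chosen.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical greedy encoder; not the PilotDOC compressor. 'out' must
   be large enough for the worst case (about in_len + in_len/8 + 1). */
size_t doc_encode_block(const uint8_t *in, size_t in_len, uint8_t *out)
{
    size_t i = 0, o = 0;

    while (i < in_len) {
        /* Look for the longest match (3..10 bytes) within the previous
           2047 bytes of this block: a candidate type B code. */
        size_t best_len = 0, best_dist = 0;
        size_t max_back = (i < 2047) ? i : 2047;
        for (size_t d = 1; d <= max_back; d++) {
            size_t len = 0;
            while (len < 10 && i + len < in_len &&
                   in[i + len] == in[i - d + len])
                len++;
            if (len > best_len) { best_len = len; best_dist = d; }
        }

        if (best_len >= 3) {
            /* type B: 10mmmmmmmmmmmnnn, distance m, length n+3 */
            uint16_t code = (uint16_t)(0x8000 | (best_dist << 3) | (best_len - 3));
            out[o++] = (uint8_t)(code >> 8);
            out[o++] = (uint8_t)(code & 0xFF);
            i += best_len;
        } else if (in[i] == ' ' && i + 1 < in_len &&
                   in[i + 1] >= 0x40 && in[i + 1] <= 0x7F) {
            /* type C: "space + char" packed into a single byte */
            out[o++] = (uint8_t)(in[i + 1] | 0x80);
            i += 2;
        } else if (in[i] == 0x00 || (in[i] >= 0x09 && in[i] <= 0x7F)) {
            out[o++] = in[i++];                 /* the byte represents itself */
        } else {
            /* type A: wrap a short run of bytes the other codes cannot
               carry (accented/high characters and control codes 01..08) */
            size_t n = 0;
            while (n < 8 && i + n < in_len &&
                   !(in[i + n] == 0x00 ||
                     (in[i + n] >= 0x09 && in[i + n] <= 0x7F)))
                n++;
            out[o++] = (uint8_t)n;
            while (n--)
                out[o++] = in[i++];
        }
    }
    return o;
}

Because the compressor has no memory or speed limits in the environment
defined above, a real implementation could afford a far more thorough
match search than this brute-force loop.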
Argument
--------
This compression is basically trivial Huffman (the space character
takes one bit; type C codes) followed by a sliding window mechanism
(type B codes). There is a special mechanism to handle accented chars,
since they interfere with the above two codings (type A).

This coding reduces bulk prose to 53%-57% of its original size. Since
PKZIP yields 48%-52%, this new coding is not bad. The implementation on
the Pilot microprocessor uses no RAM other than a 4k decode buffer,
takes about 32ms per block on a 2 MIPS micro, and consumes about 70
bytes of code. It does not recurse, and handles all the high
characters.

Action
------
I expect that some of you will not trust these numbers. If you were
expecting 15%-30% file sizes, I can understand this. If anyone feels
that there is another way, I will gladly accept and analyze alternate
algorithms, as long as they can execute within the environment
described above.

If you are about to invest effort in compressing large amounts of data
for the Pilot, I suggest that you wait until the Dec 1 timeframe before
you begin, just in case another technique comes along. If you haven't
seen an announcement in this forum by Dec 1, you can assume that the
compression used in PilotDOC is stable and long-term.

Pat Beirne
patb@corel.com