PilotDOC Compression

The following is a paper I wrote in the first days after the release of the compression for PilotDOC, now known as AportisDoc. It was written to ask for constructive criticism. Well, time has passed, and it's too late to change it, or to criticize it :) But the algorithm hasn't changed.

To users of the Doc Reader program.

PilotDOC available at 
http://www.concentric.net/~rbram/doc.shtml

Over the next few days, you will see the introduction of
compression to the Doc Reader. This compression usually yields
data files which are 53%-58% of their original size. 
The compression is optimized for readable text; it is indifferent
to upper and lower case.

I wrote the code that is going to be included in the Doc Reader,
and I wanted to address some issues with you, the user.

From recent usenet and list traffic, I know some of you are
expecting files that are 15%-20% of their original size.
I was not able to achieve these figures, nor do I believe it
is possible. The rest of this note is a discussion of why.

As is often seen on comp.compression, discussions of this
sort can turn inflammatory. This note is intended to set the
ground rules for the expected inferno.

------------------------------------------------------

Pilot Environment
-----------------

In the Pilot, we have only a tiny bit of working memory.
There is lots of bulk memory, but it is usually protected
from writes. As a rule, 4k is always available, and 10k-12k
usually is.

The Pilot microprocessor, while decent, is not ultra
powerful. We needed to develop a coding scheme that runs
quickly, without deep recursion.

All told, we defined this environment:

compressor:
	no memory, processor or speed limits
decompressor:
	less than .1s using a 2 MIPS processor
	requires less than 5k of working memory
	recurses less than 10 calls deep
	builds to less than 200 bytes of code
	handles 4k blocks of data, each independent
	handles accented chars & tabs

This memory limitation precludes the use of standard
adaptive Huffman coding. These codings usually require 40k
to 128k of memory to hold the tree data.


Other Compressors
-----------------

The industry standard PKZIP compresses by first scanning
the data for repetitive patterns (sliding window), merging
these down to special codes, and then running adaptive
Huffman compression on the total.

When the text base is large, both of these techniques benefit.
Large text files do, in fact, shrink to 35% of their original
size. Smaller files, of 4k or so, usually shrink to 50% of
their original size.

The sliding window mechanism does not require much memory
to decode, so I use it. The adaptive Huffman mechanism
does use lots, so I have chosen a trivial subset of Huffman
(one char; see below).

There have been compressors written since PKZIP which
slightly improve on the compression. Almost all of them
use the same techniques. A couple of recent ones used a
sort-and-Huffman technique, which is also memory intensive.


This Compression
----------------

This compression is best described by defining the 
decode process.

Given a compressed stream, read a byte.
The byte will lie in the following zones:
00      represents itself
01..08  type A command; read the next n bytes
09..7F  represents itself
80..BF  type B command; read one more byte
C0..FF  type C command; represents "space + char"

The codes 00 and 09..7F simply represent themselves; copy the byte
to the decode buffer.

The type A codes indicate that the next n bytes are a string of
raw data. The value of n is 1 to 8, exactly the value of
the type A code. This code is used to wrap accented and high
characters.

The type B codes indicate a sliding window sequence. Read
the next byte after the command code, and join the two bytes
to form a 16-bit number. Throw away the top two bits;
they should be 10 anyway. Take the next 11 bits and form
a number m from 0 to 2047. Take the last 3 bits
and form a number n from 0 to 7.
(m==0 should never occur.)
Reach back into the data that you have decoded already,
by m bytes. Copy (n+3) bytes from the old data address to the
new data address.

The type C codes indicate a special sequence of
"space plus char". The code that you read as the type C code
represents a space character, followed by the ASCII char
formed by the lower 7 bits of the code. [This is a single-byte
code.]
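
To make the three code types concrete, here is a minimal C sketch
of the decode loop described above. The function name and the
buffer interface are my own choices for illustration; this is not
the code that ships in the Doc Reader.

    #include <stddef.h>

    /* Decode one compressed block into "out"; returns the number of
       bytes produced.  A sketch only -- it assumes well-formed input. */
    size_t doc_decode(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_max)
    {
        size_t i = 0, o = 0;

        while (i < in_len && o < out_max) {
            unsigned char c = in[i++];

            if (c == 0x00 || (c >= 0x09 && c <= 0x7F)) {
                /* 00 and 09..7F: the byte represents itself */
                out[o++] = c;
            } else if (c <= 0x08) {
                /* type A: the next c (1..8) bytes are raw data */
                while (c-- > 0 && i < in_len && o < out_max)
                    out[o++] = in[i++];
            } else if (c <= 0xBF) {
                /* type B: join the two bytes, drop the top "10" bits,
                   split into 11 bits of distance and 3 bits of length */
                unsigned int pair, m, n;
                if (i >= in_len)
                    break;
                pair = ((unsigned int)c << 8) | in[i++];
                m = (pair >> 3) & 0x07FF;    /* reach back 1..2047 bytes */
                n = (pair & 0x0007) + 3;     /* copy 3..10 bytes */
                if (m == 0 || m > o)
                    break;                   /* never happens in valid data */
                while (n-- > 0 && o < out_max) {
                    out[o] = out[o - m];     /* byte by byte, so an
                                                overlapping copy works */
                    o++;
                }
            } else {
                /* type C: a space plus the ASCII char in the low 7 bits */
                out[o++] = ' ';
                if (o < out_max)
                    out[o++] = c & 0x7F;
            }
        }
        return o;    /* number of decoded bytes */
    }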

Argument
--------
This compression is basically trivial Huffman (the space character
takes one bit; the type C codes) followed by a sliding window mechanism
(the type B codes). There is a special mechanism to handle accented
chars, since they interfere with the two codings above (the type A codes).
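
The note above defines only the decoder, since the compressor runs
on the desktop with no limits. Purely as an illustration (my own
sketch, not the compressor that ships with PilotDOC), a greedy
encoder consistent with that format might look like this:

    #include <stddef.h>
    #include <string.h>

    /* Encode one block; "out" must be large enough (the worst case is
       slightly larger than the input).  Greedy choices throughout. */
    size_t doc_encode(const unsigned char *in, size_t in_len,
                      unsigned char *out)
    {
        size_t i = 0, o = 0;

        while (i < in_len) {
            /* 1. longest sliding-window match: 3..10 bytes,
                  at most 2047 bytes back */
            size_t best_len = 0, best_dist = 0;
            size_t max_dist = (i < 2047) ? i : 2047;
            size_t d;
            for (d = 1; d <= max_dist; d++) {
                size_t len = 0;
                while (len < 10 && i + len < in_len &&
                       in[i + len - d] == in[i + len])
                    len++;
                if (len >= 3 && len > best_len) {
                    best_len = len;
                    best_dist = d;
                }
            }
            if (best_len >= 3) {
                /* type B: "10" + 11 bits of distance + 3 bits of length-3 */
                unsigned int pair = 0x8000u |
                                    ((unsigned int)best_dist << 3) |
                                    (unsigned int)(best_len - 3);
                out[o++] = (unsigned char)(pair >> 8);
                out[o++] = (unsigned char)(pair & 0xFF);
                i += best_len;
                continue;
            }

            /* 2. type C: fold "space + char" into one byte when the
                  char lies in 40..7F */
            if (in[i] == ' ' && i + 1 < in_len &&
                in[i + 1] >= 0x40 && in[i + 1] <= 0x7F) {
                out[o++] = (unsigned char)(0xC0 | (in[i + 1] & 0x3F));
                i += 2;
                continue;
            }

            /* 3. 00 and 09..7F stand for themselves; anything else
                  goes into a type A run of 1..8 raw bytes */
            if (in[i] == 0x00 || (in[i] >= 0x09 && in[i] <= 0x7F)) {
                out[o++] = in[i++];
            } else {
                size_t run = 0;
                while (run < 8 && i + run < in_len &&
                       !(in[i + run] == 0x00 ||
                         (in[i + run] >= 0x09 && in[i + run] <= 0x7F)))
                    run++;
                out[o++] = (unsigned char)run;
                memcpy(out + o, in + i, run);
                o += run;
                i += run;
            }
        }
        return o;
    }

A real compressor is free to search for matches more cleverly,
since the environment above places no limits on it; the decoder
does not care how the codes were chosen.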

This coding reduces bulk prose to 53%-57% of its original size.
Since PKZIP yields 48%-52%, this new coding is not bad.

The implementation on the Pilot microprocessor uses
no RAM other than a 4k decode buffer, takes about 32ms
per block on a 2 MIPS micro, and consumes about 70 bytes of code.
It does not recurse, and handles all the high characters.


Action
------

I expect that some of you will not trust these numbers.
If you were expecting 15%-30% file sizes, I can understand this.

If anyone feels that there is another way, I will gladly
accept and analyze alternate algorithms, as long as they
can execute within the environment described above.

If you are about to invest effort in compressing large amounts
of data for the Pilot, I suggest that you wait until the Dec 1
timeframe before you begin, just in case another
technique comes along. If you haven't seen an announcement in
this forum by Dec 1, you can assume that the compression
that is used in PilotDOC is stable and long-term.

Pat Beirne
patb@corel.com
