Pair Extraction Algorithm (PEA)

Summary

Also known as Bigram Replacement Algorithm, Pair Extraction Algorithm (hereafter referred to as PEA) is a lossless compression algorithm designed to compress blocks of plain text at a high level. The resulting compressed data is still plain text but occupies less characters than the original data. This has the benefit of enabling blocks of text to be stored in their original document format, yet be compressed.

PEA was primarily designed to compress textual content of XML elements and as such, produces compressed data which can be stored within XML elements and still allow the XML document to be well-formed.

An initial implementation of PEA was done using PHP, to test the potential of the algorithm. Given a 3823 word conference paper, PEA produced compressed output 91.2%% of the original size (an 8.8% saving). Given an 1081 word text block, PEA produced compressed output 88.1% of the original size (an 11.9% saving). After a series of tests (compressing text blocks of various sizes), the highest compression achived was a 13% saving. With potential additional explotation of the character set, a theoretical maximum compression saving of up to 20% could be achived; the reasons for this limitation will be covered in a latter section of this document.

Concept

The English language contains a large number of patterns which can be exploited in various ways when designing a compression format. PEA relies on the proliferation of common pairs of characters called bigrams; common bigrams being pairs of characters such as 'th', 'ng' or 'er'. The basic premise behind PEA is replacing bigrams with single character holders; or more simply, replacing pairs of characters within the document with a single holder character. Some reference of which bigram has been replaced with which holder needs to be included in the compressed data block. The compression process must be reversable, to perform the decompression process and restore the original data.

Selecting holders

When presented with a block of text, any holders selected to replace bigrams must not already occur within the text. With most bodies of text this severly limits the number of holder characters available; generally to use of non-alphanumeric characters. Even using non-alphanumeric characters, for each potential holder character, the block of text must be searched to ensure the desired holder is not already contained within the text. Using a character as a holder which is already present within the block of text would result in incorrect replacement of characters with bigrams during the decompression process and corrupt the output. Even with a large amount of bigrams that could be replaced, a lack of holders reduces the compression possible.

Clearing space in the character set

In order to ensure that for any block of text (regardless of its use of non-alphanumeric characters) some holders were available, the notion of using uppercase characters as holders was adopted. Obviously, this could conflict with existing characters in the text block (causing corruption during decompression). Replacing uppercase characters with their lowercase counterparts before replacing bigrams with uppercase holders would result in lossy compression; all characters within the text block would become lowercase, having passed through the compression and decompression process.

Which characters within the text block where uppercase needed to be recorded before the block was converted to lowercase; this data could then be transmitted in conjunction with the compressed block to enable the text block to be decompressed to its exact original form. This was achieved with a simple index method; when the text block is first received by the compression function the index value (character position) of all uppercase characters must be recorded. For example, in the following sonnet:

Some glory in their birth, some in their skill,
Some in their wealth, some in their block's force,
Some in their garments though new-fangled ill;
Some in their hawks and hounds, some in their horse;
And every humour hath his adjunct pleasure,
Wherein it finds a joy above the rest:
But these particulars are not my measure,
All these I better in one general best.
Thy love is better than high birth to me,
Richer than wealth, prouder than garments' cost,
Of more delight than hawks and horses be;
And having thee, of all men's pride I boast:
Wretched in this alone, that thou mayst take
All this away, and me most wretched make.

The following uppercase character index values would be recorded:

0 48 99 146 199 243 282 324 334 364 406 455 497 533 542 587

The text block can then be converted to lowercase and bigram replacement with uppercase holders performed. The index values (and suitable data mapping bigrams to uppercase holders) must then be transmitted in conjuction with the compression block.

Optimising the uppercase index

Two issues became apparent.

The first was the number of characters required to store the index values for uppercase characters. Even for a short sonnet, index values already reached three figures. Factoring in the requirement for a delimiter (in this case a space) between index entries, storing the index for one uppercase character required between two and four characters. How could this be reduced?

Base 10 is not a particularly efficent system in terms of characters required. Base 36 (0 to 9 then a to z) is more efficient at storing large numbers in fewer characters (in comparison to base 10). By storing the index values in base 36 instead of base 10, the uppercase index list can be reduced from:

0 48 99 146 199 243 282 324 334 364 406 455 497 533 542 587 (base 10)

to:

0 1c 2r 42 5j 6r 7u 90 9a a4 ba cn dt et f2 gb (base 36)

The second issue related to acronyms. Most people omit the full stops from acronyms, simply using uppercase characters: XML; PHP; XHTML; AJAX. For the following sentance:

"We develop web applications using XHTML, CSS, XML, PHP, AJAX and MySQL."

The uppercase indexes recorded would be:

0 36 37 38 39 40 43 44 45 48 49 50 53 54 55 58 59 60 61 67 69 70 71 (in base 10, for clarity)

0 10 11 12 13 14 17 18 19 1c 1d 1e 1h 1i 1j 1m 1n 1o 1p 1v 1x 1y 1z (in base 36)

The large number of uppercase characters within the sentance make the index recording method generate bloated results. Using a basic form of numerical run length encoding (recording the position of the first uppercase character, followed by the length of the sequential block of uppercase characters) character savings can be made. A delimiter is required between the starting index for a run and the number of characters within a run (in this case a '-' character is used).

So what was:

0 36 37 38 39 40 43 44 45 48 49 50 53 54 55 58 59 60 61 67 69 70 71 (in base 10, for clarity)

becomes:

0 36-5 43-3 48-3 53-3 58-4 67 69-3 (in base 10, for clarity)

In base 36:

0 10 11 12 13 14 17 18 19 1c 1d 1e 1h 1i 1j 1m 1n 1o 1p 1v 1x 1y 1z

becomes:

0 10-5 17-3 1c-3 1h-3 1m-4 1v 1x-3

These two simple additions to the uppercase index recording method allow the same information to be stored in significantly fewer characters. The requirement of including the uppercase index with the compressed text block adds an additional overhead which reduces compression levels achieved, but it is required to ensure lossless compression is performed.

Bigram frequency

Using the methods previously described, a minimum of 26 holders are available for any block of text being compressed ('A' through 'Z'). Certain bigrams occur with a greater frequency within the English language; 'th' being one of the most common. However, for a given block of text, the most freqeuent bigrams within that block may not match the general frequency within the whole English language. In order to provide the best compression possilble it was benefical to detect the frequency of bigrams specific to that particular block of text and allocated holders based on the bigram frequency. A list of all bigrams for a given block of text must be produced, with a numeric score for each bigram. The bigrams list must then be sorted from most frequent (those with the highest score) to least frequent (those with the lowest score). Holders 'A' to 'Z' are then allocated to the 26 most frequent bigrams within the block. Some record of which holder was assigned to which bigram must be recorded and included in the compressed block. Continuing with the sonnet example, this could be done in a lookup table:

A th
B in
C me
D ei
E an
F so
G or
H ou
I st
J ne
K ha
L he
M om
N er
O re
P se
Q is
R gh
S al
T nd
U be
V lo
W ll
X wk
Y it
Z ma

However, this requires five characters to store each holder-bigram lookup (the holder, a delimiter, the bigram and another delimiter). Since holders are always single characters and bigrams always two characters, the delimiters can be omitted, giving:

AthBinCmeDeiEanFsoGorHouIstJneKhaLheMomNerOrePseQisRghSalTndUbeVloWllXwkYitZma

For every compressed block, a similar pattern emerges though. The bigrams being replaced may differ (depening upon their frequency within the text block) but the holders always appear in the same order (since they are allocated sequentially):

A..B..C..D..E..F..G..H..I..J..K..L..M..N..O..P..Q..R..S..T..U..V..W..X..Y..Z.. (where .. are different bigrams)

Since the first 26 holders allocated are always 'A' through 'Z' and bigrams are always two characters in length, we simply need to store a list of 26 bigram:

thinmeeiansooroustnehaheomerreseisghalndbelollwkitma

This block of 52 characters (26 bigrams) can be attached to the compressed text block and extracted by the decompression function to rebuild the lookup table.

PEA compression format: uppercase holders only

To give a complete compression example, we start with the source text block to be compressed:

Some glory in their birth, some in their skill,
Some in their wealth, some in their block's force,
Some in their garments though new-fangled ill;
Some in their hawks and hounds, some in their horse;
And every humour hath his adjunct pleasure,
Wherein it finds a joy above the rest:
But these particulars are not my measure,
All these I better in one general best.
Thy love is better than high birth to me,
Richer than wealth, prouder than garments' cost,
Of more delight than hawks and horses be;
And having thee, of all men's pride I boast:
Wretched in this alone, that thou mayst take
All this away, and me most wretched make.

We record the position of all uppercase characters within the text block:

0 48 99 146 199 243 282 324 334 364 406 455 497 533 542 587

Blocks of uppercase characters have run length encoding applied (none occur in this example).

The uppercase indexes are converted to base 36:

0 1c 2r 42 5j 6r 7u 90 9a a4 ba cn dt et f2 gb

We can now convert the text block to lowercase:

some glory in their birth, some in their skill,
some in their wealth, some in their block's force,
some in their garments though new-fangled ill;
some in their hawks and hounds, some in their horse;
and every humour hath his adjunct pleasure,
wherein it finds a joy above the rest:
but these particulars are not my measure,
all these i better in one general best.
thy love is better than high birth to me,
richer than wealth, prouder than garments' cost,
of more delight than hawks and horses be;
and having thee, of all men's pride i boast:
wretched in this alone, that thou mayst take
all this away, and me most wretched make.

We now produce a list of all bigrams within the text block, record their frequency and sort the list (more frequent to less frequent):

th 25
in 12
me 9
ei 7
an 7
so 7
or 5
ou 4
st 4
ne 4
ha 4
he 4
om 4
er 4
re 4
se 4
is 3
gh 3
al 3
nd 3
be 3
lo 3
ll 3
wk 2
it 2
ma 2
il 2
ic 2
rs 2
ke 2
gl 2
hi 2
ur 2
as 2
ve 2
ir 2
tc 2
ga 2
nt 2
ar 2
de 2
we 2
tt 2
hy 1
to 1
pl 1
co 1
pr 1
ct 1
ee 1
ri 1
ta 1
ut 1
of 1
pa 1
no 1
my 1
es 1
ul 1
ab 1
bo 1
ay 1
jo 1
ad 1
fa 1
aw 1
un 1
sk 1
ho 1
en 1
ce 1
ck 1
at 1
ds 1
ea 1
ry 1
hu 1
mo 1
rt 1

We take the 26 most frequent bigrams and replace them in the text block with sequentially allocated uppercase holders ('A' through 'Z'). Note that 'A' is not allocated to the first bigram encounted, but the most frequent (hence why the first bigram in this example is not replaced with 'A'):

B ADr birA, FC B ADr skiW,
FC B ADr weSA, FC B ADr bVck's fGce,
FC B ADr garCnts AHR Jw-fEgled iW;
FC B ADr KXs Ed hHTs, FC B ADr hGP;
Ed evNy humHr KA hQ adjunct pleasuO,
wLOB Y fBds a joy above Ae OI:
but AeP particulars aO not my CasuO,
Sl AeP i UttN B oJ geJrS UI.
Ay Vve Q UttN AE hiR birA to C,
ricLr AE weSA, prHdN AE garCnts' coI,
of mGe deliRt AE KXs Ed hGPs U;
Ed KvBg Aee, of Sl Cn's pride i boaI:
wOtcLd B AQ SoJ, Aat AH ZyI take
Sl AQ away, Ed C moI wOtcLd Zke.

We now combine the bigram uppercase holder assignment block:

thinmeeiansooroustnehaheomerreseisghalndbelollwkitma

With a delimiter:

.

Followed by the uppercase indexes in base 36 format:

0 1c 2r 42 5j 6r 7u 90 9a a4 ba cn dt et f2 gb

Followed by another delimiter:
.

Finally followed by the compressed source text:

B ADr birA, FC B ADr skiW,
FC B ADr weSA, FC B ADr bVck's fGce,
FC B ADr garCnts AHR Jw-fEgled iW;
FC B ADr KXs Ed hHTs, FC B ADr hGP;
Ed evNy humHr KA hQ adjunct pleasuO,
wLOB Y fBds a joy above Ae OI:
but AeP particulars aO not my CasuO,
Sl AeP i UttN B oJ geJrS UI.
Ay Vve Q UttN AE hiR birA to C,
ricLr AE weSA, prHdN AE garCnts' coI,
of mGe deliRt AE KXs Ed hGPs U;
Ed KvBg Aee, of Sl Cn's pride i boaI:
wOtcLd B AQ SoJ, Aat AH ZyI take
Sl AQ away, Ed C moI wOtcLd Zke.

Producing a compressed text block of:

thinmeeiansooroustnehaheomerreseisghalndbelollwkitma.0 1c 2q 41 5i 6q 7t 8z 99 a3 b9 cm ds es f1 gb.B ADr birA, FC B ADr skiW,
FC B ADr weSA, FC B ADr bVck's fGce,
FC B ADr garCnts AHR Jw-fEgled iW;
FC B ADr KXs Ed hHTs, FC B ADr hGP;
Ed evNy humHr KA hQ adjunct pleasuO,
wLOB Y fBds a joy above Ae OI:
but AeP particulars aO not my CasuO,
Sl AeP i UttN B oJ geJrS UI.
Ay Vve Q UttN AE hiR birA to C,
ricLr AE weSA, prHdN AE garCnts' coI,
of mGe deliRt AE KXs Ed hGPs U;
Ed KvBg Aee, of Sl Cn's pride i boaI:
wOtcLd B AQ SoJ, Aat AH ZyI take
Sl AQ away, Ed C moI wOtcLd Zke.

At the decompression stage we first seperate the bigram holder assignement block, uppercase index and compressed source text block (utilising the full stop delimiters). All content from the beginning of the block to the first occurence of a full stop character is recognised as the bigram holder assignment block. All content between the first occurence of a full stop and the second occurence of a full stop is recognised as the uppercase indexes in base 36 format. All content from the second occurence of a full stop to the end of the block is recognised as the compressed source text.

The bigram holder assignment block is then spilt up into pairs of characters (the block is always an even number in length). The first pair from the block is assumed to be a birgram represented by the holder 'A'; the second pair from the block assumed to be a birgram represented by the holder 'B', and so forth. The uppercase holders within the compressed source block are then replaced with the corresponding bigram they represent.

Next, the uppercase indexes are converted back to base 10 (and any run length encoding expanded). and finally the characters at the specified indexes are converted into their uppercase form. The text block is now fully reconstructed.

Given the sonnet example, which is initially 628 characters in length, once compression headers (bigram holder assignement, uppercase index and delimiters) are accounted for (they must be attached to the compressed source text), the total length is reduced to 582 characters. The length is reduced by 46 characters; a 7.3% saving.

Given a larger block of five sonnets which is 3074 characters in length, the compressed output (inclusive of overheads) is 2828 characters in length; a saving of 246 characters, or 8%.

Improving compression: potential holders

Often, with blocks of text above a certain length, the limitiation of the compression is a lack of holders available. More than 26 bigrams are present (each occuring multiple times) but only the 26 most frequently occuring are assigned uppercase holders. Much trickery was required to ensure a minimum of 26 holders (the characters 'A' though 'Z') were not present within the text block and available for use as holders.

Character sets consist of more than simply 'a' through 'z' and 'A' through 'Z' characters. Digits, punctuation & other characters may not be present within the text block and are potential holders. Certain characters such as the full stop and comma are found in almost all text blocks presented for compression (perhaps with the exception of text authored by those who give grammarians nightmares). However, some characters such as '{' or '#' are rarely found in text blocks, perhaps with the exception of technical documents or blocks of code or markup.

A list of these potential holders can be created for a particular character set and the text block checked for inclusion of each potential holder; a potential holder can only be used as a holder if it is not present with the source text. Since these are optionally used, their allocation sequence may not match the potential holders list accessible to the compression and decompression functions. This requires that when recording the bigram being replaced, the holder character must be included (since it is not guarenteed to be the next one on the list held by the decompression function).

Continuing with the five sonnets block example, the 26 most frequent bigrams would be replaced with 26 uppercase holders and the following string be included in the compressed text block:

thinouoranmestsehailenheesissoitndtoervehirsatrearof

For each potential holder, the souce text is checked ensure it does not already contain the character. If the character is not used, it can be utilised as a holder. The holder assignment process continues. replacing the 27th most frequent bigram with the next unused holder.

This process continues until all bigrams have been replaced by holders; or, more likely, all potential holders have been tested and either utilised as holders or not used (since the characters were already present within the source text).

A list of potential holders might be:

0 1 2 3 4 5 6 7 8 9 ! # $ % ( ) * + - / : ; = ? @ [ ] ^ _ ` { } | ~

These are characters which may be found in blocks of text, but the overhead of checking for their presence within a text block compared to their potential saving if they are not present and may be used as holders is a suitable trade-off. This is by no means an exhaustive list, even for the basic Latin 1 character set.

Bigrams using potential holders then need to have their decompression lookup data added to the block:

li0ho1wl2be3ll4ei5ma6as7we8lo9om#no$on%gh(ee)do*ur+ir/ow=gl@ke[ch]my^ri_fe`ce{ut}el|ed~

These are stored in the format of the bigram followed immediately by the holder character. No delimiters are required since each bigram is always two characters and each holder always a single character.

No delimiter is required betweeen the uppercase holder assignment block and the potential bigram-holder assigment block; the uppercase block will occupy 52 characters and any content between the end of uppercase block and the end of holder assignment block delimiter is potential holder assigment data. Uppercase holders are always allocated before potential holders; if the uppercase block is less than 52 characters in length, less than 26 bigrams were detected in the source text block and no potential headers were utilised.

The resulting output from compressing the five sonnets example using uppercase holders and the potential holders listed above results in a file 87.3% the size of the original: a saving of 12.7%.