An inconvenient problem

Rupture has offered very promising results for compression attacks so far. In the following paragraphs I will describe how we exploit compression and an inherent problem that we had failed to recognize until recently.

How compression works

DEFLATE is the composition of two compression algorithms, LZ77 and Huffman. These algorithms are applied on the plaintext in that order, first LZ77 and then Huffman on the output of LZ77. Both algorithms remove repetitions of common strings in the plaintext, although they approach it differently.

LZ77 removes repetitions by replacing common strings with pointers. Specifically, it keeps track of a 32 Byte window of the plaintext. For each string it attempts to compress, it checks for repetitions in the window. If the string hasn't appeared before, it is appended as a literal in the compressed output. If the string appeared in the plaintext within the window, all of its characters are replaced with a two-part pointer, the one part describing the position of the previous appearance and the other describing the length of the string, as shown in the following example.

Plaintext:

Hello, world! I love you. Hello, world! I hate you.  

Compressed text:

Hello, world! I love you. (26,16)hate(26,5)  

In the end, the LZ77-compressed text contains both literal characters and pointers. The literals and the pointers are treated as symbols by the Huffman coding that follows.

Huffman replaces the input symbols with bitstreams of varying lengths. During the first part of the algorithm a table of frequencies is created, which contains all symbols in the input. In our case, a symbol is either a literal or a pointer. The next step is creating the coding by assigning a bitstream for each symbol, where the most frequent symbols are assigned a smaller bitstream compared to the more rare. The coding is prefix free, meaning that any small bitstream a cannot be the prefix of a larger bistream b. The output is then generated by replacing the input symbols with the bitstreams assigned on the first step. The final output contains both the table of the symbol and bitstream assignments and the compressed output.

The Huffman problem

Let's now consider a minimal testcase of compression. Say we have the plaintext "abcab" and we can append a string of our choice to it. Our goal is to recognize that the character "c" follows the first occurence of "ab". Now, we could try the characters "a", "b", "c" and see which one results in smaller compressed output. This is a correct first step, since we assume that LZ77 will compress better with the correct character. Indeed, the LZ77 output would be "abcaba", "abcabb" and "abc(3,3)" respectively, where (3,3) is a pointer.

However, Huffman then gets into action. In the first case we have 3 "a", 2 "b" and 1 "c". In the second case we have 2 "a", 3 "b" and 1 "c". In the third case we have 1 "a", 1 "b", 1 "c" and 1 (3,3). The above are gathered in the following table for better display. It is obvious that the Huffman coding will vary in all cases and as a result the final output is unpredictable, as either of the two wrong characters ("a" and "b") may result in better Huffman compression compared to "c".

  • abcaba → 3xa 2xb 1xc
  • abcabb → 2xa 3xb 1xc
  • abc(3,3) → 1xa 1xb 1xc 1x(3,3)

So, the basic problem is that by trying different characters we change the letter frequency for the Huffman coding. The solution that we have used up till now is the Huffman pool. When using this method, we try the strings "a^b^c", "b^a^c", "c^a^b" instead of "a", "b", "c". The results now are shown in the table below. These results are more optimistic, since the wrong characters demonstrate the same letter frequency and only the correct one results in largely different symbol frequencies for the Huffman coding.

  • abcaba^b^c → 3xa 3xb 2xc 2x^
  • abcabb^a^c → 3xa 3xb 2xc 2x^
  • abc(3,3)^a^b → 2xa 2xb 1xc 2x^ 1x(3,3)

A better result, but still not ideal, occurs for a known prefix larger than 2 characters, such as with the plaintext "abcdabc":

  • abcd(4,3)a^b^c^d → 2xa 2xb 2xc 2xd 3x^ 1x(4,3)
  • abcd(4,3)b^a^c^d → 2xa 2xb 2xc 2xd 3x^ 1x(4,3)
  • abcd(4,3)c^a^b^d → 2xa 2xb 2xc 2xd 3x^ 1x(4,3)
  • abcd(4,4)^a^b^c → 2xa 2xb 2xc 1xd 3x^ 1x(4,4)

However, the fundamental problem still exists. The fact that the correct character results in different Huffman coding may again result in unpredictable output lengths, since the Huffman algorithm may hide the LZ77 advantage that the correct candidate earns. We have managed to get the same Huffman coding for all wrong characters, so in many cases we can assume that the one character that does not follow the herd, i.e. the compressed text is larger or smaller that all the others, is the correct one. There can be cases though when the Huffman difference for the wrong characters is exactly the LZ77 one character advantage for the correct one. Can we do better?