Compressed LZSS File Format: Difference between revisions
(compression) |
|||
Line 59: | Line 59: | ||
==Decompression Code== | ==Decompression Code== | ||
===Unknown Input Length=== | |||
// will return the actual number of input bytes discovered, or -1 on error. | |||
int ExpandUnknownInputLength(const byte *in, byte *OutBuf,long outlen) | int ExpandUnknownInputLength(const byte *in, byte *OutBuf,long outlen) | ||
{ | { | ||
Line 116: | Line 119: | ||
if (ReadChecksum == CalculatedChecksum) return pi+4; | if (ReadChecksum == CalculatedChecksum) return pi+4; | ||
return -1; | return -1; | ||
} | |||
===ExpandPbo=== | |||
// input length is always known, making code a little less tedious | |||
bool unpack_data(const byte *CompressedBuffer, int PackedSize, byte *DeCompressedBuffer,int DecompressedBufferSize) | |||
{ | |||
const byte *CompressedPtr; | |||
byte *DecompressedPtr; | |||
size_t offset; | |||
int tmp, count; | |||
unsigned ReadChecksum, CalculatedChecksum; | |||
int rpos, rlen; | |||
int chunk, size; | |||
uchar FlagByte; | |||
byte b1, b2; | |||
if ( (PackedSize-=4)<=0) return false; // no room for checksum | |||
size = 0; | |||
DecompressedPtr =DeCompressedBuffer; | |||
/* Each data block is preceded by a byte telling us what to do with | |||
* the next 8 bytes. | |||
*/ | |||
for (offset=0;offset < PackedSize; ) | |||
{ | |||
FlagByte = CompressedBuffer[offset++];// grab the encoding byte | |||
for (count = tmp=1;tmp < 256 && offset < PackedSize ; tmp*=2, count++) | |||
{ | |||
CompressedPtr = &CompressedBuffer[offset]; | |||
size = DecompressedPtr - DeCompressedBuffer; | |||
if (FlagByte & tmp) /* Append the byte to output */ | |||
{ | |||
*DecompressedPtr++ = *CompressedPtr; | |||
offset++; | |||
} | |||
else | |||
{ | |||
/* Read a pointer */ | |||
b1 = *CompressedPtr; | |||
b2 = *(CompressedPtr + 1); | |||
offset += 2; | |||
rpos = size - b1 - 256 * ( b2/16 ); | |||
rlen = b2 - 16 * (b2 / 16) + 3; | |||
if ((rpos + rlen) < 0) | |||
{ | |||
size+=rlen; | |||
while (rlen--) *DecompressedPtr++ = ' '; | |||
} | |||
else | |||
if ((DeCompressedBuffer + rpos) > DecompressedPtr) return false;//PBO_STS_OUT_OF_BOUNDS; | |||
/* PAD the file with a block from another place in the file */ | |||
else | |||
if ((rpos + rlen) <= size) | |||
{ | |||
while (rpos < 0) | |||
{ | |||
*DecompressedPtr++ = ' '; | |||
rlen--; | |||
rpos++; | |||
} | |||
memcpy(DecompressedPtr, &DeCompressedBuffer[rpos], rlen); | |||
DecompressedPtr += rlen; | |||
size += rlen; | |||
} | |||
/* PAD the file with the block until size is rpos + rlen */ | |||
else | |||
if ((rpos + rlen) > size) | |||
{ | |||
chunk = size - rpos; | |||
while (rlen > 0) | |||
{ | |||
if (chunk > rlen) chunk = rlen; | |||
if (!chunk) return false;//PBO_STS_CHUNK_ZERO; | |||
memcpy(DecompressedPtr, &DeCompressedBuffer[rpos], chunk); | |||
DecompressedPtr += chunk; | |||
size += chunk; | |||
rlen -= chunk; | |||
} | |||
} | |||
} | |||
}//while temp | |||
}//while offset | |||
/* Last 4 bytes of the packed data is the checksum, unsigned int */ | |||
memcpy(&ReadChecksum, &CompressedBuffer[PackedSize], sizeof(ReadChecksum)); | |||
for (CalculatedChecksum = 0,DecompressedPtr = DeCompressedBuffer; DecompressedPtr < &DeCompressedBuffer[DecompressedBufferSize];) CalculatedChecksum+= *DecompressedPtr++; | |||
if (ReadChecksum != CalculatedChecksum) return false;//PBO_STS_CHECKSUM; | |||
return true; //PBO_STS_NO_ERROR; | |||
} | } |
Revision as of 08:18, 19 February 2009
Introduction
Lev Zimpel (LZ) compression is a form of run length encoding that was first introduced in 1983. Since that time, it has had many variants and 'improvements' and constitutes the base form of many archive file formats such as zip, pkzip, 7Zip, LHarc, gunzip, rar. Each, with (sometimes substantial) variants on the theme.
The most popular variants being
- LZH
- LZW
- LZSS
BI use an improved version of the original LZ compression known as LZSS: Lempel, Ziv, Storer, and Szymanski Encoding and Decoding
The patent for LZSS can be found Here
The patent describes the overall methodology, essentially, an improved-way-of-doing-things over the earlier LZ compression, without specific implementations (of which, there can be many).
The essential modifiable ingredients to LZSS are
- The number of bits making up a flag to indicate raw data follows, or, a 'pointer'
- The number of bits in raw data.
- The number of bits in the pointer for relative offset, versus run length.
- The size of the 'sliding window' that constitutes the dictionary for compression.
- The maximum number of characters to match in that dictionary.
- An agreed 'space filler'.
All of which is implementation dependent.
As it applies to BI, they choose to use LZSS:8bit
- FlagBits = 8
- Data = Byte.
- 12bit offset, 4bit run length. (two bytes)
- 4096 byte 'sliding window'.
- 18 character best match.
- 'space filler' = 0x20
For more on specific methods of implementation visit here
General
If the 'bit' in the flag is a 1, it's a raw data byte, otherwise a 2 byte (16bit) 'pointer' follows.
A mixture of 8x raw data bytes and /or 8x pointers follow each flagbyte.
The pointer consists of a 4bit run length, and a 12 bit relative offset.
All / most / some / none of offset points into a previously built part of the output. A 4096 byte sliding window. As each byte is progressively added to the output, the window 'slides'.
The 4k sliding window, is, the dictionary for the compression.
It is quite possible before first 4k of output has been constructed, that the offset will refer to a larger value than that actually built.
An 'agreed' value for this phantom buffer is established. In the case of BI's implementation, it is the space character (0x20).
The format of the 'pointer' is unfortunately AAAA LLLL AAAAAAAA, requiring a bit of shift mask fiddling. Very clearly, as originally implemented, this 'format' came from Big Endian machines such as Motorola, giving a far cleaner value of AAAAAAAA AAAA LLLL
As implemented by BI, there is an additional 4 byte checksum at the end of any compressed block. The checksum is simply an unsigned additive spillover of each byte in the built output.
There are in fact two different structures used by BI in implementing LZSS.
- Input (and output) Size is known
- Only the desired output size is known.
Decompression Code
Unknown Input Length
// will return the actual number of input bytes discovered, or -1 on error.
int ExpandUnknownInputLength(const byte *in, byte *OutBuf,long outlen) { byte Flag; byte bits; long Rpos; byte rlen; int Fl=0; ulong CalculatedChecksum=0; ulong ReadChecksum; int pi=0; byte data; byte *to,*from; while (outlen>0) { Flag=in[pi++]; for (bits=0;bits<8;bits++,Flag>>=1)// up to 8 bytes of data or 8 pointers or { if (Flag&0x01) // raw data { data=in[pi++]; CalculatedChecksum+=data; OutBuf[Fl++]=data; if (!--outlen) goto finish; continue; } Rpos=in[pi++]; rlen=(in[pi]&0x0F) +3; Rpos+=((in[pi++]&0xF0)<<4); while (Rpos>Fl)// special case space fill { CalculatedChecksum+=0x20; OutBuf[Fl++]=0x20; if (!--outlen) goto finish; if (!--rlen) goto stop; } Rpos=Fl-Rpos; from=&OutBuf[Rpos]; to=&OutBuf[Fl]; Fl+=rlen; while (rlen--) { data=*from++; CalculatedChecksum+=data; *to++=data; if (!--outlen) goto finish; } stop: ; } } goto ok; finish: if (Flag&0xFE) return -1; //EXCESS_1BITS; ok: memcpy(&ReadChecksum, &in[pi], sizeof(ReadChecksum)); if (ReadChecksum == CalculatedChecksum) return pi+4; return -1; }
ExpandPbo
// input length is always known, making code a little less tedious
bool unpack_data(const byte *CompressedBuffer, int PackedSize, byte *DeCompressedBuffer,int DecompressedBufferSize) { const byte *CompressedPtr; byte *DecompressedPtr; size_t offset; int tmp, count; unsigned ReadChecksum, CalculatedChecksum; int rpos, rlen; int chunk, size; uchar FlagByte; byte b1, b2; if ( (PackedSize-=4)<=0) return false; // no room for checksum size = 0; DecompressedPtr =DeCompressedBuffer; /* Each data block is preceded by a byte telling us what to do with * the next 8 bytes. */ for (offset=0;offset < PackedSize; ) { FlagByte = CompressedBuffer[offset++];// grab the encoding byte for (count = tmp=1;tmp < 256 && offset < PackedSize ; tmp*=2, count++) { CompressedPtr = &CompressedBuffer[offset]; size = DecompressedPtr - DeCompressedBuffer; if (FlagByte & tmp) /* Append the byte to output */ { *DecompressedPtr++ = *CompressedPtr; offset++; } else { /* Read a pointer */ b1 = *CompressedPtr; b2 = *(CompressedPtr + 1); offset += 2; rpos = size - b1 - 256 * ( b2/16 ); rlen = b2 - 16 * (b2 / 16) + 3; if ((rpos + rlen) < 0) { size+=rlen; while (rlen--) *DecompressedPtr++ = ' '; } else if ((DeCompressedBuffer + rpos) > DecompressedPtr) return false;//PBO_STS_OUT_OF_BOUNDS; /* PAD the file with a block from another place in the file */ else if ((rpos + rlen) <= size) { while (rpos < 0) { *DecompressedPtr++ = ' '; rlen--; rpos++; } memcpy(DecompressedPtr, &DeCompressedBuffer[rpos], rlen); DecompressedPtr += rlen; size += rlen; } /* PAD the file with the block until size is rpos + rlen */ else if ((rpos + rlen) > size) { chunk = size - rpos; while (rlen > 0) { if (chunk > rlen) chunk = rlen; if (!chunk) return false;//PBO_STS_CHUNK_ZERO; memcpy(DecompressedPtr, &DeCompressedBuffer[rpos], chunk); DecompressedPtr += chunk; size += chunk; rlen -= chunk; } } } }//while temp }//while offset /* Last 4 bytes of the packed data is the checksum, unsigned int */ memcpy(&ReadChecksum, &CompressedBuffer[PackedSize], sizeof(ReadChecksum)); for (CalculatedChecksum = 0,DecompressedPtr = DeCompressedBuffer; DecompressedPtr < &DeCompressedBuffer[DecompressedBufferSize];) CalculatedChecksum+= *DecompressedPtr++; if (ReadChecksum != CalculatedChecksum) return false;//PBO_STS_CHECKSUM; return true; //PBO_STS_NO_ERROR; }