Lossless Compression of High Volume Numeric Data
(Extreme Numeric Compression - XNC)
January 2006
Introduction
High volume numeric data may be generated as a result of scientific
simulations, or engineering plant monitoring. The "Extreme
Numeric
Compression" (XNC) algorithm has been developed for the storage
high
frequency measurements made on manufacturing plant.
Using the XNC algorithm, data will typically between 5-15% of the size
of a comparable CSV file.
The XNC algorithm has been available on Sourceforge as a work
in
progress under the Mozilla
Public Licence 1.1
Comparison with other compression algorithms
A comparison with other compression algorithms against uncompressed CSV
is shown below. The compression tests where carried out against this
file.
Algorithm
|
Size
as a percentage
of uncompressed CSV
|
Uncompressed CSV
|
100%
|
WinZip
|
32%
|
7Zip
|
22%
|
XNC
|
15%
|
The document Compression ratios
contains more details.
Downloads
The XNC source code is available here:
Using a Subversion client
client: http://sourceforge.net/svn/?group_id=184976
Using a browser: http://xnc.svn.sourceforge.net/viewvc/xnc/
A packaged download containing the source and compiled EXEs is
available here:
http://sourceforge.net/project/showfiles.php?group_id=184976
Discussion of the compression algorithm
Example data
Throughout this paper, we will refer to the following sample data.
DATE,TIME,VOLT_AMPL,VOLT_ANGLE
38888,28688.800725,62815.170938,145.487718
38888,28688.820725,62821.990577,144.713594
38888,28688.840725,62824.107634,143.929042
38888,28688.860725,62822.000127,143.133750
38888,28688.880725,62827.696122,143.933594
Where
DATE |
is the number of
days since 30-DEC-1899 |
TIME |
is the number of
seconds past midnight |
VOLT_AMPL |
is the voltage
amplitude |
VOLT_ANGLE |
is the voltage
angle |
The sample fragment, excluding header has 5 rows of 52 characters
requiring 260 bytes.
The sample
file, available on Sourceforge as part of the download has
about 30,000 rows and is about 1.3Mb in size.
Four space saving techniques contribute to the XNC algorithm:
- Store numbers as binary, not string;
- Store differences between values in a series, not the absolute
values;
- Store as integers of the minimum precision; and
- Store multiple small integers in a single byte.
These techniques are discussed in detail below.
Space saving technique #1. Store numbers as binary data, not string
The first space saving technique is to store numbers in binary form,
rather than as strings.
Here is a fragment of the example data:
38888,28688.800725,62815.170938,145.487718
There are four fields of data:
Field #1: 5
characters;
Field #2: 12
characters;
Field #3: 12
characters;
Field #4: 10
characters
That's 39 characters, plus three commas and a carriage return +
line
feed pair at the end of the line.
A total of 44 characters or bytes per line.
Now, most compilers allow for a range of numeric types. For example:
Type
|
Range
|
Format
|
Byte |
0..255 |
Unsigned 8-bit |
Word |
0..65535 |
Unsigned 16-bit |
Longword |
0..4294967295 |
Unsigned 32-bit |
Shortint |
-128..127 |
Signed 8-bit |
Smallint |
-32768..32767 |
Signed 16-bit |
Longint |
-2147483648..2147483647 |
Signed 32-bit |
Int64 |
-2^63..2^63-1 |
Signed 64-bit |
The first value from the sample data file (38888) will fit within the
limits of SmallInt
requiring only 16 bits, or 2 bytes of space.
The next two values (28688.800725,62815.170938), when multiplied by
1000000 will fit within the limits of Int64, requiring 64 bits or 8
bytes of space.
The final value (145.487718), when multiplied by 1000000 will fit
within the limits of LongInt, requiring 32 bits or 4 bytes of space.
If the separating commas and carriage return + line feed pairs are
omitted the space required per line is:
2 + 8 x 2 + 4 = 22 bytes
In summary - storing as binary
rather than string
Storing the data below
38888,28688.800725,62815.170938,145.487718
As a string requires 44 bytes.
Storing the same data in binary form requires 22 bytes
Or 22 x 5 = 110 bytes for the five row fragment above.
An example of the Pascal source code to write numbers in binary format
is here.
Space saving technique #2. Store differences between values in a
series, not the absolute values
The next space saving technique is to store differences between rows in
a series rather than the absolute values.
Looking at each column of the sample data in turn:
Column #1. Date
Absolute
values |
x
1000000 |
Differences |
38888 |
38888000000 |
|
38888 |
38888000000 |
0 |
38888 |
38888000000 |
0 |
38888 |
38888000000 |
0 |
38888 |
38888000000 |
0 |
These differences are within the range of the byte data type (0..255)
so we can store the DATE
column like this:
First record (must store the absolute value): 2 bytes
Remaining four records: (4 x 1) byte
Total for Column #1 = 2 + (4 x 1) = 6 bytes
Column #2. Time
Absolute
values |
x
1000000 |
Differences |
28688.80073 |
28688800725 |
|
28688.82073 |
28688820725 |
20000 |
28688.84073 |
28688840725 |
20000 |
28688.86073 |
28688860725 |
20000 |
28688.88073 |
28688880725 |
20000 |
These differences are within the range of the SmallInt data type
(-32768..32767) so we can store the
TIME column like this:
First record (must store the absolute value): 8 bytes
Remaining four records: (4 x 2) byte
Total for Column #2 = 8 + (4 x 2) = 16 bytes
Column #3. VOLT_AMPL
Absolute
values |
x
1000000 |
Differences |
62815.17094 |
62815170938 |
|
62821.99058 |
62821990577 |
6819639 |
62824.10763 |
62824107634 |
2117057 |
62822.00013 |
62822000127 |
-2107507 |
62827.69612 |
62827696122 |
5695995 |
These differences are within the range of the LongInt data type
(-2147483648..2147483647) so we can store the
VOLT_AMPL column like this:
First record (must store the absolute value): 8 bytes
Remaining four records: (4 x 4) byte
Total for Column #2 = 8 + (4 x 4) = 24 bytes
Column #4 VOLT_ANGLE
Absolute
values |
x
1000000 |
Differences |
145.487718 |
145487718 |
|
144.713594 |
144713594 |
-774124 |
143.929042 |
143929042 |
-784552 |
143.929042 |
143929042 |
0 |
143.933594 |
143933594 |
4552 |
These differences are within the range of the LongInt data type
(-2147483648..2147483647) so we can store the
VOLT_ANGLE column like this:
First record (must store the absolute value): 8 bytes
Remaining four records: (4 x 4) byte
Total for Column #2 = 8 + (4 x 4) = 24 bytes
In summary - Storing differences
Column #1
Date:
6 bytes
Column #2
Time:
16 bytes
Column #3 Volt Amplitude: 24 bytes
Column #4 Volt
Angle
24 bytes
Total
70 bytes
Storing differences as binary, rather than absolute values reduces the
size of the sample fragment from 110 bytes to 70 bytes
Space saving technique #3.
Store as integers of the minimum
precision
The next space saving technique is to ensure numbers are always stored
as an integer type with the lowest possible precision (or range).
Consider the sample VOLT_ANGLE series:
Absolute
values |
x
1000000 |
Differences |
145.487718 |
145487718 |
|
144.713594 |
144713594 |
-774124 |
143.929042 |
143929042 |
-784552 |
143.929042 |
143929042 |
0 |
143.933594 |
143933594 |
4552 |
When storing differences using LongInt precision (signed 32 bit
integer), we managed to squeeze this down to 24 bytes:
First record (must store the absolute value): 8 bytes
Remaining four records: 4 x 4 byte
We can do better than this by mixing the integer types we used to store
the differences.
The first record is a positive integer that fits within the limits of
the LongWord data type: 4 bytes
The next two records don't change and still require: 2 x 4 bytes
The fourth record fits within the limit of the Byte data type: 1 x 1
byte
The last record fits within the limit of the Word data type: 2 x 2
bytes
This gives a size reduction from 24 to 15 bytes.
The size reduction from this technique reduce if the magnitude of the
differences is highly variable as some space must be used to store
information
about the size of the next record. This is discussed more in the
section on the "Control Byte".
For example, the sequence 0, 2, 3, 6, 9 will compress well using this
technique. The sequence 0, 9823, 3, 5, 10000 will not.
Space saving technique #4.
Store multiple small integers in a
single
byte
Where differences are very small, it may be possible to squeeze several
numbers into a single byte.
Consider the contrived series:
Absolute
values |
x
1000000 |
Differences |
145.487718 |
145487718 |
|
145.487719 |
145487719 |
1 |
145.487719 |
145487719 |
0 |
145.487718 |
145487718 |
-1 |
145.487717 |
145487717 |
-1 |
The values 1, 0 and -1 can all be stored as ShortInt giving
space
requirements;
First record (must store the absolute value): 8 bytes
Remaining 4 records fit within the limits of ShortInt: 1 x 4 bytes
Giving a size of 12 bytes.
Implementation of 8, 4 & 2 bit integers
8 bit integers
Now, a signed, one byte integer can store values between -128
and 127.
This can be shown as follows:
28 = 256
To make allowance for negative numbers
28 / 2 = -128 to 128
To make allowance for zero
-128 to 127
So, the range of possible values that can be store in an 8 bit integer
is:
Low value = -28 / 2 = -128
High value = 28 / 2 - 1 = 127
-128..127
4 Bit Integers
If we allow four bits instead of eight per integer, we can squeeze two
integers into a single byte.
So, the range of possible values that can be store in an 4 bit integer
is:
Low value = -24 / 2 = -8
High value = 24 / 2 - 1 = 7
-8..7
2 Bit Integers
If we allow two bits instead of eight per integer, we can squeeze four
integers into a single byte.
So, the range of possible values that can be store in an 2 bit integer
is:
Low value = -22 / 2 = -2
High value = 22 / 2 - 1 = 1
-2..1
In summary, using two bit integers to
store small differences rather
than 8 bit integers.
Space before compression: 12 bytes
Space after compression:
First record (must store the absolute value): 8 bytes
Remaining 4 records fit within the limits of 2 bit integers, with
all
four fitting into a single byte: 1 x 1 bytes
Giving a size after compression of 9 bytes.
The XNC File Structure
The diagram below shows the XNC file structure.
There are three blocks to an XNC file:
- The series count block. The first two
bytes contains a count of the number of series in the file;
- The header block. Each series has a
header describing the series name, it's number of decimal places
and the number of bytes occupied by the series data;
- The data block. Each series' data is stored in a
continuous block.
The series count block
The series count block is a two byte, unsigned integer (integers
between 0..65535)
The header block
Looking at the sample data:
DATE,TIME,VOLT_AMPL,VOLT_ANGLE
38888,28688.800725,62815.170938,145.487718
38888,28688.820725,62821.990577,144.713594
38888,28688.840725,62824.107634,143.929042
38888,28688.860725,62822.000127,143.133750
38888,28688.880725,62827.696122,143.933594
Each series (or column) has a header and a body. The data in bold
corresponds to the TIME series.
The diagram below shows the structure of the series header. There is
one series header for each series.
For the TIME series, the series header would look like this:
The data block
The data block contains a repeating sequence of:
- Control byte
- Record data #1
- Record data #n
This is illustrated below:
The control byte contains two pieces of information packed into a
single byte:
- Type of integer packed into 3 bits (integers from 0..7)
representing one of the values 2Bit, 4Bit, 8Bit, 16Bit, 32Bit or 64Bit;
- Number of records of this type, packed into 5 bits (integers from
1..32)
The structure of the control byte is shown below:
Looking again at the TIME series data:
TIME
|
TIME
x 1,000,000
|
Difference
|
28688.80073
|
2,868,880,073 |
|
28688.82073 |
2,868,882,073 |
20,000
|
28688.84073 |
2,868,884,073 |
20,000 |
28688.86073 |
2,868,886,073 |
20,000 |
28688.88073 |
2,868,888,073 |
20,000 |
The first value of 2,868,880,073 must be stored as a 64 bit integer
(requiring 6 bytes)
Values 2..5 can be stored as differences of 20,000 as 16bit integers
(requiring 4 bytes each)
The size of the data block will be:
(1 x 6) + (4 x 4) + (2 x 1) = 24 bytes
The diagram below shows the structure of the TIME series data:
Implementation in Pascal
The Pascal implementation has been tested under BDS2006 (Delphi) and
should compile (but has not been tested) under FreePascal.
The main compression and decompression classes are shown here:
Compression Decompression.
Implementation in other languages
The XNC algorithm will be implemented in C#, but this work is not yet
complete.
Future enhancements
- At times the differences between records in a series are the
same. In this
case, the Control Byte should be modified to store a count of the
number of records that are the same, rather than a value for each
record;
- The differences between records in a series may be within a
finite set.
When this is the case, an list of the differences could be stored in
the series header with references to this index being stored for each
record rather than the actual differences;
- There should be some XNC algorithm version info stored in the
file
header;
- Add functionality to limit the resolution of values being stored,
say to 16bit. Tiw will be of use when compressing data from an analogue
to digital converter where the precision of the conversion hardware is
known.
References
Peter Lindstrom, Fast and Efficient Compression of Floating-point Data.
http://www-static.cc.gatech.edu/~lindstro/papers/floatzip/paper.pdf
Vadim Engelson, Dag Fritzson, Peter Fritzson. Lossless Compression of
High-volume Numerical Data from Simulations. http://www.ida.liu.se/~vaden/paper/compr/report.pdf