Lossless Compression of High Volume Numeric Data
(Extreme Numeric Compression - XNC)


January 2006

Introduction

High volume numeric data may be generated as a result of scientific simulations, or engineering plant monitoring. The "Extreme Numeric Compression" (XNC) algorithm has been developed for the storage high frequency measurements made on manufacturing plant.

Using the XNC algorithm, data will typically between 5-15% of the size of a comparable CSV file.

The XNC algorithm has been available on Sourceforge as a work in progress under the Mozilla Public Licence 1.1

Comparison with other compression algorithms


A comparison with other compression algorithms against uncompressed CSV is shown below. The compression tests where carried out against this file.

Algorithm
Size as a percentage
of uncompressed CSV
Uncompressed CSV
100%
WinZip
32%
7Zip
22%
XNC
15%



The document Compression ratios contains more details.

Downloads


The XNC source code is available here:

Using a Subversion client client: http://sourceforge.net/svn/?group_id=184976

Using a browser: http://xnc.svn.sourceforge.net/viewvc/xnc/

A packaged download containing the source and compiled EXEs is available here:

http://sourceforge.net/project/showfiles.php?group_id=184976

Discussion of the compression algorithm

Example data

Throughout this paper, we will refer to the following sample data.


DATE,TIME,VOLT_AMPL,VOLT_ANGLE
38888,28688.800725,62815.170938,145.487718
38888,28688.820725,62821.990577,144.713594
38888,28688.840725,62824.107634,143.929042
38888,28688.860725,62822.000127,143.133750
38888,28688.880725,62827.696122,143.933594


Where

DATE is the number of days since 30-DEC-1899
TIME is the number of seconds past midnight
VOLT_AMPL is the voltage amplitude
VOLT_ANGLE is the voltage angle
    

The sample fragment, excluding header has 5 rows of 52 characters requiring 260 bytes.

The sample file, available on Sourceforge as part of the download has about 30,000 rows and is about 1.3Mb in size.

Four space saving techniques contribute to the XNC algorithm:
  1. Store numbers as binary, not string;
  2. Store differences between values in a series, not the absolute values;
  3. Store as integers of the minimum precision; and
  4. Store multiple small integers in a single byte.

These techniques are discussed in detail below.

Space saving technique #1. Store numbers as binary data, not string


The first space saving technique is to store numbers in binary form, rather than as strings.

Here is a fragment of the example data:


38888,28688.800725,62815.170938,145.487718


There are four fields of data:

Field #1:          5 characters;
Field #2:          12 characters;
Field #3:          12 characters;
Field #4:          10 characters

That's 39 characters, plus three commas and a carriage return + line feed pair at the end of the line.

A total of 44 characters or bytes per line.

Now, most compilers allow for a range of numeric types. For example:

Type
Range
Format
Byte 0..255 Unsigned 8-bit
Word 0..65535 Unsigned 16-bit
Longword 0..4294967295 Unsigned 32-bit
Shortint -128..127 Signed 8-bit
Smallint -32768..32767 Signed 16-bit
Longint -2147483648..2147483647 Signed 32-bit
Int64 -2^63..2^63-1 Signed 64-bit

The first value from the sample data file (38888) will fit within the limits of SmallInt requiring only 16 bits, or 2 bytes of space.

The next two values (28688.800725,62815.170938), when multiplied by 1000000 will fit within the limits of Int64, requiring 64 bits or 8 bytes of space.

The final value (145.487718), when multiplied by 1000000 will fit within the limits of LongInt, requiring 32 bits or 4 bytes of space.

If the separating commas and carriage return + line feed pairs are omitted the space required per line is:

2 + 8 x 2 + 4 = 22 bytes


In summary - storing as binary rather than string

Storing the data below

38888,28688.800725,62815.170938,145.487718

As a string requires 44 bytes.

Storing the same data in binary form requires 22 bytes

Or 22 x 5 = 110 bytes for the five row fragment above.



An example of the Pascal source code to write numbers in binary format is here.

Space saving technique #2. Store differences between values in a series, not the absolute values


The next space saving technique is to store differences between rows in a series rather than the absolute values.

Looking at each column of the sample data in turn:

Column #1. Date


Absolute values x 1000000 Differences
38888 38888000000
38888 38888000000 0
38888 38888000000 0
38888 38888000000 0
38888 38888000000 0

These differences are within the range of the byte data type (0..255) so we can store the DATE column like this:

First record (must store the absolute value): 2 bytes
Remaining four records: (4 x 1) byte

Total for Column #1 = 2 + (4 x 1) = 6 bytes

Column #2. Time


Absolute values x 1000000 Differences
28688.80073 28688800725
28688.82073 28688820725 20000
28688.84073 28688840725 20000
28688.86073 28688860725 20000
28688.88073 28688880725 20000

These differences are within the range of the SmallInt data type (-32768..32767) so we can store the TIME column like this:
 
First record (must store the absolute value): 8 bytes
Remaining four records: (4 x 2) byte

Total for Column #2 = 8 + (4 x 2) = 16 bytes

Column #3. VOLT_AMPL  


Absolute values x 1000000 Differences
62815.17094 62815170938
62821.99058 62821990577 6819639
62824.10763 62824107634 2117057
62822.00013 62822000127 -2107507
62827.69612 62827696122 5695995

These differences are within the range of the LongInt data type (-2147483648..2147483647) so we can store the VOLT_AMPL column like this:
 
First record (must store the absolute value): 8 bytes
Remaining four records: (4 x 4) byte

Total for Column #2 = 8 + (4 x 4) = 24 bytes

Column #4 VOLT_ANGLE

 
Absolute values x 1000000 Differences
145.487718 145487718
144.713594 144713594 -774124
143.929042 143929042 -784552
143.929042 143929042 0
143.933594 143933594 4552

These differences are within the range of the LongInt data type (-2147483648..2147483647) so we can store the VOLT_ANGLE column like this:

First record (must store the absolute value): 8 bytes
Remaining four records: (4 x 4) byte

Total for Column #2 = 8 + (4 x 4) = 24 bytes


In summary - Storing differences
 
Column #1 Date:                     6 bytes
Column #2 Time:                    16 bytes
Column #3 Volt Amplitude:     24 bytes
Column #4 Volt Angle             24 bytes

Total                                        70 bytes

Storing differences as binary, rather than absolute values reduces the size of the sample fragment from 110 bytes to 70 bytes


Space saving technique #3. Store as integers of the minimum precision


The next space saving technique is to ensure numbers are always stored as an integer type with the lowest possible precision (or range).

Consider the sample VOLT_ANGLE series:

Absolute values x 1000000 Differences
145.487718 145487718
144.713594 144713594 -774124
143.929042 143929042 -784552
143.929042 143929042 0
143.933594 143933594 4552

When storing differences using LongInt precision (signed 32 bit integer), we managed to squeeze this down to 24 bytes:

First record (must store the absolute value): 8 bytes

Remaining four records: 4 x 4 byte
 
We can do better than this by mixing the integer types we used to store the differences.
 
The first record is a positive integer that fits within the limits of the LongWord data type: 4 bytes

The next two records don't change and still require: 2 x 4 bytes

The fourth record fits within the limit of the Byte data type: 1 x 1 byte

The last record fits within the limit of the Word data type: 2 x 2 bytes 

This gives a size reduction from 24 to 15 bytes.

The size reduction from this technique reduce if the magnitude of the differences is highly variable as some space must be used to store information about the size of the next record. This is discussed more in the section on the "Control Byte".

For example, the sequence 0, 2, 3, 6, 9 will compress well using this technique. The sequence 0, 9823, 3, 5, 10000 will not.

Space saving technique #4. Store multiple small integers in a single byte

Where differences are very small, it may be possible to squeeze several numbers into a single byte.

Consider the contrived series:
 
Absolute values x 1000000 Differences
145.487718 145487718
145.487719 145487719 1
145.487719 145487719 0
145.487718 145487718 -1
145.487717 145487717 -1


The values 1, 0 and -1 can all be stored as ShortInt giving space requirements;

First record (must store the absolute value): 8 bytes
Remaining 4 records fit within the limits of ShortInt: 1 x 4 bytes

Giving a size of 12 bytes.

Implementation of 8, 4 & 2 bit integers

8 bit integers
Now, a signed, one byte integer can store values between -128 and 127.
 
This can be shown as follows:
 
28 =  256 

To make allowance for negative numbers 

28 / 2 =  -128 to 128 

To make allowance for zero 

-128 to 127 

So, the range of possible values that can be store in an 8 bit integer is:

Low value = -28 / 2 = -128

High value = 28 / 2 - 1 = 127

-128..127
 
4 Bit Integers

If we allow four bits instead of eight per integer, we can squeeze two integers into a single byte. 

So, the range of possible values that can be store in an 4 bit integer is:

Low value = -24 / 2 = -8

High value = 24 / 2 - 1 = 7

-8..7

2 Bit Integers

If we allow two bits instead of eight per integer, we can squeeze four integers into a single byte.
 
So, the range of possible values that can be store in an 2 bit integer is:
 
Low value = -22 / 2 = -2

High value = 22 / 2 - 1 = 1

-2..1


In summary, using two bit integers to store small differences rather than 8 bit integers.

Space before compression:   12 bytes

Space after compression:

First record (must store the absolute value): 8 bytes
Remaining 4 records fit within the limits of 2 bit integers, with all four fitting into a single byte: 1 x 1 bytes

Giving a size after compression of 9 bytes.


The XNC File Structure


The diagram below shows the XNC file structure.



There are three blocks to an XNC file:
  1. The series count block. The first two bytes contains a count of the number of series in the file;
  2. The header block. Each series has a header describing the series name, it's number of decimal places and  the number of bytes occupied by the series data;
  3. The data block. Each series' data is stored in a continuous block.

The series count block

The series count block is a two byte, unsigned integer (integers between 0..65535)

The header block

Looking at the sample data:


DATE,TIME,VOLT_AMPL,VOLT_ANGLE
38888,28688.800725,62815.170938,145.487718
38888,28688.820725,62821.990577,144.713594
38888,28688.840725,62824.107634,143.929042
38888,28688.860725,62822.000127,143.133750
38888,28688.880725,62827.696122,143.933594


Each series (or column) has a header and a body. The data in bold corresponds to the TIME series.

The diagram below shows the structure of the series header. There is one series header for each series.



For the TIME series, the series header would look like this:


The data block

The data block contains a repeating sequence of:
  1. Control byte
  2. Record data #1
  3. Record data #n
This is illustrated below:



The control byte contains two pieces of information packed into a single byte:
  1. Type of integer packed into 3 bits (integers from 0..7) representing one of the values 2Bit, 4Bit, 8Bit, 16Bit, 32Bit or 64Bit;
  2. Number of records of this type, packed into 5 bits (integers from 1..32)
The structure of the control byte is shown below:



Looking again at the TIME series data:

TIME
TIME x 1,000,000
Difference
28688.80073
2,868,880,073
28688.82073 2,868,882,073 20,000
28688.84073 2,868,884,073 20,000
28688.86073 2,868,886,073 20,000
28688.88073 2,868,888,073 20,000

The first value of 2,868,880,073 must be stored as a 64 bit integer (requiring 6 bytes)

Values 2..5 can be stored as differences of 20,000 as 16bit integers (requiring 4 bytes each)

The size of the data block will be:

(1 x 6) + (4 x 4) + (2 x 1) = 24 bytes

The diagram below shows the structure of the TIME series data:


Implementation in Pascal


The Pascal implementation has been tested under BDS2006 (Delphi) and should compile (but has not been tested) under FreePascal.

The main compression and decompression classes are shown here: Compression Decompression.

Implementation in other languages


The XNC algorithm will be implemented in C#, but this work is not yet complete.

Future enhancements

  1. At times the differences between records in a series are the same. In this case, the Control Byte should be modified to store a count of the number of records that are the same, rather than a value for each record;

  2. The differences between records in a series may be within a finite set. When this is the case, an list of the differences could be stored in the series header with references to this index being stored for each record rather than the actual differences;

  3. There should be some XNC algorithm version info stored in the file header;

  4. Add functionality to limit the resolution of values being stored, say to 16bit. Tiw will be of use when compressing data from an analogue to digital converter where the precision of the conversion hardware is known.

References


Peter Lindstrom, Fast and Efficient Compression of Floating-point Data. http://www-static.cc.gatech.edu/~lindstro/papers/floatzip/paper.pdf

Vadim Engelson, Dag Fritzson, Peter Fritzson. Lossless Compression of High-volume Numerical Data from Simulations. http://www.ida.liu.se/~vaden/paper/compr/report.pdf