Lossless Compression of High Volume Numeric Data
(Extreme Numeric Compression - XNC)

January 2006

Introduction

High volume numeric data may be generated as a result of scientific simulations, or engineering plant monitoring. The "Extreme Numeric Compression" (XNC) algorithm has been developed for the storage high frequency measurements made on manufacturing plant.

Using the XNC algorithm, data will typically between 5-15% of the size of a comparable CSV file.

The XNC algorithm has been available on Sourceforge as a work in progress under the Mozilla Public Licence 1.1

Comparison with other compression algorithms

A comparison with other compression algorithms against uncompressed CSV is shown below. The compression tests where carried out against this file.

Algorithm	Size as a percentage of uncompressed CSV
Uncompressed CSV	100%
WinZip	32%
7Zip	22%
XNC	15%

The document Compression ratios contains more details.

Downloads

The XNC source code is available here:

Using a Subversion client client: http://sourceforge.net/svn/?group_id=184976

Using a browser: http://xnc.svn.sourceforge.net/viewvc/xnc/

A packaged download containing the source and compiled EXEs is available here:

http://sourceforge.net/project/showfiles.php?group_id=184976

Discussion of the compression algorithm

Example data

Throughout this paper, we will refer to the following sample data.


DATE,TIME,VOLT_AMPL,VOLT_ANGLE

38888,28688.800725,62815.170938,145.487718

38888,28688.820725,62821.990577,144.713594

38888,28688.840725,62824.107634,143.929042

38888,28688.860725,62822.000127,143.133750

38888,28688.880725,62827.696122,143.933594

Where

DATE	is the number of days since 30-DEC-1899
TIME	is the number of seconds past midnight
VOLT_AMPL	is the voltage amplitude
VOLT_ANGLE	is the voltage angle

The sample fragment, excluding header has 5 rows of 52 characters requiring 260 bytes.

The sample file, available on Sourceforge as part of the download has about 30,000 rows and is about 1.3Mb in size.

Four space saving techniques contribute to the XNC algorithm:

Store numbers as binary, not string;
Store differences between values in a series, not the absolute values;
Store as integers of the minimum precision; and
Store multiple small integers in a single byte.

These techniques are discussed in detail below.

Space saving technique #1. Store numbers as binary data, not string

The first space saving technique is to store numbers in binary form, rather than as strings.

Here is a fragment of the example data:


38888,28688.800725,62815.170938,145.487718

There are four fields of data:

Field #1:          5 characters;
Field #2:          12 characters;
Field #3:          12 characters;
Field #4:          10 characters

That's 39 characters, plus three commas and a carriage return + line feed pair at the end of the line.

A total of 44 characters or bytes per line.

Now, most compilers allow for a range of numeric types. For example:

Type	Range	Format
Byte	0..255	Unsigned 8-bit
Word	0..65535	Unsigned 16-bit
Longword	0..4294967295	Unsigned 32-bit
Shortint	-128..127	Signed 8-bit
Smallint	-32768..32767	Signed 16-bit
Longint	-2147483648..2147483647	Signed 32-bit
Int64	-2^63..2^63-1	Signed 64-bit

The first value from the sample data file (38888) will fit within the limits of SmallInt requiring only 16 bits, or 2 bytes of space.

The next two values (28688.800725,62815.170938), when multiplied by 1000000 will fit within the limits of Int64, requiring 64 bits or 8 bytes of space.

The final value (145.487718), when multiplied by 1000000 will fit within the limits of LongInt, requiring 32 bits or 4 bytes of space.

If the separating commas and carriage return + line feed pairs are omitted the space required per line is:

2 + 8 x 2 + 4 = 22 bytes

In summary - storing as binary rather than string

Storing the data below

38888,28688.800725,62815.170938,145.487718

As a string requires 44 bytes.

Storing the same data in binary form requires 22 bytes

Or 22 x 5 = 110 bytes for the five row fragment above.

An example of the Pascal source code to write numbers in binary format is here.

Space saving technique #2. Store differences between values in a series, not the absolute values

The next space saving technique is to store differences between rows in a series rather than the absolute values.

Looking at each column of the sample data in turn:

Column #1. Date

Absolute values	x 1000000	Differences
38888	38888000000
38888	38888000000	0
38888	38888000000	0
38888	38888000000	0
38888	38888000000	0

These differences are within the range of the byte data type (0..255) so we can store the DATE column like this:

First record (must store the absolute value): 2 bytes
Remaining four records: (4 x 1) byte

Total for Column #1 = 2 + (4 x 1) = 6 bytes

Column #2. Time

Absolute values	x 1000000	Differences
28688.80073	28688800725
28688.82073	28688820725	20000
28688.84073	28688840725	20000
28688.86073	28688860725	20000
28688.88073	28688880725	20000

These differences are within the range of the SmallInt data type (-32768..32767) so we can store the TIME column like this:

First record (must store the absolute value): 8 bytes
Remaining four records: (4 x 2) byte

Total for Column #2 = 8 + (4 x 2) = 16 bytes

Column #3. VOLT_AMPL

Absolute values	x 1000000	Differences
62815.17094	62815170938
62821.99058	62821990577	6819639
62824.10763	62824107634	2117057
62822.00013	62822000127	-2107507
62827.69612	62827696122	5695995

These differences are within the range of the LongInt data type (-2147483648..2147483647) so we can store the VOLT_AMPL column like this:

First record (must store the absolute value): 8 bytes
Remaining four records: (4 x 4) byte

Total for Column #2 = 8 + (4 x 4) = 24 bytes

Column #4 VOLT_ANGLE

Absolute values	x 1000000	Differences
145.487718	145487718
144.713594	144713594	-774124
143.929042	143929042	-784552
143.929042	143929042	0
143.933594	143933594	4552

These differences are within the range of the LongInt data type (-2147483648..2147483647) so we can store the VOLT_ANGLE column like this:

First record (must store the absolute value): 8 bytes
Remaining four records: (4 x 4) byte

Total for Column #2 = 8 + (4 x 4) = 24 bytes

In summary - Storing differences

Column #1 Date: 6 bytes

Column #2 Time: 16 bytes

Column #3 Volt Amplitude: 24 bytes

Column #4 Volt Angle 24 bytes

Total 70 bytes

Storing differences as binary, rather than absolute values reduces the size of the sample fragment from 110 bytes to 70 bytes

Space saving technique #3. Store as integers of the minimum precision

The next space saving technique is to ensure numbers are always stored as an integer type with the lowest possible precision (or range).

Consider the sample VOLT_ANGLE series:

Absolute values	x 1000000	Differences
145.487718	145487718
144.713594	144713594	-774124
143.929042	143929042	-784552
143.929042	143929042	0
143.933594	143933594	4552

When storing differences using LongInt precision (signed 32 bit integer), we managed to squeeze this down to 24 bytes:

First record (must store the absolute value): 8 bytes

Remaining four records: 4 x 4 byte

We can do better than this by mixing the integer types we used to store the differences.

The first record is a positive integer that fits within the limits of the LongWord data type: 4 bytes

The next two records don't change and still require: 2 x 4 bytes

The fourth record fits within the limit of the Byte data type: 1 x 1 byte

The last record fits within the limit of the Word data type: 2 x 2 bytes

This gives a size reduction from 24 to 15 bytes.

The size reduction from this technique reduce if the magnitude of the differences is highly variable as some space must be used to store information about the size of the next record. This is discussed more in the section on the "Control Byte".

For example, the sequence 0, 2, 3, 6, 9 will compress well using this technique. The sequence 0, 9823, 3, 5, 10000 will not.

Space saving technique #4. Store multiple small integers in a single byte

Where differences are very small, it may be possible to squeeze several numbers into a single byte.

Consider the contrived series:

Absolute values	x 1000000	Differences
145.487718	145487718
145.487719	145487719	1
145.487719	145487719	0
145.487718	145487718	-1
145.487717	145487717	-1

The values 1, 0 and -1 can all be stored as ShortInt giving space requirements;

First record (must store the absolute value): 8 bytes
Remaining 4 records fit within the limits of ShortInt: 1 x 4 bytes

Giving a size of 12 bytes.

Implementation of 8, 4 & 2 bit integers

8 bit integers

Now, a signed, one byte integer can store values between -128 and 127.

This can be shown as follows:

2⁸ = 256

To make allowance for negative numbers

2⁸ / 2 = -128 to 128

To make allowance for zero

-128 to 127

So, the range of possible values that can be store in an 8 bit integer is:

Low value = -2⁸ / 2 = -128

High value = 2⁸ / 2 - 1 = 127

-128..127

4 Bit Integers

If we allow four bits instead of eight per integer, we can squeeze two integers into a single byte.

So, the range of possible values that can be store in an 4 bit integer is:

Low value = -2⁴ / 2 = -8

High value = 2⁴ / 2 - 1 = 7

-8..7

2 Bit Integers

If we allow two bits instead of eight per integer, we can squeeze four integers into a single byte.

So, the range of possible values that can be store in an 2 bit integer is:

Low value = -2² / 2 = -2

High value = 2² / 2 - 1 = 1

-2..1

In summary, using two bit integers to store small differences rather than 8 bit integers.

Space before compression: 12 bytes

Space after compression:

First record (must store the absolute value): 8 bytes

Remaining 4 records fit within the limits of 2 bit integers, with all four fitting into a single byte: 1 x 1 bytes

Giving a size after compression of 9 bytes.

The XNC File Structure

The diagram below shows the XNC file structure.

There are three blocks to an XNC file:

The series count block. The first two bytes contains a count of the number of series in the file;
The header block. Each series has a header describing the series name, it's number of decimal places and the number of bytes occupied by the series data;
The data block. Each series' data is stored in a continuous block.

The series count block

The series count block is a two byte, unsigned integer (integers between 0..65535)

The header block

Looking at the sample data:


DATE,TIME,VOLT_AMPL,VOLT_ANGLE

38888,28688.800725,62815.170938,145.487718

38888,28688.820725,62821.990577,144.713594

38888,28688.840725,62824.107634,143.929042

38888,28688.860725,62822.000127,143.133750

38888,28688.880725,62827.696122,143.933594

Each series (or column) has a header and a body. The data in bold corresponds to the TIME series.

The diagram below shows the structure of the series header. There is one series header for each series.

For the TIME series, the series header would look like this:

The data block

The data block contains a repeating sequence of:

Control byte
Record data #1
Record data #n

This is illustrated below:

The control byte contains two pieces of information packed into a single byte:

Type of integer packed into 3 bits (integers from 0..7) representing one of the values 2Bit, 4Bit, 8Bit, 16Bit, 32Bit or 64Bit;
Number of records of this type, packed into 5 bits (integers from 1..32)

The structure of the control byte is shown below:

Looking again at the TIME series data:

TIME	TIME x 1,000,000	Difference
28688.80073	2,868,880,073
28688.82073	2,868,882,073	20,000
28688.84073	2,868,884,073	20,000
28688.86073	2,868,886,073	20,000
28688.88073	2,868,888,073	20,000

The first value of 2,868,880,073 must be stored as a 64 bit integer (requiring 6 bytes)

Values 2..5 can be stored as differences of 20,000 as 16bit integers (requiring 4 bytes each)

The size of the data block will be:

(1 x 6) + (4 x 4) + (2 x 1) = 24 bytes

The diagram below shows the structure of the TIME series data:

Implementation in Pascal

The Pascal implementation has been tested under BDS2006 (Delphi) and should compile (but has not been tested) under FreePascal.

The main compression and decompression classes are shown here: Compression Decompression.

Implementation in other languages

The XNC algorithm will be implemented in C#, but this work is not yet complete.

Future enhancements

At times the differences between records in a series are the same. In this case, the Control Byte should be modified to store a count of the number of records that are the same, rather than a value for each record;
The differences between records in a series may be within a finite set. When this is the case, an list of the differences could be stored in the series header with references to this index being stored for each record rather than the actual differences;
There should be some XNC algorithm version info stored in the file header;
Add functionality to limit the resolution of values being stored, say to 16bit. Tiw will be of use when compressing data from an analogue to digital converter where the precision of the conversion hardware is known.

References

Peter Lindstrom, Fast and Efficient Compression of Floating-point Data. http://www-static.cc.gatech.edu/~lindstro/papers/floatzip/paper.pdf

Vadim Engelson, Dag Fritzson, Peter Fritzson. Lossless Compression of High-volume Numerical Data from Simulations. http://www.ida.liu.se/~vaden/paper/compr/report.pdf

Lossless Compression of High Volume Numeric Data (Extreme Numeric Compression - XNC)