(Extreme Numeric Compression - XNC)

January 2006

Using the XNC algorithm, data will typically between 5-15% of the size of a comparable CSV file.

The XNC algorithm has been available on Sourceforge as a work in progress under the Mozilla Public Licence 1.1

A comparison with other compression algorithms against uncompressed CSV is shown below. The compression tests where carried out against this file.

Algorithm |
Size
as a percentage of uncompressed CSV |

Uncompressed CSV |
100% |

WinZip |
32% |

7Zip |
22% |

XNC |
15% |

The document Compression ratios contains more details.

The XNC source code is available here:

Using a Subversion client client: http://sourceforge.net/svn/?group_id=184976

Using a browser: http://xnc.svn.sourceforge.net/viewvc/xnc/

A packaged download containing the source and compiled EXEs is available here:

http://sourceforge.net/project/showfiles.php?group_id=184976

DATE,TIME,VOLT_AMPL,VOLT_ANGLE

38888,28688.800725,62815.170938,145.487718

38888,28688.820725,62821.990577,144.713594

38888,28688.840725,62824.107634,143.929042

38888,28688.860725,62822.000127,143.133750

38888,28688.880725,62827.696122,143.933594

Where

DATE | is the number of days since 30-DEC-1899 |

TIME | is the number of seconds past midnight |

VOLT_AMPL | is the voltage amplitude |

VOLT_ANGLE | is the voltage angle |

The sample fragment, excluding header has 5 rows of 52 characters requiring 260 bytes.

The sample file, available on Sourceforge as part of the download has about 30,000 rows and is about 1.3Mb in size.

Four space saving techniques contribute to the XNC algorithm:

- Store numbers as binary, not string;
- Store differences between values in a series, not the absolute values;
- Store as integers of the minimum precision; and
- Store multiple small integers in a single byte.

These techniques are discussed in detail below.

The first space saving technique is to store numbers in binary form, rather than as strings.

Here is a fragment of the example data:

38888,28688.800725,62815.170938,145.487718

There are four fields of data:

Field #1: 5 characters;

Field #2: 12 characters;

Field #3: 12 characters;

Field #4: 10 characters

That's 39 characters, plus three commas and a carriage return + line feed pair at the end of the line.

A total of 44 characters or bytes per line.

Now, most compilers allow for a range of numeric types. For example:

Type |
Range |
Format |

Byte | 0..255 | Unsigned 8-bit |

Word | 0..65535 | Unsigned 16-bit |

Longword | 0..4294967295 | Unsigned 32-bit |

Shortint | -128..127 | Signed 8-bit |

Smallint | -32768..32767 | Signed 16-bit |

Longint | -2147483648..2147483647 | Signed 32-bit |

Int64 | -2^63..2^63-1 | Signed 64-bit |

The first value from the sample data file (38888) will fit within the limits of SmallInt requiring only 16 bits, or 2 bytes of space.

The next two values (28688.800725,62815.170938), when multiplied by 1000000 will fit within the limits of Int64, requiring 64 bits or 8 bytes of space.

The final value (145.487718), when multiplied by 1000000 will fit within the limits of LongInt, requiring 32 bits or 4 bytes of space.

If the separating commas and carriage return + line feed pairs are omitted the space required per line is:

2 + 8 x 2 + 4 = 22 bytes

38888,28688.800725,62815.170938,145.487718

As a string requires 44 bytes.

Storing the same data in binary form requires 22 bytes

Or 22 x 5 = 110 bytes for the five row fragment above.

An example of the Pascal source code to write numbers in binary format is here.

The next space saving technique is to store differences between rows in a series rather than the absolute values.

Looking at each column of the sample data in turn:

Absolute values | x 1000000 | Differences |

38888 | 38888000000 | |

38888 | 38888000000 | 0 |

38888 | 38888000000 | 0 |

38888 | 38888000000 | 0 |

38888 | 38888000000 | 0 |

These differences are within the range of the byte data type (0..255) so we can store the DATE column like this:

First record (must store the absolute value): 2 bytes

Remaining four records: (4 x 1) byte

Total for Column #1 = 2 + (4 x 1) = 6 bytes

Absolute values | x 1000000 | Differences |

28688.80073 | 28688800725 | |

28688.82073 | 28688820725 | 20000 |

28688.84073 | 28688840725 | 20000 |

28688.86073 | 28688860725 | 20000 |

28688.88073 | 28688880725 | 20000 |

These differences are within the range of the SmallInt data type (-32768..32767) so we can store the TIME column like this:

First record (must store the absolute value): 8 bytes

Remaining four records: (4 x 2) byte

Total for Column #2 = 8 + (4 x 2) = 16 bytes

Absolute values | x 1000000 | Differences |

62815.17094 | 62815170938 | |

62821.99058 | 62821990577 | 6819639 |

62824.10763 | 62824107634 | 2117057 |

62822.00013 | 62822000127 | -2107507 |

62827.69612 | 62827696122 | 5695995 |

These differences are within the range of the LongInt data type (-2147483648..2147483647) so we can store the VOLT_AMPL column like this:

First record (must store the absolute value): 8 bytes

Remaining four records: (4 x 4) byte

Total for Column #2 = 8 + (4 x 4) = 24 bytes

Absolute values | x 1000000 | Differences |

145.487718 | 145487718 | |

144.713594 | 144713594 | -774124 |

143.929042 | 143929042 | -784552 |

143.929042 | 143929042 | 0 |

143.933594 | 143933594 | 4552 |

These differences are within the range of the LongInt data type (-2147483648..2147483647) so we can store the VOLT_ANGLE column like this:

First record (must store the absolute value): 8 bytes

Remaining four records: (4 x 4) byte

Total for Column #2 = 8 + (4 x 4) = 24 bytes

Total 70 bytes

Storing differences as binary, rather than absolute values reduces the size of the sample fragment from 110 bytes to 70 bytes

The next space saving technique is to ensure numbers are always stored as an integer type with the lowest possible precision (or range).

Consider the sample VOLT_ANGLE series:

Absolute values | x 1000000 | Differences |

145.487718 | 145487718 | |

144.713594 | 144713594 | -774124 |

143.929042 | 143929042 | -784552 |

143.929042 | 143929042 | 0 |

143.933594 | 143933594 | 4552 |

When storing differences using LongInt precision (signed 32 bit integer), we managed to squeeze this down to 24 bytes:

First record (must store the absolute value): 8 bytes

Remaining four records: 4 x 4 byte

We can do better than this by mixing the integer types we used to store the differences.

The first record is a positive integer that fits within the limits of the LongWord data type: 4 bytes

The next two records don't change and still require: 2 x 4 bytes

The fourth record fits within the limit of the Byte data type: 1 x 1 byte

The last record fits within the limit of the Word data type: 2 x 2 bytes

This gives a size reduction from 24 to 15 bytes.

The size reduction from this technique reduce if the magnitude of the differences is highly variable as some space must be used to store information about the size of the next record. This is discussed more in the section on the "Control Byte".

For example, the sequence 0, 2, 3, 6, 9 will compress well using this technique. The sequence 0, 9823, 3, 5, 10000 will not.

Consider the contrived series:

Absolute values | x 1000000 | Differences |

145.487718 | 145487718 | |

145.487719 | 145487719 | 1 |

145.487719 | 145487719 | 0 |

145.487718 | 145487718 | -1 |

145.487717 | 145487717 | -1 |

The values 1, 0 and -1 can all be stored as ShortInt giving space requirements;

First record (must store the absolute value): 8 bytes

Remaining 4 records fit within the limits of ShortInt: 1 x 4 bytes

Giving a size of 12 bytes.

This can be shown as follows:

2

To make allowance for negative numbers

2

To make allowance for zero

-128 to 127

So, the range of possible values that can be store in an 8 bit integer is:

Low value = -2

High value = 2

-128..127

If we allow four bits instead of eight per integer, we can squeeze two integers into a single byte.

So, the range of possible values that can be store in an 4 bit integer is:

Low value = -2

High value = 2

-8..7

If we allow two bits instead of eight per integer, we can squeeze four integers into a single byte.

So, the range of possible values that can be store in an 2 bit integer is:

Low value = -2

High value = 2

-2..1

Space before compression: 12 bytes

Space after compression:

First record (must store the absolute value): 8 bytes

Giving a size after compression of 9 bytes.

The diagram below shows the XNC file structure.

There are three blocks to an XNC file:

- The series count block. The first two bytes contains a count of the number of series in the file;
- The header block. Each series has a header describing the series name, it's number of decimal places and the number of bytes occupied by the series data;
- The data block. Each series' data is stored in a continuous block.

DATE,TIME,VOLT_AMPL,VOLT_ANGLE

`38888,28688.800725,62815.170938,145.487718`

`38888,28688.820725,62821.990577,144.713594`

`38888,28688.840725,62824.107634,143.929042`

`38888,28688.860725,62822.000127,143.133750`

`38888,28688.880725,62827.696122,143.933594`

Each series (or column) has a header and a body. The data in bold corresponds to the TIME series.

The diagram below shows the structure of the series header. There is one series header for each series.

For the TIME series, the series header would look like this:

- Control byte
- Record data #1
- Record data #n

The control byte contains two pieces of information packed into a single byte:

- Type of integer packed into 3 bits (integers from 0..7) representing one of the values 2Bit, 4Bit, 8Bit, 16Bit, 32Bit or 64Bit;
- Number of records of this type, packed into 5 bits (integers from 1..32)

Looking again at the TIME series data:

TIME |
TIME
x 1,000,000 |
Difference |

28688.80073 |
2,868,880,073 | |

28688.82073 | 2,868,882,073 | 20,000 |

28688.84073 | 2,868,884,073 | 20,000 |

28688.86073 | 2,868,886,073 | 20,000 |

28688.88073 | 2,868,888,073 | 20,000 |

The first value of 2,868,880,073 must be stored as a 64 bit integer (requiring 6 bytes)

Values 2..5 can be stored as differences of 20,000 as 16bit integers (requiring 4 bytes each)

The size of the data block will be:

(1 x 6) + (4 x 4) + (2 x 1) = 24 bytes

The diagram below shows the structure of the TIME series data:

The Pascal implementation has been tested under BDS2006 (Delphi) and should compile (but has not been tested) under FreePascal.

The main compression and decompression classes are shown here: Compression Decompression.

The XNC algorithm will be implemented in C#, but this work is not yet complete.

- At times the differences between records in a series are the
same. In this
case, the Control Byte should be modified to store a count of the
number of records that are the same, rather than a value for each
record;

- The differences between records in a series may be within a
finite set.
When this is the case, an list of the differences could be stored in
the series header with references to this index being stored for each
record rather than the actual differences;

- There should be some XNC algorithm version info stored in the
file
header;

- Add functionality to limit the resolution of values being stored,
say to 16bit. Tiw will be of use when compressing data from an analogue
to digital converter where the precision of the conversion hardware is
known.

Peter Lindstrom, Fast and Efficient Compression of Floating-point Data. http://www-static.cc.gatech.edu/~lindstro/papers/floatzip/paper.pdf

Vadim Engelson, Dag Fritzson, Peter Fritzson. Lossless Compression of High-volume Numerical Data from Simulations. http://www.ida.liu.se/~vaden/paper/compr/report.pdf