How to be Minimally Redundant (or “A Splitting Headache”)

Richard

June 11, 2013

I’ve recently been investigating the command-line utility zfec, which is a lot like the UNIX “split” utility, except, using a technique called erasure coding, you can choose to split your file into M pieces, any K of which are enough to recreate the file using the complimentary [sic] zunfec command.

Concrete example: a 100K file can be split into 10 (or more) pieces, each just over 25K long, and zunfec can recreate the original from any 4 of them.

Any 4. You might expect this to incur a lot of overhead in each piece, but it turns out it’s just a header of four bytes or less. Pretty much the same as cutting the file into pieces.

This amazed me. How could this possibly work? How can you split data into M pieces so that any K of them is enough to reconstruct them? Linear algebra to the rescue!

Suppose we want to encode a twelve-character text string into a bunch of arrays, each with four values, any three of which are sufficient to reconstruct the original. Let’s use the string “Lovefromdawn.” Here’s what you do:

$$ \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \\ \vdots & \vdots & \vdots \end{bmatrix} \begin{bmatrix} L & o & v & e \\ f & r & o & m \\ d & a & w & n \end{bmatrix} = \begin{bmatrix} L + f + d & o + r + a & v + o + w & e + m + n \\ L + 2f + 4d & o + 2r + 4a & v + 2o + 4w & e + 2m + 4n \\ L + 3f + 9d & o + 3r + 9a & v + 3o + 9w & e + 3m + 9n \\ L + 4f + 16d & o + 4r + 16a & v + 4o + 16w & e + 4m + 16n \\ L + 5f + 25d & o + 5r + 25a & v + 5o + 25w & e + 5m + 25n \\ \vdots & \vdots & \vdots & \vdots \end{bmatrix} $$

The matrix furthest to the left is a Vandermonde matrix, which is a matrix where each row forms a geometric series starting with 1. A square Vandermonde matrix has the special property that as long as the second column has no repeated elements, the matrix is invertible (and in fact, there is a special algorithm to invert it quickly).

The split pieces correspond to rows of the matrix on the right. Choose any three of them; let’s say row 1, 2 and 4. Once we know what row numbers they are (and this explains the tiny header), we get the following system of equations:

$$ \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 4 & 16 \end{bmatrix} X = \begin{bmatrix} L + f + d & o + r + a & v + o + w & e + m + n \\ L + 2f + 4d & o + 2r + 4a & v + 2o + 4w & e + 2m + 4n \\ L + 4f + 16d & o + 4r + 16a & v + 4o + 16w & e + 4m + 16n \end{bmatrix} $$

We know the leftmost matrix is invertible, so multiply both sides by the inverse, and we solve for X, recovering “Lovefromdawn”. Wowza!

But wait! One problem remains: if we use standard integer arithmetic, the matrix on the right ends up with a lot of values larger than 255, and we can’t store it in a byte array. That ain’t no good.

Luckily, it turns out that most theorems dealing with matrices and invertibility only assume we are operating over an arbitrary field, a mathematical structure that has addition, a 0, multiplication, a 1, and a reciprocal for every non-0 element.

Abstract algebra to the rescue! (Math seems to rescue computer science a lot.) It turns out that there is a field with 256 elements known as GF(2**8). In this field, 0 is the 0, 1 is the 1 (surprise!), addition is bitwise exclusive-or (so every number is its own additive inverse!), and multiplication is a weird, and seemingly unpredictable beast, that is best calculated by pre-populating a table of logarithms and exponents, then creating a 256×256 array containing the multiplication table (since each number fits in a byte, the whole table only takes 64K).

There are actually lots of ways to define multiplication that still meets the requirements of a field, but we have to pick one by defining what’s called the irreducible polynomial for the field.

Once we operate over this field, all matrix values are guaranteed to be in the range [0, 255], and our method for bytewise encoding and decoding becomes easy.

One more aside: zfec actually piles the 3×3 (well, KxK) identity matrix on top of the Vandermonde matrix, so the first K pieces are exactly what you’d get by cutting the file into K pieces in the obvious way and putting the tiny header in front. Even though our matrix is no longer a Vandermonde matrix, submatrices are still guaranteed to be invertible and it still works.