Saturday, July 27, 2013

Information entropy and data compression



In my last post, I talked about Shannon data entropy and showed a class to calculate that. Lets take it one step further and actually compress some data based off the data entropy we calculated.

To do this, first we calculate how many bits are needed to compress each byte of our data. Theoretically, this is the data entropy, rounded up to the next whole number (Math.Ceiling). But this is not always the case, and the number of unique symbols in our data may be a number that is too large to be represented in that many number of bits. We calculate the number of bits needed to represent the number of unique symbols by getting its Base2 logarithm. This returns a decimal (double), so we use Math.Ceiling to round to up to the nearest whole number as well. We set entropy_Ceiling to which ever number is larger. If the entropy_Ceiling is 8, then we should immediately return, as we cannot compress the data any further.

We start by making a compression and decompression dictionary. We make these by taking the sorted distribution dictionary (DataEntropyUTF8.GetSortedDistribution) and start assigning X-bit-length value to each entry in the sorted distribution dictionary, with X being entropy_Ceiling. The compression dictionary has a byte as the key and an array of bool (bool[]) as the value, while the decompression dictionary has an array of bool as the key, and a byte as a value. You'll notice in the decompression dictionary we store the array of bool as a string, as using an actual array as a key will not work, as the dictionary's EqualityComparer will not assign the same hash code for two arrays of the same values.

Then, compression is as easy as reading each byte, and getting the value from the compression dictionary for that byte and adding it to a list of bool (List), then converting that array of bool to an array of bytes.

Decompression consists of converting the compressed array of bytes into an array of bool, then reading in X bools at a time and getting the byte value from the decompression library, again with X being entropy_Ceiling.




But first, to make this process easier, and to make our code more manageable and readable, I define several extension methods to help us out, since .NET provides almost no support for working with data on the bit level, besides the BitArray class. Here are the extension methods that to make working with bits easier:

public static class BitExtentionMethods
{
    //
    // List<bool> extention methods
    //
    public static List<bool> ToBitList(this byte source)
    {
        List<bool> temp = ( new BitArray(source.ToArray()) ).ToList();
        temp.Reverse();
        return temp;
    }
    
    public static List<bool> ToBitList(this byte source,int startIndex)
    {
        if(startIndex<0 || startIndex>7) {
            return new List<bool>();
        }
        return source.ToBitList().GetRange(startIndex,(8-startIndex));
    }
    
    //
    // bool[] extention methods
    //
    public static string GetString(this bool[] source)
    {
        string result = string.Empty;
        foreach(bool b in source)
        {
            if(b) {
                result += "1";
            } else {
                result += "0";
            }
        }
        return result;
    }
    
    public static bool[] ToBitArray(this byte source,int MaxLength)
    {
        List<bool> temp = source.ToBitList(8-MaxLength);
        return temp.ToArray();
    }
    
    public static bool[] ToBitArray(this byte source)
    {
        return source.ToBitList().ToArray();
    }

    //
    // BYTE extention methods
    //
    public static byte[] ToArray(this byte source)
    {
        List<byte> result = new List<byte>();
        result.Add(source);
        return result.ToArray();
    }
    
    //
    // BITARRAY extention methods
    //
    public static List<bool> ToList(this BitArray source)
    {
        List<bool> result = new List<bool>();
        foreach(bool bit in source)
        {
            result.Add(bit);
        }
        return result;
    }
    
    public static bool[] ToArray(this BitArray source)
    {
        return ToList(source).ToArray();
    }
}
Remember, these need to be the base class in a namespace, not in a nested class.


Now, we are free to write our compression/decompression class:

public class BitCompression
{
    // Data to encode
    byte[] data;
    // Compressed data
    byte[] encodeData;
    // # of bits needed to represent data
    int encodeLength_Bits;
    // Original size before padding. Decompressed data will be truncated to this length.
    int decodeLength_Bits;
    // Bits needed to represent each byte (entropy rounded up to nearist whole number)
    int entropy_Ceiling;
    // Data entropy class
    DataEntropyUTF8 fileEntropy;
    // Stores the compressed symbol table
    Dictionary<byte,bool[]> compressionLibrary;
    Dictionary<string,byte> decompressionLibrary;
    
    void GenerateLibrary()
    {
        byte[] distTable = new byte[fileEntropy.Distribution.Keys.Count];
        fileEntropy.Distribution.Keys.CopyTo(distTable,0);
        
        byte bitSymbol = 0x0;
        bool[] bitBuffer = new bool[entropy_Ceiling];
        foreach(byte symbol in distTable)
        {
            bitBuffer = bitSymbol.ToBitArray(entropy_Ceiling);
            compressionLibrary.Add(symbol,bitBuffer);
            decompressionLibrary.Add(bitBuffer.GetString(),symbol);
            bitSymbol++;
        }
    }
    
    public byte[] Compress()
    {
        // Error checking
        if(entropy_Ceiling>7 || entropy_Ceiling<1) {
            return data;
        }

        // Compress data using compressionLibrar
        List<bool> compressedBits = new List<bool>();
        foreach(byte bite in data) {    // Take each byte, find the matching bit array in the dictionary
            compressedBits.AddRange(compressionLibrary[bite]);
        }
        decodeLength_Bits = compressedBits.Count;
        
        // Pad to fill last byte
        while(compressedBits.Count % 8 != 0) {
            compressedBits.Add(false);  // Pad to the nearest byte
        }
        encodeLength_Bits = compressedBits.Count;
        
        // Convert from array of bits to array of bytes
        List<byte> result = new List<byte>();
        int count = 0;
        int shift = 0;
        int offset= 0;
        int stop  = 0;
        byte current = 0;
        do
        {
            stop = encodeLength_Bits - count;
            stop = 8 - stop;
            if(stop<0) {
                stop = 0;
            }
            if(stop<8)
            {
                shift = 7;
                offset = count;
                current = 0;
                
                while(shift>=stop)
                {
                    current |= (byte)(Convert.ToByte(compressedBits[offset]) << shift);
                    shift--;
                    offset++;
                }
                
                result.Add(current);
                count += 8;
            }
        } while(count < encodeLength_Bits);
        
        encodeData = result.ToArray();
        return encodeData;
    }
    
    public byte[] Decompress(byte[] compressedData)
    {
        // Error check
        if(compressedData.Length<1) {
            return null;
        }
        
        // Convert to bit array for decompressing
        List<bool> bitArray = new List<bool>();
        foreach(byte bite in compressedData) {
            bitArray.AddRange(bite.ToBitList());
        }
        
        // Truncate to original size, removes padding for byte array
        int diff = bitArray.Count-decodeLength_Bits;
        if(diff>0) {
            bitArray.RemoveRange(decodeLength_Bits-1,diff);
        }

        // Decompress
        List<byte> result = new List<byte>();
        int count = 0;
        do
        {
            bool[] word = bitArray.GetRange(count,entropy_Ceiling).ToArray();
            result.Add(decompressionLibrary[word.GetString()]);
            
            count+=entropy_Ceiling;
        } while(count < bitArray.Count);
        
        return result.ToArray();
    }
    
    public BitCompression(string filename)
    {
        compressionLibrary  = new Dictionary<byte, bool[]>();
        decompressionLibrary = new Dictionary<string, byte>();
        
        if(!File.Exists(filename))  {
            return;
        }
        
        data = File.ReadAllBytes(filename);
        fileEntropy = new DataEntropyUTF8();
        fileEntropy.ExamineChunk(data);
        
        int unique  = (int)Math.Ceiling(Math.Log((double)fileEntropy.UniqueSymbols,2f));
        int entropy = (int)Math.Ceiling(fileEntropy.Entropy);
        
        entropy_Ceiling = Math.Max(unique,entropy);
        encodeLength_Bits   = data.Length * entropy_Ceiling;
        
        GenerateLibrary();
    }
}


Please feel free to comment with ideas, suggestions or corrections.