C# Programming Tips and Examples: Entropy


Thursday, November 24, 2016

EntropyGlance

Entropy at a glance

In a hurry? Skip straight to the C# source code - EntropyGlance; Entropy at a glance - A C# WinForms project - https://github.com/AdamWhiteHat/EntropyGlance

So I wrote a file entropy analysis tool for my friend, who works in infosec. Hands down, the coolest feature this tool offers is a System.Windows.Forms.DataVisualization.Charting visualization that graphs how the entropy changes across a whole file:

This application provides both Shannon (data) entropy and entropy as a compression ratio.
Get a more intuitive feel for the overall entropy at a glance by visualizing both measures of entropy as a percentage of a progress bar, instead of just numbers.

   However, for those who love numbers, the standard measures of entropy are given as well. Information entropy is expressed both as a quantity of bits/byte (on a range from 0 to 8) and as a 'normalized' value (range 0 to 1). High entropy means the data is random-looking, like encrypted or compressed information.
   The Shannon 'specific' entropy calculation makes no assumptions about the type of message it is measuring. It looks only at how often each symbol occurs, not at the order the symbols appear in. What this means is that while a message consisting of only 2 symbols will get a very low entropy score of 0.9/8, a message of 52 symbols (the alphabet, lower case first, then upper case) repeated in the same sequence one hundred times would yield a higher-than-average score of roughly 5.7/8 (the base-2 logarithm of 52), even though it is completely predictable.
   This is precisely why I included a compression ratio as a ranking of entropy. It is much closer to a notion of entropy that takes into account repeated patterns or predictable sequences, in the sense of Shannon's source coding theorem.
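
To make the difference between the two measures concrete, here is a minimal sketch that computes both for the repeated-alphabet message described above. It is not code from EntropyGlance: the class and method names are made up for illustration, and DEFLATE (System.IO.Compression.DeflateStream) is only an assumed stand-in for whatever compressor the tool really uses.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of EntropyGlance.
public static class EntropyVersusCompression
{
    // Shannon entropy in bits per byte (0 to 8), based on byte frequencies only.
    public static double ShannonEntropy(byte[] data)
    {
        if (data == null || data.Length == 0) return 0.0;

        int[] counts = new int[256];
        foreach (byte b in data) counts[b]++;

        double entropy = 0.0;
        foreach (int count in counts)
        {
            if (count == 0) continue;
            double p = (double)count / data.Length;
            entropy -= p * Math.Log(p, 2);
        }
        return entropy;
    }

    // Compressed size divided by original size. DEFLATE is an assumption here.
    public static double CompressionRatio(byte[] data)
    {
        if (data == null || data.Length == 0) return 1.0;

        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true, so the buffer can still be measured after the flush
            using (var deflate = new DeflateStream(buffer, CompressionMode.Compress, true))
            {
                deflate.Write(data, 0, data.Length);
            }
            return (double)buffer.Length / data.Length;
        }
    }

    // The 52-letter alphabet (lower case, then upper case) repeated 100 times.
    public static byte[] RepeatedAlphabet()
    {
        string alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
        return Encoding.ASCII.GetBytes(string.Concat(Enumerable.Repeat(alphabet, 100)));
    }
}
```

For the repeated alphabet, ShannonEntropy returns the base-2 logarithm of 52 because every symbol is equally frequent, while CompressionRatio should come out far below 1 because the compressor can exploit the repetition.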

Dive deep into the symbol distribution and analysis. This screen gives you the per-symbol entropy value and the ability to sort by rank, symbol, ASCII value, count, entropy, and hex value:

As always, the C# source code is being provided, hosted on my GitHub:
EntropyGlance; Entropy at a glance - A C# WinForms project - https://github.com/AdamWhiteHat/EntropyGlance

Tuesday, December 1, 2015

A Simple Word Prediction Library

The word prediction feature on our phones is pretty handy, and I've always thought it would be fun to write one. Last night I decided to check that off my list. As usual, the whole project and all of its code is available to browse on GitHub. I talk more about the library and the design choices I made below the obnoxiously long image:

[Image of Windows Phone's Word Prediction feature]

Visit the project and view the code on my GitHub, right here.
(Project released under Creative Commons)

Overview:

One thing you might notice, if for no other reason than I bring it up, is that I favor composition over inheritance. That is, my classes use a Dictionary internally, but they do not inherit from Dictionary. My word prediction library is not a minor variation or different flavor of the Dictionary class, and while it might be cool to access the word predictions for a word via an indexer, my word prediction library should not be treated as a dictionary.

Under the hood:

There is a dictionary (a collection of key/value pairs) of 'Word' objects. Each Word object has a value (the word) and its own dictionary of Word objects, implemented as its own separate class (that does not inherit from Dictionary). This hidden dictionary inside each Word keeps track of the probabilities of the next word for that given word. It does so by storing a Word as the key and an integer counter as the value, which gets incremented every time an attempt is made to add a word that already exists in the dictionary (similar to my frequency dictionary, covered here).
The WordPredictionDictionary class doesn't grow exponentially, because each word is only represented once, by one Word object. The dictionaries inside the Word class only store references to the Word objects, not copies of their values.
In order to begin using the WordPredictionDictionary to suggest words, one must train the WordPredictionDictionary on a representative body of text.
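
The structure described above might look roughly like the sketch below. This is not the library's actual code: every name here (Word, WordPredictionSketch, Train, Predict) is a hypothetical stand-in, and the tokenizer is a bare whitespace split. It only illustrates the composition idea and the shared Word references.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch; names and signatures are illustrative assumptions.
public sealed class Word
{
    // Composition: the follower counts live in a private Dictionary.
    // Keys are references to shared Word objects, not copies of their text.
    private readonly Dictionary<Word, int> followers = new Dictionary<Word, int>();

    public string Value { get; }

    public Word(string value) { Value = value; }

    // Increment the counter for a word seen directly after this one.
    public void AddFollower(Word next)
    {
        int count;
        followers.TryGetValue(next, out count);
        followers[next] = count + 1;
    }

    // Most frequently seen followers first.
    public List<string> Suggest(int max)
    {
        return followers.OrderByDescending(f => f.Value)
                        .Take(max)
                        .Select(f => f.Key.Value)
                        .ToList();
    }
}

public sealed class WordPredictionSketch
{
    // Each distinct word is represented by exactly one Word object.
    private readonly Dictionary<string, Word> words =
        new Dictionary<string, Word>(StringComparer.OrdinalIgnoreCase);

    private Word GetOrAdd(string text)
    {
        Word word;
        if (!words.TryGetValue(text, out word))
        {
            word = new Word(text);
            words.Add(text, word);
        }
        return word;
    }

    // Train on a body of text by counting each adjacent pair of words.
    public void Train(string corpus)
    {
        string[] tokens = corpus.Split(new[] { ' ', '\t', '\r', '\n' },
                                       StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i + 1 < tokens.Length; i++)
        {
            GetOrAdd(tokens[i]).AddFollower(GetOrAdd(tokens[i + 1]));
        }
    }

    public List<string> Predict(string word, int max = 3)
    {
        Word known;
        return words.TryGetValue(word, out known) ? known.Suggest(max) : new List<string>();
    }
}
```

After training on "the cat sat on the mat the cat ran", asking for the top prediction after "the" gives "cat", since it followed "the" twice and "mat" only once.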

TODO:

~~Write methods to serialize the trained data sets so they can be saved and reloaded.~~ This has been implemented.
Write an intelli-sense-like word suggestion program that implements the WordPredictionDictionary in an end-user application.

Wednesday, August 7, 2013

Pseudo 'random' even distribution table

In my last post, I discussed what a co-prime is and showed you how to find them.

So, what's so special about relatively prime numbers? Well, they can be used to create a one-for-one distribution table that is seemingly random, but is deterministically calculated (created).

To understand what I mean, picture the face of a clock...

It has the hours 1 through 12, and if you add an hour to 12, you get 1. This can also be thought of as a single digit in a base 12 number system. Now we need a co-prime to 12. 7 is relatively prime to 12, so let's choose 7.

Starting at hour 1, if we add 7 hours, it will be 8. If we add 7 more hours, we will get 3. 7 more, 10. If we keep adding 7 hours to our clock, the hour hand will land on each of the different numbers exactly once before repeating itself, 12 steps later. Intrigued yet?
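
The clock walk is easy to check in code. The sketch below uses a made-up helper, ClockWalk.Walk, which repeatedly adds a step to the current hour and wraps around past the last hour.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration.
public static class ClockWalk
{
    // Returns the hours landed on after each of `hours` steps,
    // starting from `start` and adding `step` hours each time.
    public static List<int> Walk(int hours, int step, int start)
    {
        var visited = new List<int>();
        int hour = start;
        for (int i = 0; i < hours; i++)
        {
            // Shift to 0-based, add the step, wrap, shift back to 1-based
            hour = ((hour - 1 + step) % hours) + 1;
            visited.Add(hour);
        }
        return visited;
    }
}
```

Walk(12, 7, 1) yields 8, 3, 10, 5, 12, 7, 2, 9, 4, 11, 6, 1: every hour exactly once. With a step that shares a factor with 12, such as 8, the walk only ever lands on 9, 5 and 1.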

If, say, we find a co-prime to the number of values that can be represented by a byte (8 bits, 256 [also expressed as 2^8=256 or 8=Log2(256)]), we can create an array of bytes with a length of 256, containing each of the 256 different possible bytes, distributed in a seemingly random order. The discrete order, or sequence, in which each number is visited is completely dependent on the value of the co-prime that was selected.

This table is now essentially a one-to-one, bijective mapping of one byte to another. To express this mapping to another party, say to map a stream of bytes back to their original values (decrypt), the entire table need not be exchanged, only the co-prime.
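
A minimal sketch of such a table follows. The post does not spell out how the table is filled, so this assumes the same repeated-addition walk as the clock example, starting from 0. The names CoprimeTable, Build and Invert are hypothetical.

```csharp
using System;

// Hypothetical sketch; the construction rule (start at 0, add the co-prime
// modulo 256) is an assumption carried over from the clock example.
public static class CoprimeTable
{
    // Builds a 256-entry table that contains every byte value exactly once.
    public static byte[] Build(int coprime)
    {
        // 256 is a power of two, so "co-prime to 256" simply means "odd"
        if (coprime <= 0 || coprime % 2 == 0)
        {
            throw new ArgumentException("Value must be positive and odd.", "coprime");
        }

        int step = coprime % 256;   // only the remainder affects the walk
        var table = new byte[256];
        int value = 0;
        for (int i = 0; i < 256; i++)
        {
            table[i] = (byte)value;
            value = (value + step) % 256;
        }
        return table;
    }

    // Builds the reverse mapping, so that inverse[table[b]] == b.
    public static byte[] Invert(byte[] table)
    {
        var inverse = new byte[256];
        for (int i = 0; i < 256; i++)
        {
            inverse[table[i]] = (byte)i;
        }
        return inverse;
    }
}
```

Both parties can call Build with the same co-prime and arrive at the same table, which is why only the co-prime needs to be exchanged. Mapping a byte through the table and then through Invert(table) returns the original byte.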

This provides a foundation for an encryption scheme whose technical requirements are similar to handling cipher block chaining (CBC) and its changing IV (initialization vector).

Now, it is easy to jump to the conclusion that such an encryption scheme is less secure than CBC, but this is not necessarily the case. While this approach may be conceptually simpler, the difficulty of discovering the sequence can be made arbitrarily hard.

First of all, the number of integers relatively prime to 256 is infinite: since 256 is a power of two, every odd number is co-prime to it. A co-prime to 256 does not have to be less than 256. Indeed, it may be several thousand times greater than 256. Additionally, any prime greater than 256 is necessarily co-prime to 256, and will likely have a seemingly more 'random' distribution/appearance.

There is, however, a limit here. It does not have to do with the number of co-primes, but is instead set by the number of possible sequences that can be represented by our array of 256 bytes; eventually, two different co-primes are going to map to the same sequence. (If the table is built by repeated addition modulo 256, as in the clock example, this happens quickly: only the co-prime's remainder modulo 256 matters, so co-primes that differ by a multiple of 256 produce the same table.) The order matters, and we don't allow repetition to exist in our sequence. This is called a permutation without repetition. The number of such permutations can be expressed as 256!, or 256 factorial, which instructs one to calculate the product 256 * 255 * 254 * 253 * [...] * 6 * 5 * 4 * 3 * 2 * 1, which equals exactly this number:

857817775342842654119082271681232625157781520279485619859655650377269452553147589377440291360451408450375885342336584306157196834693696475322289288497426025679637332563368786442675207626794560187968867971521143307702077526646451464709187326100832876325702818980773671781454170250523018608495319068138257481070252817559459476987034665712738139286205234756808218860701203611083152093501947437109101726968262861606263662435022840944191408424615936000000000000000000000000000000000000000000000000000000000000000

Yeah, that's right, that number has exactly 63 zeros on the end and is 507 digits long. (As an aside, the reason there are so many zeros on the end of this number is, well, for one it has a great many factors, but more specifically, its prime factorization includes 2^255 and 5^63, and so 63 fives multiply with 63 of those twos to make 63 tens, and hence that many zeros.)
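
Those figures can be checked with System.Numerics.BigInteger. The sketch below uses a made-up FactorialCheck class. Its TrailingZeros method counts the factors of five in n! (one for each multiple of 5, another for each multiple of 25, and so on), which is the same argument as in the aside above.

```csharp
using System;
using System.Numerics;

// Hypothetical helper for illustration.
public static class FactorialCheck
{
    // Computes n! exactly using arbitrary-precision integers.
    public static BigInteger Factorial(int n)
    {
        BigInteger result = BigInteger.One;
        for (int i = 2; i <= n; i++)
        {
            result *= i;
        }
        return result;
    }

    // Counts the trailing zeros of n! by counting factors of 5 in 1..n.
    // There are always more factors of 2 than of 5, so the fives decide.
    public static int TrailingZeros(int n)
    {
        int zeros = 0;
        for (long power = 5; power <= n; power *= 5)
        {
            zeros += (int)(n / power);
        }
        return zeros;
    }
}
```

For n = 256 the count is 51 + 10 + 2 = 63, and Factorial(256).ToString().Length is 507.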

Above I said arbitrarily hard. So far we have only considered one table, but try to fathom the complexity of many tables. I present three different ways to use multiple tables: nested, sequential, and mangled.
Furthermore, the distribution tables can be discarded and replaced.

I will explain what those mean and finish this post tomorrow.

Saturday, July 27, 2013

Information entropy and data compression

In my last post, I talked about Shannon data entropy and showed a class to calculate it. Let's take it one step further and actually compress some data based on the data entropy we calculated.

To do this, first we calculate how many bits are needed to represent each byte of our data. In theory, this is the data entropy rounded up to the next whole number (Math.Ceiling). But that is not always enough for a fixed-length code: the number of unique symbols in our data may be too large to be represented in that many bits. So we also calculate the number of bits needed to represent the number of unique symbols by taking its base-2 logarithm. This returns a fractional value (double), so we use Math.Ceiling to round it up to the nearest whole number as well. We set entropy_Ceiling to whichever number is larger; since Shannon entropy can never exceed the base-2 logarithm of the number of unique symbols, that is always the symbol-count figure. If entropy_Ceiling is 8, we should return immediately, as we cannot compress the data any further.

We start by making a compression and a decompression dictionary. We make these by taking the sorted distribution dictionary (DataEntropyUTF8.GetSortedDistribution) and assigning an X-bit-long value to each entry in it, with X being entropy_Ceiling. The compression dictionary has a byte as the key and an array of bool (bool[]) as the value, while the decompression dictionary has an array of bool as the key and a byte as the value. You'll notice that in the decompression dictionary we store the array of bool as a string. Using an actual array as a key will not work, because arrays are compared by reference: the dictionary's default EqualityComparer does not treat two separate arrays holding the same values as equal, and they will not get the same hash code.
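
The array-as-key problem is easy to reproduce in isolation. The small sketch below (ArrayKeyDemo and ToKey are made-up names, separate from the extension methods further down) looks up the same bit pattern once with a bool[] key and once with a string key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class for illustration.
public static class ArrayKeyDemo
{
    // Turns a bit pattern into a string such as "101".
    public static string ToKey(bool[] bits)
    {
        return new string(bits.Select(b => b ? '1' : '0').ToArray());
    }

    // Looks up an equal-valued but distinct array: the key is not found,
    // because arrays use reference equality by default.
    public static bool LookupByArray()
    {
        var byArray = new Dictionary<bool[], byte>();
        byArray.Add(new[] { true, false, true }, 0x41);
        return byArray.ContainsKey(new[] { true, false, true });
    }

    // Looks up an equal-valued string: the key is found,
    // because strings are compared by value.
    public static bool LookupByString()
    {
        var byString = new Dictionary<string, byte>();
        byString.Add(ToKey(new[] { true, false, true }), 0x41);
        return byString.ContainsKey(ToKey(new[] { true, false, true }));
    }
}
```

LookupByArray returns false and LookupByString returns true. Supplying a custom IEqualityComparer for bool[] to the dictionary would be another way around the problem; the string key is simply the approach this post takes.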

Then, compression is as easy as reading each byte, getting the value for that byte from the compression dictionary, adding it to a list of bool (List<bool>), and then converting that list of bool to an array of bytes.

Decompression consists of converting the compressed array of bytes into an array of bool, then reading in X bools at a time and getting the byte value from the decompression dictionary, again with X being entropy_Ceiling.

But first, to make this process easier, and to make our code more manageable and readable, I define several extension methods to help us out, since .NET provides almost no support for working with data on the bit level besides the BitArray class. Here are the extension methods that make working with bits easier:


public static class BitExtentionMethods
{
    //
    // List<bool> extension methods
    //
    public static List<bool> ToBitList(this byte source)
    {
        List<bool> temp = ( new BitArray(source.ToArray()) ).ToList();
        temp.Reverse();
        return temp;
    }
    
    public static List<bool> ToBitList(this byte source,int startIndex)
    {
        if(startIndex<0 || startIndex>7) {
            return new List<bool>();
        }
        return source.ToBitList().GetRange(startIndex,(8-startIndex));
    }
    
    //
    // bool[] extension methods
    //
    public static string GetString(this bool[] source)
    {
        string result = string.Empty;
        foreach(bool b in source)
        {
            if(b) {
                result += "1";
            } else {
                result += "0";
            }
        }
        return result;
    }
    
    public static bool[] ToBitArray(this byte source,int MaxLength)
    {
        List<bool> temp = source.ToBitList(8-MaxLength);
        return temp.ToArray();
    }
    
    public static bool[] ToBitArray(this byte source)
    {
        return source.ToBitList().ToArray();
    }

    //
    // BYTE extension methods
    //
    public static byte[] ToArray(this byte source)
    {
        List<byte> result = new List<byte>();
        result.Add(source);
        return result.ToArray();
    }
    
    //
    // BITARRAY extension methods
    //
    public static List<bool> ToList(this BitArray source)
    {
        List<bool> result = new List<bool>();
        foreach(bool bit in source)
        {
            result.Add(bit);
        }
        return result;
    }
    
    public static bool[] ToArray(this BitArray source)
    {
        return ToList(source).ToArray();
    }
}

Remember, extension methods must be defined in a top-level (non-nested), non-generic static class inside your namespace.

Now, we are free to write our compression/decompression class:


public class BitCompression
{
    // Data to encode
    byte[] data;
    // Compressed data
    byte[] encodeData;
    // # of bits needed to represent data
    int encodeLength_Bits;
    // Original size before padding. Decompressed data will be truncated to this length.
    int decodeLength_Bits;
    // Bits needed to represent each byte (entropy rounded up to nearest whole number)
    int entropy_Ceiling;
    // Data entropy class
    DataEntropyUTF8 fileEntropy;
    // Stores the compressed symbol table
    Dictionary<byte,bool[]> compressionLibrary;
    Dictionary<string,byte> decompressionLibrary;
    
    void GenerateLibrary()
    {
        byte[] distTable = new byte[fileEntropy.Distribution.Keys.Count];
        fileEntropy.Distribution.Keys.CopyTo(distTable,0);
        
        byte bitSymbol = 0x0;
        bool[] bitBuffer = new bool[entropy_Ceiling];
        foreach(byte symbol in distTable)
        {
            bitBuffer = bitSymbol.ToBitArray(entropy_Ceiling);
            compressionLibrary.Add(symbol,bitBuffer);
            decompressionLibrary.Add(bitBuffer.GetString(),symbol);
            bitSymbol++;
        }
    }
    
    public byte[] Compress()
    {
        // Error checking
        if(entropy_Ceiling>7 || entropy_Ceiling<1) {
            return data;
        }

        // Compress data using compressionLibrary
        List<bool> compressedBits = new List<bool>();
        foreach(byte bite in data) {    // Take each byte, find the matching bit array in the dictionary
            compressedBits.AddRange(compressionLibrary[bite]);
        }
        decodeLength_Bits = compressedBits.Count;
        
        // Pad to fill last byte
        while(compressedBits.Count % 8 != 0) {
            compressedBits.Add(false);  // Pad to the nearest byte
        }
        encodeLength_Bits = compressedBits.Count;
        
        // Convert from array of bits to array of bytes
        List<byte> result = new List<byte>();
        int count = 0;
        int shift = 0;
        int offset= 0;
        int stop  = 0;
        byte current = 0;
        do
        {
            stop = encodeLength_Bits - count;
            stop = 8 - stop;
            if(stop<0) {
                stop = 0;
            }
            if(stop<8)
            {
                shift = 7;
                offset = count;
                current = 0;
                
                while(shift>=stop)
                {
                    current |= (byte)(Convert.ToByte(compressedBits[offset]) << shift);
                    shift--;
                    offset++;
                }
                
                result.Add(current);
                count += 8;
            }
        } while(count < encodeLength_Bits);
        
        encodeData = result.ToArray();
        return encodeData;
    }
    
    public byte[] Decompress(byte[] compressedData)
    {
        // Error check
        if(compressedData.Length<1) {
            return null;
        }
        
        // Convert to bit array for decompressing
        List<bool> bitArray = new List<bool>();
        foreach(byte bite in compressedData) {
            bitArray.AddRange(bite.ToBitList());
        }
        
        // Truncate to original size, removes padding for byte array
        // (decodeLength_Bits is set by Compress(), so Compress() must be called first)
        int diff = bitArray.Count-decodeLength_Bits;
        if(diff>0) {
            bitArray.RemoveRange(decodeLength_Bits,diff);
        }

        // Decompress
        List<byte> result = new List<byte>();
        int count = 0;
        // Read one fixed-width code word at a time; does nothing if there are no bits
        while(count + entropy_Ceiling <= bitArray.Count)
        {
            bool[] word = bitArray.GetRange(count,entropy_Ceiling).ToArray();
            result.Add(decompressionLibrary[word.GetString()]);
            
            count+=entropy_Ceiling;
        }
        
        return result.ToArray();
    }
    
    public BitCompression(string filename)
    {
        compressionLibrary  = new Dictionary<byte, bool[]>();
        decompressionLibrary = new Dictionary<string, byte>();
        
        if(!File.Exists(filename))  {
            return;
        }
        
        data = File.ReadAllBytes(filename);
        fileEntropy = new DataEntropyUTF8();
        fileEntropy.ExamineChunk(data);
        
        int unique  = (int)Math.Ceiling(Math.Log((double)fileEntropy.UniqueSymbols,2f));
        int entropy = (int)Math.Ceiling(fileEntropy.Entropy);
        
        entropy_Ceiling = Math.Max(unique,entropy);
        encodeLength_Bits   = data.Length * entropy_Ceiling;
        
        GenerateLibrary();
    }
}

Please feel free to comment with ideas, suggestions or corrections.

Monday, July 22, 2013

Information Shannon Entropy

Shannon/data entropy is a measurement of uncertainty. Entropy can be used as a measure of randomness. Data entropy is typically expressed as the number of bits needed to encode or represent data. In the example below, we are working with bytes, so the max entropy for a stream of bytes is 8.
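
Written out as a formula, and taking p_i to be the fraction of the data made up of byte value i (a notation chosen here for illustration, not taken from the original post), the quantity being measured is:

```latex
H = -\sum_{i} p_i \log_2 p_i = \sum_{i} p_i \log_2 \frac{1}{p_i}
```

If all 256 byte values are equally likely, every p_i is 1/256 and H is 8 bits per byte. If the data consists of a single repeated byte value, its p_i is 1 and H is 0.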

A file with high entropy is one in which each symbol is more or less equally likely to appear next. If a file or file stream has high entropy, it is probably compressed, encrypted, or random. This can be used to detect packed executables, cipher streams on a network, or a breakdown of encryption on a network channel that is expected to always be encrypted.

A text file will have relatively low entropy. If a file has low data entropy, it means that the file will compress well.

This post and code were inspired by Mike Schiffman's excellent explanation of data entropy on his Cisco Security Blog.

Here is what I wrote:


using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace DataEntropy
{
    public class DataEntropyUTF8
    {
        // Stores the number of times each symbol appears
        SortedList<byte,int> distributionDict;
        // Stores the probability of each symbol
        SortedList<byte,double> probabilityDict;
        // Stores the last calculated entropy
        double overalEntropy;
        // Used for preventing unnecessary processing
        bool isDirty;
        // Bytes of data processed
        int dataSize;
        
        public int DataSampleSize
        {
            get { return dataSize; }
            private set { dataSize = value; }
        }
        
        public int UniqueSymbols
        {
            get { return distributionDict.Count; }
        }
        
        public double Entropy
        {
            get { return GetEntropy(); }
        }
        
        public Dictionary<byte,int> Distribution
        {
            get { return GetSortedDistribution(); }
        }
        
        public Dictionary<byte,double> Probability
        {
            get { return GetSortedProbability(); }
        }
        
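        // Returns the symbol that occurs most often
        public byte GetGreatestDistribution()
        {
            return distributionDict.OrderByDescending(entry => entry.Value).First().Key;
        }
        
        // Returns the symbol with the highest probability
        public byte GetGreatestProbability()
        {
            GetEntropy(); // Refresh probabilityDict if new data has been examined
            return probabilityDict.OrderByDescending(entry => entry.Value).First().Key;
        }
        
        public double GetSymbolDistribution(byte symbol)
        {
            return distributionDict[symbol];
        }
        
        // Note: despite its name, this returns the symbol's probability (count / total bytes)
        public double GetSymbolEntropy(byte symbol)
        {
            GetEntropy(); // Refresh probabilityDict if new data has been examined
            return probabilityDict[symbol];
        }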
        
        Dictionary<byte,int> GetSortedDistribution()
        {
            List<Tuple<int,byte>> entryList = new List<Tuple<int, byte>>();
            foreach(KeyValuePair<byte,int> entry in distributionDict)
            {
                entryList.Add(new Tuple<int,byte>(entry.Value,entry.Key));
            }
            entryList.Sort();
            entryList.Reverse();
            
            Dictionary<byte,int> result = new Dictionary<byte, int>();
            foreach(Tuple<int,byte> entry in entryList)
            {
                result.Add(entry.Item2,entry.Item1);
            }
            return result;
        }
        
        Dictionary<byte,double> GetSortedProbability()
        {
            List<Tuple<double,byte>> entryList = new List<Tuple<double,byte>>();
            foreach(KeyValuePair<byte,double> entry in probabilityDict)
            {
                entryList.Add(new Tuple<double,byte>(entry.Value,entry.Key));
            }
            entryList.Sort();
            entryList.Reverse();
            
            Dictionary<byte,double> result = new Dictionary<byte,double>();
            foreach(Tuple<double,byte> entry in entryList)
            {
                result.Add(entry.Item2,entry.Item1);
            }
            return result;
        }
        
        double GetEntropy()
        {
            // If nothing has changed, don't recalculate
            if(!isDirty) {
                return overalEntropy;
            }
            // Reset values
            overalEntropy = 0;
            probabilityDict = new SortedList<byte,double>();
            
            foreach(KeyValuePair<byte,int> entry in distributionDict)
            {
                // Probability = Freq of symbol / # symbols examined thus far
                probabilityDict.Add(
                    entry.Key,
                    (double)distributionDict[entry.Key] / (double)dataSize
                );
            }
            
            foreach(KeyValuePair<byte,double> entry in probabilityDict)
            {
                // Entropy = probability * Log2(1/probability)
                overalEntropy += entry.Value * Math.Log((1/entry.Value),2);
            }
            
            isDirty = false;
            return overalEntropy;
        }
        
        public void ExamineChunk(byte[] chunk)
        {
            // Check for null first, otherwise chunk.Length throws on a null array
            if(chunk==null || chunk.Length<1) {
                return;
            }
            
            isDirty = true;
            dataSize += chunk.Length;
            
            foreach(byte bite in chunk)
            {
                if(!distributionDict.ContainsKey(bite))
                {
                    distributionDict.Add(bite,1);
                    continue;
                }
                distributionDict[bite]++;
            }
        }
        
        public void ExamineChunk(string chunk)
        {
            ExamineChunk(StringToByteArray(chunk));
        }
        
        byte[] StringToByteArray(string inputString)
        {
            // Encode the string as UTF-8 bytes
            return System.Text.Encoding.UTF8.GetBytes(inputString);
        }
        
        void Clear()
        {
            isDirty = true;
            overalEntropy = 0;
            dataSize = 0;
            distributionDict = new SortedList<byte, int>();
            probabilityDict = new SortedList<byte, double>();
        }
        
        public DataEntropyUTF8(string fileName)
        {
            this.Clear();
            if(File.Exists(fileName))
            {
                ExamineChunk(  File.ReadAllBytes(fileName) );
                GetEntropy();
                GetSortedDistribution();
            }
        }
        
        public DataEntropyUTF8()
        {
            this.Clear();
        }
    }
}
