
Thursday, November 24, 2016

EntropyGlance

Entropy at a glance



In a hurry? Skip straight to the C# source code - EntropyGlance; Entropy at a glance - A C# WinForms project - https://github.com/AdamWhiteHat/EntropyGlance



So I wrote a file entropy analysis tool for my friend, who works in infosec. Hands-down the coolest feature this tool offers is a System.Windows.Forms.DataVisualization.Charting visualization that graphs how the entropy changes across a whole file:



This application provides both Shannon (data) entropy and entropy as a compression ratio.
Get a more intuitive feel for the overall entropy at a glance by visualizing both measures of entropy as progress bars, instead of just numbers.





   However, for those who love numbers, the standard measures of entropy are also given. Information entropy is expressed both as a quantity of bits/byte (on a range from 0 to 8) and as a 'normalized' value (range 0 to 1). High entropy means the data is random-looking, like encrypted or compressed information.
   The Shannon 'specific' entropy calculation makes no assumptions about the type of message it is measuring. This means that while a message consisting of only 2 symbols will get a very low entropy score of 0.9/8, a message of 52 symbols (the alphabet, lower case first, then upper) repeated in the same sequence one hundred times would yield a higher-than-average score of 6/8.
   This is precisely why I included a compression ratio as a ranking of entropy: it is much closer to a notion of entropy that takes into account repeated patterns or predictable sequences, in the sense of Shannon's source coding theorem.
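For the curious, the core byte-wise Shannon entropy calculation looks something like this minimal sketch (an illustrative helper, not necessarily the exact code in EntropyGlance):

public static double ComputeShannonEntropy(byte[] data)
{
    // Count the occurrences of each of the 256 possible byte values
    long[] counts = new long[256];
    foreach (byte b in data) { counts[b]++; }

    // Sum -p * log2(p) over every symbol that actually occurs
    double entropy = 0.0;
    foreach (long count in counts)
    {
        if (count == 0) { continue; }
        double p = (double)count / data.Length;
        entropy -= p * Math.Log(p, 2);
    }
    return entropy; // bits per byte, 0 to 8; divide by 8 for the 'normalized' 0-1 value
}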



Dive deep into the symbol distribution and analysis. This screen gives you the per-symbol entropy value and the ability to sort by rank, symbol, ASCII value, count, entropy, and hex value:



As always, the C# source code is being provided, hosted on my GitHub:
EntropyGlance; Entropy at a glance - A C# WinForms project - https://github.com/AdamWhiteHat/EntropyGlance



Wednesday, June 29, 2016

Bloom Filter - A novel, space-efficient data structure like a hash table for billions of values.




Introduction


A bloom filter is a truly novel data structure. Similar to a hash table, it can tell you if you've hashed a particular value previously. But you can add many, many more values to a bloom filter than to a hash table, its performance does not degrade as the number of values in the set grows large, and it requires only a fraction of the space of a hash table to store!

This is not just an academic exercise, or something that only works in theory or in special cases. Indeed, companies like Google use bloom filters to quickly determine whether they have ever seen a given value before, avoiding a more costly lookup against a database every time the bloom filter returns false.



Probabilistic


First off, it's important to understand that a bloom filter is NOT a hash table; it operates in an entirely different way. A bloom filter is what is known as a probabilistic data structure: it can tell you, to within a certain probability, whether an element exists in a set. In other words, false positive matches ARE possible, but false negative matches ARE NOT possible. If you check a bloom filter for the existence of a value and it returns false, you know with 100% certainty that the filter has never seen that value before. However, if you test a value against the filter and it returns true, there is a small probability that it has in fact not seen that value before and is returning a false positive. How big of a probability? Here's the beauty: it can be as small as you want it to be. It depends on a few factors, including the size of the filter, how full it is, and how many bits you use to store each value in the filter.

In a Hashtable class, each item is stored as a key/value pair, so the size of your object plus a 32-bit integer. Contrast that with a bloom filter, which stores only about 3-7 bits per value hashed. Also, my implementation applies compression when saving the filter to disk, providing even more space savings. A bloom filter with 160,000 values hashed and a 1% collision probability results in a filter that is 235KB uncompressed, and a whopping 54KB when compressed! Remember, the filter is an array of bits. The entropy of the array is at its greatest, and thus the compression ratio at its worst, when exactly 1/2 of the bits are flipped, i.e. the filter is half-way 'full'. This gives the compressed filter the unusual property of getting smaller as you add more hashes past the half-way point. To be clear, the actual filter itself never changes size; it is only the compressed version that varies in size.

To handle the compression I just used the System.IO.Compression.DeflateStream class. An important note about working with this class: build an array of bytes and send your entire file in one go. That way it will compress the whole file as one chunk. If you send data to this stream piecemeal, it will compress each piece separately and you will get a poor compression ratio.
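A minimal sketch of what that looks like (an assumed helper using System.IO and System.IO.Compression, compressing one whole buffer):

public static byte[] CompressWhole(byte[] data)
{
    using (MemoryStream output = new MemoryStream())
    {
        using (DeflateStream deflate = new DeflateStream(output, CompressionMode.Compress))
        {
            // A single Write over the entire buffer compresses it as one chunk
            deflate.Write(data, 0, data.Length);
        }
        // MemoryStream.ToArray is still valid after the stream is closed
        return output.ToArray();
    }
}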

How it works


So how does this all work? The filter part of a bloom filter is just a large array of bits. You also require several different hash functions that each return a unique result for the same input value. When you add a value to the filter, the value is sent to about 3-7 different hash functions. Each hash function returns a value between 0 and the number of bits in the filter, and each such value is used as an index into the array of bits that is your filter. When adding a value, you simply set the bit at each index location to 1. When testing for the presence of a value, you pass the value to the hash functions the same way as above, then visit each index and check whether any of the bits are 0. If even one bit at one of those index positions is zero, it means the filter has never seen that value before, because adding it would have set all those bits to 1. If all the bits at the index locations are 1, then it is likely that the filter has seen that value before. However, there is a chance that it is a false positive, because that value's different hashes may all map to bits set by other values. As the filter becomes more full, more bits are set to 1, and so the odds of a false positive go up. To counter this, build your filter by supplying the estimated number of values you are likely to store in it, and don't go above a certain ratio of 1 bits to 0 bits. If you were to let your filter hash so many values that every bit got set to 1, the probability of receiving a false positive for a random value would become 100%.
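Here is a minimal sketch of that add/test logic (using System and System.Collections). Note it derives its k indices with the well-known double-hashing trick, a simpler stand-in for the stream-cipher hashing my implementation actually uses (described below):

public class SimpleBloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public SimpleBloomFilter(int sizeInBits, int hashCount)
    {
        _bits = new BitArray(sizeInBits);
        _hashCount = hashCount;
    }

    // Derive the k-th index from two base hashes (double hashing)
    private int GetIndex(string value, int k)
    {
        int h1 = value.GetHashCode();
        int h2 = StringComparer.OrdinalIgnoreCase.GetHashCode(value);
        long combined = (long)h1 + (long)k * h2;
        return (int)(Math.Abs(combined) % _bits.Length);
    }

    public void Add(string value)
    {
        for (int k = 0; k < _hashCount; k++) { _bits[GetIndex(value, k)] = true; }
    }

    // False means definitely not present; true means probably present
    public bool Contains(string value)
    {
        for (int k = 0; k < _hashCount; k++)
        {
            if (!_bits[GetIndex(value, k)]) { return false; }
        }
        return true;
    }
}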



Solving the many hash problem

As I mentioned before, this requires several different hash functions that each return a unique result for the same input value. Although I said 3-7 hash functions, you might require 14 or more when working with filters that must handle a large number of hashes, or a low false positive likelihood, or both.

Instead of writing a bunch of separate hash algorithms, I implemented a stream cipher wherein I scramble the cipher table by a number of rounds that is unique to that input. Then I can return as many indices as the filter is configured for. This sets up the table once per value. The table needs to be reset each time, or else the indices we mark would depend on every value that came before, in that particular order. Currently the bottleneck is how many times it has to scramble the table for each value. If you need to hash really long values, you'll want to lower the number of rounds it scrambles the table.
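A loose sketch of that shape, with an RC4-style key schedule standing in for my actual cipher (illustrative only):

// Scramble a 256-entry state table using the input value as the key,
// then read off as many filter indices as needed (assumes filterSize <= 65536).
public static int[] GetIndices(string value, int indexCount, int filterSize)
{
    byte[] state = new byte[256];
    for (int i = 0; i < 256; i++) { state[i] = (byte)i; }

    // Key-schedule-style scramble: swaps driven by the input's characters
    int j = 0;
    for (int i = 0; i < 256; i++)
    {
        j = (j + state[i] + value[i % value.Length]) % 256;
        byte swap = state[i]; state[i] = state[j]; state[j] = swap;
    }

    // Derive indices from consecutive state bytes
    int[] indices = new int[indexCount];
    for (int k = 0; k < indexCount; k++)
    {
        indices[k] = ((state[k] << 8) | state[k + 1]) % filterSize;
    }
    return indices;
}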



Variations


In this implementation, the bloom filter's size is set once you create it: it cannot grow bigger if it gets too full, nor can you resize it to become smaller if you sized it too big. Nor does this implementation support removal of items; because multiple values can rely on the same bit, removing one value would cause several others to begin reporting false negatives.

In order to make a bloom filter that supports deletion, use a number like a byte instead of a bit for each slot in your filter. Each time you visit an index while adding a value, increment the number you find there; to delete a value, visit each index as before, but decrement the number. This way, if two values map to the same index, that information is tracked by the count. This is what is known as a Counting Bloom Filter.
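Sketched against the same GetIndex helper as in the sketch above, with a hypothetical _counts byte array in place of the bit array:

// One small counter per slot instead of one bit, so deletion becomes possible
public void Add(string value)    { for (int k = 0; k < _hashCount; k++) { _counts[GetIndex(value, k)]++; } }
public void Remove(string value) { for (int k = 0; k < _hashCount; k++) { _counts[GetIndex(value, k)]--; } }

public bool Contains(string value)
{
    for (int k = 0; k < _hashCount; k++)
    {
        if (_counts[GetIndex(value, k)] == 0) { return false; }
    }
    return true;
}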

There are other variants of bloom filters out there, including bloom filters that can grow in size if they get too full, but such a thing is beyond the scope of my needs. In essence, when the filter gets too full, you create a second, separate filter, and add new values by first checking the first filter for existence and, if absent, adding the value to the second filter. Checking for the presence of a value then requires checking both (and any further) filters. For information on scalable bloom filters, please see this whitepaper.

The code


My C# Bloom Filter project on GitHub
Or download zip here.








Tuesday, December 1, 2015

A Simple Word Prediction Library




   The word prediction feature on our phones is pretty handy, and I've always thought it would be fun to write one; last night I decided to check that off my list. As usual, the whole project and all of its code is available to browse on GitHub. I talk more about the library and the design choices I made below the obnoxiously long image:


[Image of Windows Phone's Word Prediction feature]

   Visit the project and view the code on my GitHub, right here.
      (Project released under Creative Commons)

Overview:

   One thing you might notice, if for no other reason than I bring it up, is that I favor composition over inheritance. That is, my classes use a Dictionary internally, but they do not inherit from Dictionary. My word prediction library is not a minor variation or different flavor of the Dictionary class, and while it might be cool to access the word predictions for a word via an indexer, my word prediction library should not be treated as a dictionary.

Under the hood:

   There is a dictionary (a list of key/value pairs) of 'Word' objects. Each Word class has a value (the word) and its own dictionary of Word objects, implemented as its own separate class (that does not inherit from Dictionary). This hidden dictionary inside each Word class keeps track of the probabilities of the next word, for that given word. It does so by storing a Word as the key, and an integer counter value that gets incremented every time it tries to add a word to the dictionary that already exists (similar to my frequency dictionary, covered here).
The WordPredictionDictionary class doesn't grow exponentially, because each word is represented only once, by one Word class. The dictionaries inside the Word class store only references to the Word objects, not copies of their values.
In order to begin using the WordPredictionDictionary to suggest words, one must train it on a representative body of text.
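A minimal sketch of the idea (illustrative names using System.Collections.Generic and System.Linq, not the actual WordPredictionDictionary API):

public class Word
{
    public string Value { get; private set; }

    // Hidden frequency dictionary: follower word -> number of times it followed
    private Dictionary<Word, int> _nextWords = new Dictionary<Word, int>();

    public Word(string value) { Value = value; }

    // Record that 'next' followed this word; increment its counter if already present
    public void AddNext(Word next)
    {
        if (_nextWords.ContainsKey(next)) { _nextWords[next]++; }
        else { _nextWords.Add(next, 1); }
    }

    // The most frequently observed follower is the suggestion
    public string SuggestNext()
    {
        return _nextWords.OrderByDescending(kvp => kvp.Value)
                         .Select(kvp => kvp.Key.Value)
                         .FirstOrDefault();
    }
}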

TODO:

  • Write methods to serialize the trained data sets so they can be saved and reloaded. (This has since been implemented.)
  • Write an intelli-sense-like word suggestion program that implements the WordPredictionDictionary in an end-user application.

Tuesday, October 6, 2015

Mixed Radix Numeral System class and Counter


Mixed Radix Calculator

   My 'Mixed Radix Calculator' creates a counting system of radices (plural of radix), such as base 12, or mixed radices such as Minutes/Hours/Days/Years: 365:24:60:60. I chose the left side to be the most significant side. This is merely a personal preference, and my MixedRadixSystem class supports displaying both alignments.

   Of course you don't have to choose a mixed radix numeral system; you can count in an N-base numeral system, such as base 7 or the more familiar base 16. Another feature lies in my RadixNumeral class: each numeral, or place value, supports having its own dictionary of symbols.


Screenshot of Mixed Radix Calculator
      (Project released under Creative Commons)

-  52:7:24:60:60:1000  -


  A numeral system (or system of numeration) is a writing system for expressing numbers.


  The most familiar one is of course the decimal numeral system. This is a base-10 numbering system. Computers use a binary numeral system. The base is sometimes called the radix or scale.

  Not all numbering systems have just one base. Take for example, how we currently divide time: There are 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day, and 365 days in a year. This is called a mixed radix numeral system, and one might express the above mixed radix system like: 365:24:60:60.
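  The digit-extraction arithmetic behind this is simple; here is a small sketch (illustrative only, not my MixedRadixSystem class):

// Decompose a value into mixed radix digits, most significant radix first.
public static int[] ToMixedRadix(long value, int[] radices)
{
    int[] digits = new int[radices.Length];
    for (int i = radices.Length - 1; i >= 0; i--)
    {
        digits[i] = (int)(value % radices[i]);
        value /= radices[i];
    }
    return digits;
}

// Example: 93784 seconds in the 365:24:60:60 system
// ToMixedRadix(93784, new[] { 365, 24, 60, 60 }) -> 1:2:3:4
// i.e. 1 day, 2 hours, 3 minutes, 4 seconds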

  https://en.wikipedia.org/wiki/Mixed_radix
  http://mathworld.wolfram.com/Base.html

Uses:
  I haven't found a lot of use cases for it yet, but it is interesting. I originally built this because I wanted to experiment with numeral systems that use increasing consecutive prime numbers for each radix, as well as experiment with some odd bases, such as base 3 or base 7.

  In a single base, say base 7, the 'round numbers' are those with only one place value having a 1 and the rest having zeros, such as 1:0:0:0:0 (in base 7). Such numbers are powers of 7, and every place value except for the 1's place is a multiple of 7.

  A mixed radix numeral system can represent a polynomial, and possibly provide for a simpler way to visualize and reason about them.

  Yet another possible use is to make a numeral system with a base that is larger than, and co-prime to, some other target number (say 256), to make a bijective map from every value in a byte to some other value exactly once, by repeatedly adding the value of the co-prime, modulo 256. The result can appear rather random (or sometimes not at all), but the mapping is easily determined given the co-prime. I have talked about this notion before on my blog:
  https://csharpcodewhisperer.blogspot.com/search/label/Coprime

  If you like this project, you would probably like my project EquationFinder, which finds equations given constraints:
  https://github.com/AdamWhiteHat/EquationFinder


Tuesday, September 22, 2015

Threaded Equation Finder



Threaded Equation Finder

Find arithmetic equations that equate to a given 'target' value, using a given number of terms and operators.

Introduction

   You should all be familiar with how a typical computer works: you give it some variables, describe the quantities of some resources you have, choose an algorithm, let it process, and it returns to you a result or outcome. Now imagine a computer that worked the other way around. I believe it was Douglas Adams who described the notion of an altogether different type of computer: you tell the computer what you want the outcome to be, and it goes off figuring out how to get there and what you need to do it. Z3, the theorem prover, and the constraint satisfaction problem (CSP) solver (and probably others) in Microsoft's Solver Foundation do almost exactly that.
   There is also the idea of Backcasting, which is a similar, but different idea.

   My program isn't as fancy as all that, but it does find equations that equate to a given 'target' value, albeit at random. You define constraints other than just the target value, such as which operators are allowed in the equation, the quantity of terms, and the range or set of allowed terms.
For example, how many different ways can 9 nines equal 27, using only addition, subtraction, and multiplication, if evaluated left-to-right (ignoring the order of operations)? Turns out there are only 67 ways.

(above) Application Screen Shot

How it works

   The actual approach for finding equations that equate to an arbitrary target or 'goal' value is rather primitive: brute force. Several threads are launched asynchronously to generate thousands of random equations, evaluate them, and keep the results that have not been found yet.
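The heart of that random search can be sketched like so (an illustrative reduction using System and System.Text, not the actual IEquation strategies described below):

static readonly Random rand = new Random();

// Build one random equation, evaluating strictly left-to-right,
// and report the value it equates to.
public static string RandomEquation(int[] terms, char[] operators, int termCount, out long result)
{
    StringBuilder expression = new StringBuilder();
    result = terms[rand.Next(terms.Length)];
    expression.Append(result);

    for (int i = 1; i < termCount; i++)
    {
        char op = operators[rand.Next(operators.Length)];
        long term = terms[rand.Next(terms.Length)];
        if      (op == '+') { result += term; }
        else if (op == '-') { result -= term; }
        else                { result *= term; }
        expression.Append(op).Append(term);
    }
    return expression.ToString();
}
// A caller loops this on several threads, keeping expressions whose
// result equals the target and that have not been seen yet.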

   This is something I wrote approx. 2 years ago. I dug it up and decided to publish it because I thought it was interesting. As this was an 'experiment', I created different ways of storing and evaluating the expressions, and made those different 'strategies' conform to a common interface so I could easily swap them out and compare them. I have refactored the code so that each class that implements IEquation is in its own project and creates its own assembly.

   There are two fully-working strategies for representing equations. One represents the equation as a list of 2-tuples (Term, Operator); it does not perform order of operations and is rather obtuse. The other stores the equation as a string and evaluates it using MSScriptControl.ScriptControl to Eval the string as a line of VBScript. This is unsurprisingly slower, but allows for much more robust equation evaluation: order of operations is respected with the ScriptControl strategy, and it opens the way to using parentheses.

   Another idea for a strategy, which I have not implemented but would like to, is a left-recursive Linq.Expression builder. Also, maybe I could somehow use Microsoft Solver Foundation for a wiser equation generation strategy than generating at random.



Limitations

   Today, however, there are better architectures. A concurrent system like this would benefit greatly from the Actor model. If you wanted to write complex queries against the stream of equations being generated, selected, or solved, maybe Reactive Extensions would be a slam dunk.

   Although this project certainly is no Z3, it does provide an example of an interface... perhaps even a good one.



Running on Raspberry Pi 2

   Microsoft's Managed Extensibility Framework (MEF) might be a good fit here, but I also wrote a console client that is designed to be run with Mono on the Raspberry Pi 2. MEF is a proprietary Microsoft .NET technology that is not supported in Mono. The extra metadata in the assembly shouldn't be a problem, but having a dependency on the MEF assembly will be. Probing the runtime environment and dynamically loading assemblies would be required, which I have not had time to do, so at this time, there is no MEF.

   The reason the Mono client is a console application is that Mono and WinForms on the Raspberry Pi 2 fail for some people. The problem has something to do with hardware floating point vs. software floating point, and it happens to manifest itself when using a TextBox control. The only thing that I haven't tried, and that should presumably fix it, is to re-build Mono from the latest source.



Wednesday, December 3, 2014

Word Frequency Dictionary & Sorted Occurrence Ranking Dictionary for Generic Item Quantification and other Statistical Analyses



Taking a little inspiration from DotNetPerl's MultiMap, I created a generic class that uses a Dictionary under the hood to create a keyed collection that tracks the frequency of matching or duplicate items of arbitrary type. For lack of a better name, I call it a FrequencyDictionary. Instead of Key/Value pairs, the FrequencyDictionary has the notion of an Item/Frequency pair, with the key being the Item. I strayed from Key/Value because I may ultimately end up calling the Item the Value.

I invented the FrequencyDictionary while writing a pisg-like IRC log file statistics generator program. It works well for word frequency analysis. After some pre-processing/cleaning of the text log file, I parse each line by spaces to get an array of words, then use FrequencyDictionary.AddRange to add the words to my dictionary.

The FrequencyDictionary is essentially a normal Dictionary (with T being type String in this case), however when you attempt to add a key that already exists, it simply increments the Value integer. After processing the file, you essentially have a Dictionary where the Key is a word that was found in the source text, and the Value is the number of occurrences of that word in the source text.

Taking the idea even further, I created a complementary Dictionary that is essentially a sorted version of the FrequencyDictionary with the Keys/Values reversed. I call it a RankingDictionary: the Keys represent the number of occurrences of a word, with the Values being a List of words that occurred that many times in the source text. The name comes from using it to produce a top-10 ranking of the most popular words, most popular nicks, etc.

The FrequencyDictionary has a GetRankingDictionary() method to make the whole thing very easy to use. Typically I don't use the FrequencyDictionary too extensively, but rather as a means to get a RankingDictionary, which I base the majority of my IRC statistics logic on. The RankingDictionary also proved very useful for finding Naked Candidates and Hidden Pairs or Triples in my Sudoku solver application, which I will be releasing on Windows and Windows Phone and blogging about shortly. Hell, I was even thinking about releasing the source code to my Sudoku app, since it's so elegant and a great example of beautiful, readable code for a complex problem.

Anyways, the code for the Frequency and Ranking Dictionary is heavily commented with XML documentation comments, so I'm going to go ahead and let the code speak for itself. I will add usage examples later. In fact, I will probably release the pisg-like IRC stats program source code, since I don't think I'm going to take it any farther.

Limitations: I ran into problems trying to parse large text files approaching 3-4 megabytes. Besides taking up a bunch of memory, the Dictionary begins to encounter many hash collisions once the size of the collection gets large enough. This completely kills performance; eventually most of the time is spent trying to resolve collisions, so it grinds to a near halt. You might notice the constructor public FrequencyDictionary(int Capacity), where you can specify a maximum capacity for the Dictionary. A good, safe ceiling is about 10,000. A better implementation of the GetHashCode() method might be in order, but that is not a problem I have felt like solving yet.




/// <summary>
/// A keyed collection of Item/Frequency pairs, (keyed off Item).
/// If a duplicate Item is added to the Dictionary, the Frequency for that Item is incremented.
/// </summary>
public class FrequencyDictionary<ITM>
{
  // The underlying Dictionary
  private Dictionary<ITM, int> _dictionary;
  
  /// <summary>
  /// Initializes a new instance of the FrequencyDictionary that is empty.
  /// </summary>
  public FrequencyDictionary() { _dictionary = new Dictionary<ITM, int>(); }
  
  /// <summary>
  /// Initializes a new instance of the FrequencyDictionary that has a maximum Capacity limit.
  /// </summary>
  public FrequencyDictionary(int Capacity) { _dictionary = new Dictionary<ITM, int>(Capacity); }
  
  /// <summary>
  /// Gets a collection containing the Items in the FrequencyDictionary.
  /// </summary>
  public IEnumerable<ITM> ItemCollection { get { return this._dictionary.Keys; } }
  
  /// <summary>
  /// Gets a collection containing the Frequencies in the FrequencyDictionary.
  /// </summary>
  public IEnumerable<int> FrequencyCollection { get { return this._dictionary.Values;} }
  
  /// <summary>
  /// Adds the specified Item to the FrequencyDictionary.
  /// </summary>
  public void Add(ITM Item)
  {
    if  ( this._dictionary.ContainsKey(Item))   { this._dictionary[Item]++; }
    else    { this._dictionary.Add(Item,1); }
  }
  
  /// <summary>
  /// Adds the elements of the specified array to the FrequencyDictionary.
  /// </summary>
  public void AddRange(ITM[] Items)
  {
    foreach(ITM item in Items) { this.Add(item); }
  }
  
  /// <summary>
  /// Gets the Item that occurs most frequently.
  /// </summary>
  /// <returns>A KeyValuePair containing the Item (key) and how many times it has appeared (value).</returns>
  public KeyValuePair<ITM,int> GetMostFrequent()
  {
    int maxValue = this._dictionary.Values.Max();
    return this._dictionary.Where(kvp => kvp.Value == maxValue).FirstOrDefault();
  }
  
  /// <summary>
  /// Gets the number of Item/Frequency pairs contained in the FrequencyDictionary.
  /// </summary>
  public int Count { get { return this._dictionary.Count; } }
  
  /// <summary>
  /// Returns an enumerator that iterates through the FrequencyDictionary.
  /// </summary>
  public IEnumerator<KeyValuePair<ITM,int>> GetEnumerator()
  {
    return this._dictionary.GetEnumerator();
  }
  
  /// <summary>
  /// Gets the Frequency (occurrences) associated with the specified Item.
  /// </summary>
  public int this[ITM Item]
  {
    get { if (this._dictionary.ContainsKey(Item)) { return this._dictionary[Item]; } return 0; }
  }
  
  /// <summary>
  /// Creates a RankingDictionary from the current FrequencyDictionary.
  /// </summary>
  /// <returns>A RankingDictionary of Frequency/ItemCollection pairs ordered by Frequency.</returns>
  public RankingDictionary<ITM> GetRankingDictionary()
  {
    RankingDictionary<ITM> result = new RankingDictionary<ITM>();
    foreach(KeyValuePair<ITM,int> kvp in _dictionary)
    {
      result.Add(kvp.Value,kvp.Key);
    }
    return result;
  }
  
  /// <summary>
  /// Displays usage information for FrequencyDictionary 
  /// </summary>
  public override string ToString()
  {
    return "FrequencyDictionary<Item, Frequency> : Key=\"Item=\", Value=\"Frequency\"\".";
  }
}




And now the RankingDictionary:

/// <summary>
/// A keyed collection of Frequency/ItemCollection pairs that is ordered by Frequency (rank).
/// If an Item is added that has the same Frequency as another Item, that Item is added to the Item collection for that Frequency.
/// </summary>
public class RankingDictionary<ITM>
{
  // Underlying dictionary
  SortedDictionary<int,List<ITM>> _dictionary;
  
  /// <summary>
  /// Initializes a new instance of the RankingDictionary that is empty.
  /// </summary>
  public RankingDictionary() { _dictionary = new SortedDictionary<int,List<ITM>>(new FrequencyComparer()); }
  
  /// <summary>
  /// The Comparer used to compare Frequencies.
  /// </summary>
  public class FrequencyComparer : IComparer<int>
  {
    public int Compare(int one,int two) { if(one == two) return 0; else if(one > two) return -1; else return 1; }
  }
  
  /// <summary>
  /// Gets a collection containing the Frequencies in the RankingDictionary.
  /// </summary>
  public IEnumerable<int> FrequencyCollection { get { return this._dictionary.Keys; } }
  
  /// <summary>
  /// Gets a collection containing the ItemCollection in the RankingDictionary.
  /// </summary>
  public IEnumerable<List<ITM>> ItemCollections   { get { return this._dictionary.Values; } }
  
  /// <summary>
  /// Adds the specified Frequency and Item to the RankingDictionary.
  /// </summary>
  public void Add(int Frequency, ITM Item)
  {
    List<ITM> itemCollection;
    if (this._dictionary.TryGetValue(Frequency, out itemCollection))
    {
      // The Frequency already exists; add this Item to its collection
      itemCollection.Add(Item);
    }
    else
    {
      itemCollection = new List<ITM>();
      itemCollection.Add(Item);
      this._dictionary.Add(Frequency, itemCollection);
    }
  }
  
  /// <summary>
  ///  Gets the number of Frequency/ItemCollection pairs contained in the RankingDictionary.
  /// </summary>
  public int Count { get { return this._dictionary.Count; } }
  
  /// <summary>
  /// Returns an enumerator that iterates through the RankingDictionary.
  /// </summary>
  public IEnumerator<KeyValuePair<int,List<ITM>>> GetEnumerator()
  {
    return this._dictionary.GetEnumerator();
  }
  
  /// <summary>
  /// Gets the ItemCollection associated with the specified Frequency.
  /// </summary>
  public List<ITM> this[int Frequency]
  {
    get
    {
      List<ITM> itemCollection;
      if (this._dictionary.TryGetValue(Frequency,out itemCollection)) return itemCollection;
      else return new List<ITM>();
    }
  }
  
  /// <summary>
  /// Displays usage information for RankingDictionary 
  /// </summary>
  public override string ToString()
  {
    return "RankingDictionary<Frequency, List<Item>> : Key=\"Frequency\", Value=\"List<Item>\".";
  }
}


Usage examples will be posted later.
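In the meantime, a minimal usage sketch (hypothetical input, using the two classes above):

string text = "the quick brown fox jumps over the lazy dog the end";

FrequencyDictionary<string> frequencies = new FrequencyDictionary<string>();
frequencies.AddRange(text.Split(' '));

Console.WriteLine(frequencies["the"]); // 3

RankingDictionary<string> rankings = frequencies.GetRankingDictionary();
foreach (KeyValuePair<int, List<string>> rank in rankings)
{
    // e.g. "3: the" prints first, since the dictionary is ordered by rank
    Console.WriteLine(rank.Key + ": " + string.Join(", ", rank.Value));
}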

Wednesday, August 7, 2013

Pseudo 'random' even distribution table




In my last post, I discussed what a co-prime is and showed you how to find them.

So, what's so special about relatively prime numbers? Well, they can be used to create a one-to-one distribution table that is seemingly random, but is deterministically calculated (created).

To understand what I mean, picture the face of a clock...


It has the hours 1 through 12, and if you add an hour to 12, you get 1. This can also be thought of as a single digit in a base 12 number system. Now we need a co-prime to 12. 7 is relatively prime to 12, so let's choose 7.

Starting at hour 1, if we add 7 hours, it will be 8. If we add 7 more hours, we will get 3. 7 more, 10. If we keep adding 7 hours to our clock, the hour hand will land on each of the different numbers exactly once before repeating itself, 12 steps later. Intrigued yet?

If, say, we find a co-prime to the largest number that can be represented by a byte (8 bits, 256 values [also expressed as 2^8=256, or 8=Log2(256)]), we can create an array of bytes with a length of 256, containing each of the 256 different possible bytes, distributed in a seemingly random order. The discrete order, or sequence, in which each number is visited is completely dependent on the value of the co-prime that was selected.

This table is now essentially a one-to-one, bijective mapping of one byte to another. To express this mapping to another party, say to map a stream of bytes back to their original values (decrypt), the entire table need not be exchanged, only the co-prime.
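A sketch of how such a table might be built from a chosen co-prime:

// Walk the byte range in steps of the co-prime (mod 256); because
// gcd(coprime, 256) == 1, every byte value is visited exactly once.
public static byte[] BuildDistributionTable(int coprime)
{
    byte[] table = new byte[256];
    int index = 0;
    for (int i = 0; i < 256; i++)
    {
        table[i] = (byte)index;
        index = (index + coprime) % 256;
    }
    return table;
}

// BuildDistributionTable(7) begins: 0, 7, 14, 21, ... hitting each value 0-255 exactly once.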

This provides a foundation for an encryption scheme whose technical requirements are similar to handling a cipher-block-chain (CBC) and its changing IV (initialization vector).

Now, it is easy to jump to the conclusion that such an encryption scheme is less secure than a CBC, but this is not necessarily the case. While this approach may be conceptually simpler, the difficulty of discovering the sequence can be made arbitrarily hard.

First of all, the number of integers relatively prime to 256 is infinite (any odd number is co-prime to 256). A co-prime to 256 does not have to be less than 256; indeed, it may be several thousand times greater. Additionally, any prime greater than 256 is, by definition, co-prime to 256, and will likely produce a seemingly more 'random' distribution/appearance.

There is, however, a limit here. It has nothing to do with the number of co-primes, but is instead bounded by the number of possible sequences that can be represented by our array of 256 bytes; eventually, two different co-primes are going to map to the same unique sequence. The order matters, and we don't allow repetition within a sequence. This is called a permutation without repetition, and the count can be expressed as 256!, or 256 factorial: the product 256 * 255 * 254 * 253 * [...] * 4 * 3 * 2 * 1, which equals exactly this number:

857817775342842654119082271681232625157781520279485619859655650377269452553147589377440291360451408450375885342336584306157196834693696475322289288497426025679637332563368786442675207626794560187968867971521143307702077526646451464709187326100832876325702818980773671781454170250523018608495319068138257481070252817559459476987034665712738139286205234756808218860701203611083152093501947437109101726968262861606263662435022840944191408424615936000000000000000000000000000000000000000000000000000000000000000

Yeah, that's right: that number has exactly 63 zeros on the end and is 507 digits long. (As an aside, the reason there are so many zeros on the end of this number is, for one, that it is highly composite, but more specifically, its prime factorization includes 2^255 and 5^63, and so 63 fives multiply with 63 of those twos to make 63 tens, hence that many zeros.)

Above I said arbitrarily hard. So far we have only considered one table, but try to fathom the complexity of many tables. I present three different ways to use multiple tables: nested, sequential, and mangled.
Furthermore, the distribution tables can be discarded and replaced.

I will explain what those mean and finish this post tomorrow.


Thursday, August 1, 2013

Greatest common divisor & Co-primeness



Below are three different methods for finding the greatest common divisor of two numbers. I provide a method using modulus, one using Euclid's method, and one that uses the Stein method.

Believe it or not, the modulus method has the best performance. I would have guessed the Stein method. Credit goes to BlackWasp for the Stein method.


public static uint FindGCDModulus(uint value1, uint value2)
{
        while(value1 != 0 && value2 != 0)
        {
                if (value1 > value2)
                {
                        value1 %= value2;
                }
                else
                {
                        value2 %= value1;
                }
        }
        return Math.Max(value1, value2);
}

public static uint FindGCDEuclid(uint value1, uint value2)
{
        while(value1 != 0 && value2 != 0)
        {
                if (value1 > value2)
                {
                        value1 -= value2;
                }
                else
                {
                        value2 -= value1;
                }
        }
        return Math.Max(value1, value2);
}

public static uint FindGCDStein(uint value1, uint value2)
{
        if (value1 == 0) return value2;
        if (value2 == 0) return value1;
        if (value1 == value2) return value1;

        bool value1IsEven = (value1 & 1u) == 0;
        bool value2IsEven = (value2 & 1u) == 0;
                       
        if (value1IsEven && value2IsEven)
        {
                return FindGCDStein(value1 >> 1, value2 >> 1) << 1;
        }
        else if (value1IsEven && !value2IsEven)
        {
                return FindGCDStein(value1 >> 1, value2);
        }
        else if (value2IsEven)
        {
                return FindGCDStein(value1, value2 >> 1);
        }
        else if (value1 > value2)
        {
                return FindGCDStein((value1 - value2) >> 1, value2);
        }
        else
        {
                return FindGCDStein(value1, (value2 - value1) >> 1);
        }
}

Here are the results of my benchmark tests:

Stein: 36.4657 milliseconds.
Euclid: 31.2369 milliseconds.
Modulus: 7.5199 milliseconds.



Yeah, I couldn't believe it either, the modulus approach is the fastest!

Here is the benchmark code that I created:

public class CodeStopwatch
{
        //int m_pointer;
        Stopwatch m_elapsed;
        
        public CodeStopwatch()
        {
                Reset();
        }

        public double Stop()
        {
                m_elapsed.Stop();
                return m_elapsed.Elapsed.TotalMilliseconds;
        }
        
        public void Reset()
        {
                GC.Collect(GC.MaxGeneration);
                m_elapsed = new Stopwatch();
                m_elapsed.Start();
        }
}

public class CoprimeBenchmark
{
        public delegate bool IsCoprimeDelegate(uint value1, uint value2);
        public List<uint> FindCoprimesDelegate(uint quantity,IsCoprimeDelegate isCoprime)
        {
                List<uint> results = new List<uint>();
                if(quantity<2) {
                        return results;
                }
                
                for (uint count = 2; count < quantity; count++)
                {
                        if(isCoprime(count,quantity))
                        {
                                results.Add(count);
                        }
                }
                return results;
        }
        
        public static bool IsCoprimeModulus(uint value1, uint value2)
        {
                return FindGCDModulus(value1, value2) == 1;
        }
        
        public static bool IsCoprimeEuclid(uint value1, uint value2)
        {
                return FindGCDEuclid(value1, value2) == 1;
        }
        
        public static bool IsCoprimeStein(uint value1, uint value2)
        {
                return FindGCDStein(value1, value2) == 1;
        }

        // [ Please refer to methods above. Omitted for brevity. ]
        public static uint FindGCDModulus(uint value1, uint value2) { ... }
        public static uint FindGCDEuclid(uint value1, uint value2) { ... }
        public static uint FindGCDStein(uint value1, uint value2) { ... }
}

Usage of above classes:

// Button OnClick event
void ButtonBenchmarkClick(object sender, EventArgs e)
{
        uint number = 50000;
        List<uint> resultsStein = new List<uint>();
        List<uint> resultsEuclid = new List<uint>();
        List<uint> resultsModulus = new List<uint>();
        
        CoprimeBenchmark cop = new CoprimeBenchmark();
        
        CodeStopwatch time = new CodeStopwatch();
        resultsStein = cop.FindCoprimesDelegate(number,CoprimeBenchmark.IsCoprimeStein);
        string stein = time.Stop().ToString();
        
        time = new CodeStopwatch();
        resultsEuclid = cop.FindCoprimesDelegate(number,CoprimeBenchmark.IsCoprimeEuclid);
        string euclid = time.Stop().ToString();
        
        time = new CodeStopwatch();
        resultsModulus = cop.FindCoprimesDelegate(number,CoprimeBenchmark.IsCoprimeModulus);
        string modulus = time.Stop().ToString();
        
        textBoxOutput.Text = "Stein:\t" + stein + " milliseconds." + Environment.NewLine;
        textBoxOutput.Text += "Euclid:\t" + euclid + " milliseconds." + Environment.NewLine;
        textBoxOutput.Text += "Modulus:\t" + modulus + " milliseconds." + Environment.NewLine;
        
        textBoxOutput.Text += Environment.NewLine + Environment.NewLine;
        foreach(uint a in resultsStein)
        {
                textBoxOutput.Text += a.ToString() + Environment.NewLine;
        }
        
        textBoxOutput.Text += Environment.NewLine + Environment.NewLine;
        foreach(uint b in resultsEuclid)
        {
                textBoxOutput.Text += b.ToString() + Environment.NewLine;
        }
        
        textBoxOutput.Text += Environment.NewLine + Environment.NewLine;
        foreach(uint c in resultsModulus)
        {
                textBoxOutput.Text += c.ToString() + Environment.NewLine;
        }
}


--


Now, I will take the fastest algorithm and show you the code for determining co-primeness and finding a list of co-primes for a given number.

Finding co-primeness is simple: two numbers are relatively prime to each other if their greatest common divisor is equal to one.

My function FindCoprimes returns a List<uint> of co-primes, given a number and a range:

public static class Coprimes
{
        public static List<uint> FindCoprimes(uint number,uint min,uint max)
        {
                if(max<2 || min<2 || max<=min || number<1) {    return null;    }
                
                List<uint> results = new List<uint>();
                for(uint counter = min; counter<max; counter++) {
                        if(IsCoprime(number,counter)) {
                                results.Add(counter);
                        }
                }
                return results;
        }
       
        public static bool IsCoprime(uint value1, uint value2)
        {
                return FindGCDModulus(value1, value2) == 1;
        }
}

Pretty easy!

What is so special about co-primes? Well, one thing I use co-primes for is making an evenly distributed table. For an explanation and the code, see my post on Evenly Distributed Tables.



Tuesday, July 9, 2013

Procedural generation of cave-like maps for rogue-like games



This post is about procedural content generation of cave-like dungeons/maps for rogue-like games using what is known as the Cellular Automata method.

To understand what I mean by the cellular automata method, imagine Conway's Game of Life. Many algorithms use what is called the '4-5 method': a tile becomes a wall if it is a wall and 4 or more of its nine neighbors are walls, or if it is not a wall and 5 or more neighbors are walls. I start by 'seeding' the map, randomly filling each cell with a wall or space based on some weight (say, 40% of the time it places a wall). Then the automata step is applied multiple times over the entire map, visiting each x/y position iteratively and applying the 4-5 rule, precipitating walls and subsequently smoothing them. About 3 rounds is all that is required, with 4-5 rounds being pretty typical amongst most implementations. Perhaps a picture of the output will help you understand what I mean.

Using the automata-based method for procedural generation of levels will produce something similar to this:

Sure, the dungeon generation by linking rooms approach has its place, but I really like the 'natural' look of the automata-inspired method. I originally discovered this technique on a website called RogueBasin. It is a great resource for information concerning the different problems involved with programming a rogue-like game, such as NetHack or Angband.

One of the major problems developers run into while employing this technique is the formation of isolated caves. Instead of one big cave-like room, you may get a section of the map that is inaccessible without digging through walls. Isolated caves can trap key items or (worse) stairs leading to the next level, preventing further progress in the game. I have seen many different approaches proposed to solve this problem: discarding maps that have isolated caves, filling in the isolated sections, or finely tweaking the variables/rules to reduce occurrences of such maps. None of these are ideal (in my mind), and most require a way to detect isolated sections, which is another non-trivial problem in itself.

Despite this, I consider the problem solved, because I have discovered a solution that is dead simple and almost** never fails, because the rules of the automata generation itself dictate such. I call my method 'Horizontal Blanking', and you can probably guess how it works just from hearing the name. This step comes after the random filling of the map (initialization), but before the cellular automata iterations. After the map is 'seeded' with a random fill of walls, a horizontal strip in the middle of the map, about 3 or 4 blocks tall (depending on the rules), is cleared of all walls. Clearing a horizontal strip of sufficient height prevents a continuous vertical wall from being created and forming isolated caves in your maps. After horizontal blanking, you can begin applying the cellular automata method to your map.

** I say 'almost' because although it is not possible to get whole rooms that are disconnected from each other, it is possible to get tiny squares of blank space in the northern or southern walls that consist of about 4-5 blocks in total area. Often, these little holes will resolve themselves during the rounds of automata rules, but there still exists the possibility that one may persist. My answer to this edge case would be to use some rules around the placement of stairwells (and other must-find items), dictating that such objects must have a 2-3 block radius clear of walls to be placed.


See below for the code that produced the above screenshot, or click here to download the entire project with source code that you can compile yourself.

public class MapHandler
{
 Random rand = new Random();
 
 public int[,] Map;
 
 public int MapWidth   { get; set; }
 public int MapHeight  { get; set; }
 public int PercentAreWalls { get; set; }
 
 public MapHandler()
 {
  MapWidth = 40;
  MapHeight = 21;
  PercentAreWalls = 40;
  
  RandomFillMap();
 }

 public void MakeCaverns()
 {
  // By initializing column in the outer loop, it's only created ONCE
  for(int column=0, row=0; row <= MapHeight-1; row++)
  {
   for(column = 0; column <= MapWidth-1; column++)
   {
    Map[column,row] = PlaceWallLogic(column,row);
   }
  }
 }
 
 public int PlaceWallLogic(int x,int y)
 {
  int numWalls = GetAdjacentWalls(x,y,1,1);

  
  if(Map[x,y]==1)
  {
   if( numWalls >= 4 )
   {
    return 1;
   }
   return 0;   
  }
  else
  {
   if(numWalls>=5)
   {
    return 1;
   }
  }
  return 0;
 }
 
 public int GetAdjacentWalls(int x,int y,int scopeX,int scopeY)
 {
  int startX = x - scopeX;
  int startY = y - scopeY;
  int endX = x + scopeX;
  int endY = y + scopeY;
  
  int iX = startX;
  int iY = startY;
  
  int wallCounter = 0;
  
  for(iY = startY; iY <= endY; iY++) {
   for(iX = startX; iX <= endX; iX++)
   {
    if(!(iX==x && iY==y))
    {
     if(IsWall(iX,iY))
     {
      wallCounter += 1;
     }
    }
   }
  }
  return wallCounter;
 }
 
 bool IsWall(int x,int y)
 {
  // Consider out-of-bound a wall
  if( IsOutOfBounds(x,y) )
  {
   return true;
  }
  
  if( Map[x,y]==1  )
  {
   return true;
  }
  
  if( Map[x,y]==0  )
  {
   return false;
  }
  return false;
 }
 
 bool IsOutOfBounds(int x, int y)
 {
  if( x<0 || y<0 || x>MapWidth-1 || y>MapHeight-1 )
  {
   return true;
  }
  return false;
 }
Above is the main core of the logic.


Here is the rest of the program, such as filling, printing and blanking:
 
 public void PrintMap()
 {
  Console.Clear();
  Console.Write(MapToString());
 }
 
 string MapToString()
 {
  string returnString = string.Join(" ", // Separator between each element
                                    "Width:",
                                    MapWidth.ToString(),
                                    "\tHeight:",
                                    MapHeight.ToString(),
                                    "\t% Walls:",
                                    PercentAreWalls.ToString(),
                                    Environment.NewLine
                                   );
  
  List<string> mapSymbols = new List<string>();
  mapSymbols.Add(".");
  mapSymbols.Add("#");
  mapSymbols.Add("+");
  
  for(int column=0,row=0; row < MapHeight; row++ ) {
   for( column = 0; column < MapWidth; column++ )
   {
    returnString += mapSymbols[Map[column,row]];
   }
   returnString += Environment.NewLine;
  }
  return returnString;
 }
 
 public void BlankMap()
 {
  for(int column=0,row=0; row < MapHeight; row++) {
   for(column = 0; column < MapWidth; column++) {
    Map[column,row] = 0;
   }
  }
 }
 
 public void RandomFillMap()
 {
  // New, empty map
  Map = new int[MapWidth,MapHeight];

  int mapMiddle = 0; // Temp variable
  for(int column=0,row=0; row < MapHeight; row++)
  {
   for(column = 0; column < MapWidth; column++)
   {
    // If coordinates lie on the edge of the map
    // (creates a border)
    if(column == 0)
    {
     Map[column,row] = 1;
    }
    else if (row == 0)
    {
     Map[column,row] = 1;
    }
    else if (column == MapWidth-1)
    {
     Map[column,row] = 1;
    }
    else if (row == MapHeight-1)
    {
     Map[column,row] = 1;
    }
    // Else, fill with a wall a random percent of the time
    else
    {
     mapMiddle = (MapHeight / 2);

     if(row == mapMiddle)
     {
      Map[column,row] = 0;
     }
     else
     {
      Map[column,row] = RandomPercent(PercentAreWalls);
     }
    }
   }
  }
 }

 int RandomPercent(int percent)
 {
  if(percent>=rand.Next(1,101))
  {
   return 1;
  }
  return 0;
 }
 
 public MapHandler(int mapWidth, int mapHeight, int[,] map, int percentWalls=40)
 {
  this.MapWidth = mapWidth;
  this.MapHeight = mapHeight;
  this.PercentAreWalls = percentWalls;
  this.Map = map;
 }
}


And of course, the main function:
 
public static void Main(string[] args)
{
 char key  = new Char();
 MapHandler Map = new MapHandler();
 
 string instructions =
  "[Q]uit [N]ew [+][-]Percent walls [R]andom [B]lank" + Environment.NewLine +
  "Press any other key to smooth/step";

 Map.MakeCaverns();
 Map.PrintMap();
 Console.WriteLine(instructions);
 
 key = Char.ToUpper(Console.ReadKey(true).KeyChar);
 while(!key.Equals('Q'))
 {
  if(key.Equals('+')) {
   Map.PercentAreWalls+=1;
   Map.RandomFillMap();
   Map.MakeCaverns();
   Map.PrintMap();
  } else if(key.Equals('-')) {
   Map.PercentAreWalls-=1;
   Map.RandomFillMap();
   Map.MakeCaverns();
   Map.PrintMap();
  } else if(key.Equals('R')) {
   Map.RandomFillMap();
   Map.PrintMap();
  } else if(key.Equals('N')) {
   Map.RandomFillMap();
   Map.MakeCaverns();
   Map.PrintMap();
  } else if(key.Equals('B')) {
   Map.BlankMap();
   Map.PrintMap();
  } else if(key.Equals('D')) {
   // I set a breakpoint here...
  } else {
   Map.MakeCaverns();
   Map.PrintMap();
  }
  Console.WriteLine(instructions);
  key = Char.ToUpper(Console.ReadKey(true).KeyChar);
 }
 Console.Clear();
 Console.Write(" Thank you for playing!");
 Console.ReadKey(true);
}


See also: Roguebasin - Cellular Automata Method for Generating Random Cave-Like Levels


Thursday, June 27, 2013

Fake/Random Identity Generator




Inspiration


During my research on RSA cryptography and the importance of a truly random number for having a large key-space, I stumbled onto FakeNameGenerator.com. I thought the concept could be really useful for certain applications, and I could easily envision how to implement it in C# and make it extensible/customizable.

Take a look at these:

<?xml version="1.0" standalone="yes"?>
<DocumentElement>
  <Order>
    <Date>3/18/2005</Date>
    <TrackingNumber>1Z 8A8 238 01 9398 182 1</TrackingNumber>
    <FirstName>Keaton </FirstName>
    <LastName>Day</LastName>
    <StreetAddress>4828 Cherry St.</StreetAddress>
    <City>Nanticoke</City>
    <State>SC</State>
    <Zip>89130</Zip>
    <Email>HaleHale8026@mail.com</Email>
    <Phone>425-765-4520</Phone>
  </Order>

  <Payroll>
    <PhoneNumber>971-258-5703</PhoneNumber>
    <AltPhoneNumber>501-769-1331</AltPhoneNumber>
    <FirstName>Xyla </FirstName>
    <LastName>Hoover</LastName>
    <EmployeeID>499</EmployeeID>
    <HireDate>5/28/2011</HireDate>
    <Birthdate>5/28/1990</Birthdate>
    <SSN>520-52-4275</SSN>
    <AccountNumber>5696618825</AccountNumber>
    <RoutingNumber>575159859</RoutingNumber>
    <Address>8348 Court Ave.</Address>
    <City>Pittsburgh,</City>
    <State>PA.</State>
    <Zip>15201</Zip>
  </Payroll>
</DocumentElement>

 CREATE TABLE ReservationData (
  `id` mediumint(8) unsigned NOT NULL auto_increment,
  `UniqueID` MEDIUMINT default NULL,
  `TripDate` varchar(50) default NULL,
  `FirstName` varchar(255) default NULL,
  `LastName` varchar(255) default NULL,
  `Phone` varchar(100) default NULL,
  `AltPhone` varchar(100) default NULL,
  `Email` varchar(255) default NULL,
  `StreetAddress` varchar(255) default NULL,
  `City` varchar(50) default NULL,
  `State` varchar(50) default NULL,
  `Zip` varchar(10) default NULL,
  `Country` varchar(255) default NULL,
  `DayOfYear` varchar(50) default NULL,
  `TotalCost` varchar(50) default NULL,
  `Balance` varchar(10) default NULL,
  `CCard` varchar(18) default NULL,
  `Expires` varchar(5) default NULL,
  `CVC2` varchar(3) default NULL,
  PRIMARY KEY (`id`)
) TYPE=MyISAM AUTO_INCREMENT=1;

This would make great honey for a honey pot: just fill an SQL database with this random, realistic-looking data, serve it, and log any and all attempts to access, query, or dump the database. This can be done on a VM, and you have an easily deployed, high-interaction honeypot!

Aside from being able to see their IP address, I think the most useful data that can be attained is their behavior: what injection attacks are they using to drop the database? Write rules to protect your honey against trivial attempts such as the ' AND 1=(SELECT attacks, and see what they come up with next. Rule writing is inherently a cat-and-mouse game; honeypots like this clearly give the white-hats the upper hand.



Implementation



A quick, fast and dirty solution is to simply read a random line from a text file (e.g. Name_First.txt and Address_Street.txt).

This way, you can choose from names that are common, or customize your list to for different nationalities.

One could read the whole file into a string, split it into an array of strings, and then randomly select an index, but this would be unacceptable for very large files. Instead, we can set the file pointer to a random position that is less than the file's size, discard the remainder of the current line, and call ReadLine to get the next full line.




public string ReturnRandomLine(string FileName)
{
 string sReturn = string.Empty;
 
 using(FileStream myFile = new FileStream(FileName,FileMode.Open,FileAccess.Read))
 {
  using(StreamReader myStream = new StreamReader(myFile))
  {
   // Seek file stream pointer to a rand position...
   myStream.BaseStream.Seek(rand.Next(1,(int)myFile.Length),SeekOrigin.Begin);
   
   // Read the rest of that line.
   myStream.ReadLine();
   
   // Return the next, full line...
   sReturn = myStream.ReadLine();
  }
 }
 
 // If our random file position was too close to the end of the file, it will return an empty string
 // I avoided a while loop in the case that the file is empty or contains only one line
 if(System.String.IsNullOrWhiteSpace(sReturn)) {
  sReturn = ReturnRandomLine(FileName);
 }
 
 return sReturn;
}

Example use:


public string GenerateFirstName()
{
 return ReturnRandomLine("Name_First.txt") + " ";
}

public string GenerateLastName()
{
 return ReturnRandomLine("Name_Last.txt");
}

public string GenerateFullName()
{
 return GenerateFirstName() + GenerateLastName();
}

public string GenerateGender()
{
 if(ReturnPercent(84)) {
  return "Male";
 } else {
  return "Female";
 }
}

public string GenerateStreetNumber()
{
 return rand.Next(1,9999).ToString();
}

public string GenerateStreetName()
{
 return ReturnRandomLine("Address_Street.txt");
}

One limitation is where the data is relational, such as in the case of generating a random zip code along with the city and state that it exists in.

A quick work-around would be a combined CityZipState.txt file, where each line holds a consistent city/state/zip triple, as sketched below.
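Something like this (assuming a hypothetical tab-separated CityZipState.txt):

// Each line holds one consistent record, e.g. "Nanticoke\tSC\t89130",
// so city, state, and zip always agree with each other.
public string[] GenerateCityStateZip()
{
 return ReturnRandomLine("CityZipState.txt").Split('\t');
}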



Other types of data that can be generated that would not make sense to put in a text file:


public bool ReturnPercent(int Percent) // Return true Percent times out of 100, randomly
{
 int iTemp = rand.Next(1,101);
 
 if(iTemp<=Percent) {
  return true;
 } else {
  return false;
 }
}

public string GenerateDate(int YearFrom,int YearTo)
{
 int Month = rand.Next(1,13);
 int Day  = rand.Next(1,32);
 int Year = GenerateYear(YearFrom,YearTo);
 
 return Month.ToString() + "/" + Day.ToString() + "/" + Year.ToString();
}

public string GenerateYear(int YearFrom,int YearTo)
{
 return rand.Next(YearFrom,YearTo+1).ToString();
}

public string GeneratePhoneNumber()
{
 return GeneratePhoneNumber(ReturnRandomLine("PhoneNumber_Prefix.txt"));
}

public string GeneratePhoneNumber(string Prefix)
{
 int iThree = rand.Next(192,999);
 int iFour = rand.Next(1000,9999);
 
 return Prefix + iThree.ToString() + "-" + iFour.ToString();
}

public string GenerateSSN()
{
 int iThree = rand.Next(132,921);
 int iTwo = rand.Next(12,83);
 int iFour = rand.Next(1423,9211);
 return iThree.ToString() + "-" + iTwo.ToString() + "-" + iFour.ToString();
}

Obviously, these methods can be improved to conform to the standards of a real social security number, national identification number, credit card number, etc...


public string GenerateCCNum()
{
 string sCCNum = string.Empty;
 
 // Generate random 16-digit numbers until one passes validation
 do
 {
  int iTemp1 = rand.Next(10000000,99999999);
  int iTemp2 = rand.Next(10000000,99999999);
  sCCNum = iTemp1.ToString() + iTemp2.ToString();
 } while(!IsValidNumber(sCCNum));
 
 return sCCNum;
}

The implementation of IsValidNumber() is left as an exercise for the reader.
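For reference, the standard Luhn checksum is the usual approach; a minimal sketch:

public static bool IsValidNumber(string number)
{
 int sum = 0;
 bool doubleIt = false;
 
 // Walk the digits right-to-left, doubling every second digit;
 // a doubled digit greater than 9 has 9 subtracted (digit-sum rule)
 for (int i = number.Length - 1; i >= 0; i--)
 {
  int digit = number[i] - '0';
  if (doubleIt)
  {
   digit *= 2;
   if (digit > 9) { digit -= 9; }
  }
  sum += digit;
  doubleIt = !doubleIt;
 }
 // Valid Luhn numbers sum to a multiple of 10
 return (sum % 10) == 0;
}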

The serialization of your data is a trivial matter. Please see my post on a XML Serializable Dictionary, Tuple, and Object for the code to serialize an object (such as a list, or a class).