Showing posts with label List. Show all posts
Showing posts with label List. Show all posts

Wednesday, December 3, 2014

Word Frequency Dictionary & Sorted Occurrence Ranking Dictionary for Generic Item Quantification and other Statistical Analyses



Taking a little inspiration from DotNetPerl's MultiMap, I created a generic class that uses a Dictionary under the hood to create a keyed collection that tracks the re-occurrences or frequency matching or duplicate items of arbitrary type. For lack of a better name, I call it a FrequencyDicionary. Instead of Key/Value pairs, the FrequencyDictionary has the notion of a Item/Frequency pair, with the key being the Item. I strayed from Key/Value because I may ultimately end up calling the Item the Value.

I invented the FrequencyDictionary while writing a pisg-like IRC log file statistics generator program. It works well for word frequency analysis. After some pre-processing/cleaning of the text log file, I parse each line by spaces to get an array of words. I then use FrequencyDictionary.AddRange to add the words to my dictionary.

The FrequencyDictionary is essentially a normal Dictionary (with T being type String in this case), however when you attempt to add a key that already exists, it simply increments the Value integer. After processing the file, you essentially have a Dictionary where the Key is a word that was found in the source text, and the Value is the number of occurrences of that word in the source text.

Taking the idea even further, I created complementary Dictionary that essentially a sorted version of FrequencyDictionary with the Key/Values reversed. I call it a RankingDictionary because the Keys represent the number of occurrences of a word, with the Values being a List of words that occurred that many times in the source text. Its called a RankingDictionary because I was using it to produce a top 10 ranking of most popular words, most popular nicks, ect.

The FrequencyDictionary has a GetRankingDictionary() method in it to make the whole thing very easy to use. Typically I don't use the FrequencyDictionary too extensively, but rather as a means to get a RankingDictionary, which I base the majority of my IRC statistics logic on. The RankingDictionary also proved very useful for finding Naked Candidates and Hidden Pairs or Triples in my Sudoku solver application that I will be releasing on Windows, Windows Phone and will be blogging about shortly. Hell, I was even thinking about releasing the source code to my Sudoku App, since its so elegant and a great example of beautiful, readable code to a complex problem.

Anyways, the code for the Frequency and Ranking Dictionary is heavily commented with XML Documentation Comments, so I'm going to go ahead and let the code speak for itself. I will add usage examples later. In fact, I will probably release the pisg-like IRC Stats Prog source code since I don't think I'm going to go any farther with it.

Limitations: I ran into problems trying to parse large text files approaching 3-4 megabytes. Besides taking up a bunch of memory, the Dictionary begins to encounter many hash collisions once the size of the collection gets large enough. This completely kills performance, and eventually most the time is spent trying to resolve collisions, so it grinds to a near halt. You might notice the constructor public FrequencyDictionary(int Capacity), where you can specify a maximum capacity for the Dictionary. A good, safe ceiling is about 10,000. A better implementation of the GetHash() method might be in order, but is not a problem I have felt like solving yet.




/// <summary>
/// A keyed collection of Item/Frequency pairs, (keyed off Item).
/// If a duplicate Item is added to the Dictionary, the Frequency for that Item is incremented.
/// </summary>
public class FrequencyDictionary<ITM>
{
  // The underlying Dictionary
  private Dictionary<ITM, int> _dictionary;
  
  /// <summary>
  /// Initializes a new instance of the FrequencyDictionary that is empty.
  /// </summary>
  public FrequencyDictionary() { _dictionary = new Dictionary<ITM, int>(); }
  
  /// <summary>
  /// Initializes a new instance of the FrequencyDictionary that has a maximum Capacity limit.
  /// </summary>
  public FrequencyDictionary(int Capacity) { _dictionary = new Dictionary<ITM, int>(); }
  
  /// <summary>
  /// Gets a collection containing the Items in the FrequencyDictionary.
  /// </summary>
  public IEnumerable<ITM> ItemCollection { get { return this._dictionary.Keys; } }
  
  /// <summary>
  /// Gets a collection containing the Frequencies in the FrequencyDictionary.
  /// </summary>
  public IEnumerable<int> FrequencyCollection { get { return this._dictionary.Values;} }
  
  /// <summary>
  /// Adds the specified Item to the FrequencyDictionary.
  /// </summary>
  public void Add(ITM Item)
  {
    if  ( this._dictionary.ContainsKey(Item))   { this._dictionary[Item]++; }
    else    { this._dictionary.Add(Item,1); }
  }
  
  /// <summary>
  /// Adds the elements of the specified array to the FrequencyDictionary.
  /// </summary>
  public void AddRange(ITM[] Items)
  {
    foreach(ITM item in Items) { this.Add(item); }
  }
  
  /// <summary>
  /// Gets the Item that occurs most frequently.
  /// </summary>
  /// <returns>A KeyValuePair containing the Item (key) and how many times it has appeard (value).</returns>
  public KeyValuePair<ITM,int> GetMostFrequent()
  {
    int maxValue = this._dictionary.Values.Max();
    return this._dictionary.Where(kvp => kvp.Value == maxValue).FirstOrDefault();
  }
  
  /// <summary>
  /// Gets the number of Item/Frequency pairs contained in the FrequencyDictionary.
  /// </summary>
  public int Count { get { return this._dictionary.Count; } }
  
  /// <summary>
  /// Returns an enumerator that iterates through the FrequencyDictionary.
  /// </summary>
  public IEnumerator<KeyValuePair<ITM,int>> GetEnumerator()
  {
    return this._dictionary.GetEnumerator();
  }
  
  /// <summary>
  /// Gets the Frequency (occurrences) associated with the specified Item.
  /// </summary>
  public int this[ITM Item]
  {
    get { if (this._dictionary.ContainsKey(Item)) { return this._dictionary[Item]; } return 0; }
  }
  
  /// <summary>
  /// Creates a RankingDictionary from the current FrequencyDictionary.
  /// </summary>
  /// <returns>A RankingDictionary of Frequency/ItemCollection pairs ordered by Frequency.</returns>
  public RankingDictionary<ITM> GetRankingDictionary()
  {
    RankingDictionary<ITM> result = new RankingDictionary<ITM>();
    foreach(KeyValuePair<ITM,int> kvp in _dictionary)
    {
      result.Add(kvp.Value,kvp.Key);
    }
    return result;
  }
  
  /// <summary>
  /// Displays usage information for FrequencyDictionary 
  /// </summary>
  public override string ToString()
  {
    return "FrequencyDictionary<Item, Frequency> : Key=\"Item=\", Value=\"Frequency\"\".";
  }
}




And now the RankingDictionary:

/// <summary>
/// A keyed collection of Frequency/ItemCollection pairs that is ordered by Frequency (rank).
/// If an Item is added that has the same Frequency as another Item, that Item is added to the Item collection for that Frequency.
/// </summary>
public class RankingDictionary<ITM>
{
  // Underlying dictionary
  SortedDictionary<int,List<ITM>> _dictionary;
  
  /// <summary>
  /// Initializes a new instance of the FrequencyDictionary that is empty.
  /// </summary>
  public RankingDictionary() { _dictionary = new SortedDictionary<int,List<ITM>>(new FrequencyComparer()); }
  
  /// <summary>
  /// The Comparer used to compare Frequencies.
  /// </summary>
  public class FrequencyComparer : IComparer<int>
  {
    public int Compare(int one,int two) { if(one == two) return 0; else if(one > two) return -1; else return 1; }
  }
  
  /// <summary>
  /// Gets a collection containing the Frequencies in the RankingDictionary.
  /// </summary>
  public IEnumerable<int> FrequencyCollection { get { return this._dictionary.Keys; } }
  
  /// <summary>
  /// Gets a collection containing the ItemCollection in the RankingDictionary.
  /// </summary>
  public IEnumerable<List<ITM>> ItemCollections   { get { return this._dictionary.Values; } }
  
  /// <summary>
  /// Adds the specified Frequency and Item to the RankingDictionary.
  /// </summary>
  public void Add(int Frequency, ITM Item)
  {
    List<ITM> itemCollection = new List<ITM>();
    itemCollection.Add(Item);
    // If the specified Key is not found, a set operation creates a new element with the specified Key
    this._dictionary[Frequency] = itemCollection;
  }
  
  /// <summary>
  ///  Gets the number of Frequency/ItemCollection pairs contained in the RankingDictionary.
  /// </summary>
  public int Count { get { return this._dictionary.Count; } }
  
  /// <summary>
  /// Returns an enumerator that iterates through the RankingDictionary.
  /// </summary>
  public IEnumerator<KeyValuePair<int,List<ITM>>> GetEnumerator()
  {
    return this._dictionary.GetEnumerator();
  }
  
  /// <summary>
  /// Gets the ItemCollection associated with the specified Frequency.
  /// </summary>
  public List<ITM> this[int Frequency]
  {
    get
    {
      List<ITM> itemCollection;
      if (this._dictionary.TryGetValue(Frequency,out itemCollection)) return itemCollection;
      else return new List<ITM>();
    }
  }
  
  /// <summary>
  /// Displays usage information for RankingDictionary 
  /// </summary>
  public override string ToString()
  {
    return "RankingDictionary<Frequency, List<Item>> : Key=\"Frequency\", Value=\"List<Item>\".";
  }
}


Usage examples will be posted later.

Thursday, June 27, 2013

XML Serializable Dictionary, Tuple, and Object




Serializing a data class to an XML file using XmlSerializer is very useful. However, some of the most useful data classes in .NET are not serializable. Dictionary and Tuple most notably. If you looking for a blog post on how to make a Dictionary that accepts duplicate keys by storing the values values with identical keys in a List, please see this blog post.

The below SerializableDictionary class works by inheriting from IXmlSerializable, which requires you implement the following three methods:
* GetSchema() - Remember, you should always return null for this function.
* ReadXml(XmlReader reader)
* WriteXml(XmlWriter writer)
(Read about IXmlSerializable on MSDN)

Here is the code to serialize a dictionary or serialize a tuple:

namespace XMLSerializableDictionary
{
    using System;
    using System.Collections.Generic;
    using System.Runtime.Serialization;
    using System.Xml;
    using System.Xml.Schema;
    using System.Xml.Serialization;

     [Serializable]
    [XmlRoot("Dictionary")]
    public class SerializableDictionary<TKey, TValue>
        : Dictionary<TKey, TValue>, IXmlSerializable
    {
        private const string DefaultTagItem = "Item";
        private const string DefaultTagKey = "Key";
        private const string DefaultTagValue = "Value";
        private static readonly XmlSerializer KeySerializer =
                                        new XmlSerializer(typeof(TKey));

        private static readonly XmlSerializer ValueSerializer =
                                        new XmlSerializer(typeof(TValue));

        public SerializableDictionary() : base()
        {
        }

        protected SerializableDictionary(SerializationInfo info, StreamingContext context)
                : base(info, context)
        {
        }

        protected virtual string ItemTagName
        {
            get { return DefaultTagItem; }
        }

        protected virtual string KeyTagName
        {
            get { return DefaultTagKey; }
        }

        protected virtual string ValueTagName
        {
            get { return DefaultTagValue; }
        }

        public XmlSchema GetSchema()
        {
            return null;
        }

        public void ReadXml(XmlReader reader)
        {
            bool wasEmpty = reader.IsEmptyElement;

            reader.Read();

            if (wasEmpty)
            {
             return;
            }

            try
            {
                while (reader.NodeType != XmlNodeType.EndElement)
                {
                    reader.ReadStartElement(this.ItemTagName);
                    try
                    {
                        TKey tKey;
                        TValue tValue;

                        reader.ReadStartElement(this.KeyTagName);
                        try
                        {
                            tKey = (TKey)KeySerializer.Deserialize(reader);
                        }
                        finally
                        {
                            reader.ReadEndElement();
                        }

                        reader.ReadStartElement(this.ValueTagName);
                        try
                        {
                            tValue = (TValue)ValueSerializer.Deserialize(reader);
                        }
                        finally
                        {
                            reader.ReadEndElement();
                        }

                        this.Add(tKey, tValue);
                    }
                    finally
                    {
                        reader.ReadEndElement();
                    }

                    reader.MoveToContent();
                }
            }
            finally
            {
                reader.ReadEndElement();
            }
        }

        public void WriteXml(XmlWriter writer)
        {
            foreach (KeyValuePair<TKey, TValue> keyValuePair in this)
            {
                writer.WriteStartElement(this.ItemTagName);
                try
                {
                    writer.WriteStartElement(this.KeyTagName);
                    try
                    {
                        KeySerializer.Serialize(writer, keyValuePair.Key);
                    }
                    finally
                    {
                        writer.WriteEndElement();
                    }

                    writer.WriteStartElement(this.ValueTagName);
                    try
                    {
                        ValueSerializer.Serialize(writer, keyValuePair.Value);
                    }
                    finally
                    {
                        writer.WriteEndElement();
                    }
                }
                finally
                {
                    writer.WriteEndElement();
                }
            }
        }
    }
}


The idea behind the serializable tuple is we just make our own Tuple that stores the items by declaring the properties to represent them with their generic T type. If you are not used to working with generics, this can be a little strange. T1, T2 and T3 are just placeholders for the type that is to be determined by the calling function, or the function above that if the calling function uses generics too.

And a serializable tuple:

public class SerializableTuple<T1,T2,T3>
{
 public T1 Item1 { get; set; }
 public T2 Item2 { get; set; }
 public T3 Item3 { get; set; }
 
 public static implicit operator Tuple<T1,T2,T3>(SerializableTuple<T1,T2,T3>  st)
 {
  return Tuple.Create(st.Item1,st.Item2,st.Item3);
 }

 public static implicit operator SerializableTuple<T1,T2,T3>(Tuple<T1,T2,T3> t)
 {
  return new SerializableTuple<T1,T2,T3>() {
   Item1 = t.Item1,
   Item2 = t.Item2,
   Item3 = t.Item3
  };   
 }
 
 public SerializableTuple()
 {
 }
}


And finally, a generic object serializer and deserializer:


public static class XML
{
   public static class Serialize
   {
      public static void Object(string Filename, object obj)
      {
         using (StreamWriter streamWriter = new StreamWriter(Filename))
         {
            XmlSerializer xmlSerializer = new XmlSerializer(obj.GetType());
            xmlSerializer.Serialize(streamWriter, obj);
         }
      }
   }

   public static class DeSerialize
   {
      public static string Generic<T>(T data)
      {
         if (data == null)
            return string.Empty;

         string content = string.Empty;
         using (MemoryStream memoryStream = new MemoryStream())
         {
            XmlSerializer serializer = new XmlSerializer(typeof(T));
            serializer.Serialize(memoryStream, data);

            memoryStream.Seek(0, SeekOrigin.Begin);
            using (StreamReader reader = new StreamReader(memoryStream))
            {
               content = reader.ReadToEnd();
            }
         }
         return content;
      }

      public static object Object(string Filename, Type type)
      {
         object result = null;
         using (TextReader reader = new StringReader(Filename))
         {
            XmlSerializer serializer = new XmlSerializer(type);
            result = serializer.Deserialize(reader);
         }
         return result;
      }
   }
}

And perhaps after you serialize your data to an XML file, you would like to generate a schema XML file from it:
  
void XmlToSchema(string FileName)
{
 XmlReader xmlReader = XmlReader.Create(FileName);
 XmlSchemaSet schemaSet = new XmlSchemaSet();
 XmlSchemaInference schemaInfer = new XmlSchemaInference();
 schemaSet = schemaInfer.InferSchema(xmlReader);

 string outFilename = Path.ChangeExtension(FileName,".xsd");
 using(Stream streamOut = new FileStream(outFilename,FileMode.Create) )
 {
  TextWriter textWriter =  new StreamWriter(streamOut);    
  foreach (XmlSchema s in schemaSet.Schemas())
  {
   s.Write(textWriter );
  }
  textWriter .Close();
 }
}