
Saturday, June 20, 2015

Lazy IEnumerable file reader class: When to implement IDisposable?




    Need to read just the first few lines from a gigantic data file? Or maybe you need a forward-only, load-only-what-you-need file reader pattern? Here is a construct I have been toying with: it's a class that treats a file stream like an IEnumerable.

    Note added 8/1/15: TODO: Add constructor overload that accepts the starting filepointer position, optional ending filepointer position.

    This has the benefit of not using any resources if you never use it, and it lets you read gigantic files incrementally without loading them into memory. StreamReader keeps only a fixed-size internal buffer, so there should be no need to call StreamReader.DiscardBufferedData() while reading; that method is meant for resynchronizing after the underlying stream has been repositioned, and calling it mid-read can drop lines that were buffered but not yet returned. Because the result is an IEnumerable, you can write queries against it that are lazy; they don't execute until something actually NEEDS the results, such as a call to the ToList() or Count() extension methods. Be careful with ToList() if the file is a gigabyte or more, as calling ToList() will cause the whole thing to be read right then.

    If instead you just need to iterate through the lines until you find what you are looking for, or use LINQ and a predicate to search for a particular line that satisfies a condition, then this pattern will save your application from having to load the whole thing into memory:

using System.Collections.Generic;
using System.IO;

public class EnumerableFileReader
{
    public FileInfo File { get { return _file; } }
    public bool FileExists { get { return _file.Exists; } }

    private FileInfo _file;

    public EnumerableFileReader(string fileLocation)
        : this(new FileInfo(fileLocation))
    {
    }

    public EnumerableFileReader(FileInfo file)
    {
        if (!file.Exists)
        {
            throw new FileNotFoundException("File not found.", file.FullName);
        }

        _file = file;
    }

    public IEnumerable<string> FileLines()
    {
        if (!FileExists) yield break;

        // The using block becomes part of the iterator's cleanup logic, so the
        // reader is closed when enumeration finishes or the enumerator is disposed.
        using (StreamReader reader = _file.OpenText())
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
}
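
As a usage sketch of the lazy behavior described above: the helper below is a hypothetical stand-in for FileLines() (the names LazyLineDemo, ReadLinesLazily, FindFirst and Head are mine, not part of the class above), written so the example compiles on its own. Both queries stop reading as soon as they have what they need.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LazyLineDemo
{
    // Hypothetical stand-in for EnumerableFileReader.FileLines():
    // yields one line at a time and closes the reader when enumeration ends.
    public static IEnumerable<string> ReadLinesLazily(string path)
    {
        using (StreamReader reader = File.OpenText(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    // Returns the first line containing 'needle', or null if no line does.
    // Reading stops at the first match.
    public static string FindFirst(string path, string needle)
    {
        return ReadLinesLazily(path).FirstOrDefault(l => l.Contains(needle));
    }

    // Returns at most 'count' lines from the start of the file.
    public static List<string> Head(string path, int count)
    {
        return ReadLinesLazily(path).Take(count).ToList();
    }
}
```

Calling LazyLineDemo.Head(path, 3) on a multi-gigabyte file should only read enough of the file to produce three lines. If you don't need a wrapper class at all, the framework's File.ReadLines(path) offers similar lazy, line-at-a-time behavior.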

    It struck me that it might be a good idea to make the class implement IDisposable, so the StreamReader doesn't get left around in memory, holding a file open. Indeed, all those yield keywords make it look like the stream object will just hang around if the enumeration started by FileLines is never finished. However, it turns out this is probably not necessary, but the answer is, as you might expect: IT DEPENDS. It depends... on how you are going to use the class. Looking into the subject, I discovered that when you use the yield keyword, the compiler generates a nested class which implements the IEnumerable, IEnumerator and IDisposable interfaces and stores all the context data for you under the hood. I'm not going to drop the IL (or CIL) here, but if you are curious, just open up your IEnumerable class in ILSpy. Just make sure you change the language in the drop-down box at the top from C# to IL, otherwise the generated class will be hidden.

    So just when is our stream disposed of? Any time Dispose is called on the enumerator, as one might expect. That call does more than release the enumerator alone: it runs the cleanup for all of the associated constructs that are generated around the enumerator/yield pattern, including the using block around the StreamReader. The same cleanup also runs at the end of enumeration. This includes when you run out of things to enumerate, any time you hit a yield break, or whenever the program flow otherwise leaves the using statement surrounding the reader. Here is something I didn't know: the enumerator is also disposed when you call the First() or FirstOrDefault() extension methods. Also, any time you use a foreach loop over the IEnumerable, the enumerator gets disposed after you are done looping or after you break out of or otherwise leave the loop.
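
The disposal rules above can be checked with a small experiment. This is a sketch under my own assumptions: DisposalDemo, Numbers and CleanupCount are hypothetical names, and the try/finally stands in for the using block around the StreamReader (a using statement compiles down to the same try/finally shape).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DisposalDemo
{
    // Incremented each time the iterator's finally block runs.
    public static int CleanupCount;

    // Hypothetical iterator; the finally block plays the role of closing the reader.
    public static IEnumerable<int> Numbers()
    {
        try
        {
            for (int i = 1; i <= 5; i++)
            {
                yield return i;
            }
        }
        finally
        {
            CleanupCount++;
        }
    }

    // foreach disposes the enumerator when the loop is exited early.
    public static void BreakOutOfForeach()
    {
        foreach (int n in Numbers())
        {
            if (n == 2) break;
        }
    }

    // First() disposes the enumerator it creates after reading one item.
    public static int TakeFirst()
    {
        return Numbers().First();
    }

    // Manual enumeration without Dispose: the iterator stays suspended
    // inside the try block and the finally block does not run.
    public static void AbandonManually()
    {
        IEnumerator<int> e = Numbers().GetEnumerator();
        e.MoveNext();
    }
}
```

After BreakOutOfForeach() and TakeFirst(), CleanupCount has gone up by one each time; after AbandonManually() it has not changed, which is the case where a file handle would be left open.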

  
So, in short: As long as you're using LINQ extensions or foreach loops, the disposal is taken care of for you automatically. However, if you are manually calling GetEnumerator().MoveNext() on the method's result, then you need to call Dispose on that enumerator yourself, or rather, implement the IDisposable pattern.
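
For the manual case, one way to make sure Dispose is called is to wrap the enumerator itself in a using statement. A minimal sketch, assuming a hypothetical ReadLinesLazily helper that mirrors FileLines() (ManualEnumerationDemo and ReadTwoLines are also illustrative names):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ManualEnumerationDemo
{
    // Hypothetical stand-in for EnumerableFileReader.FileLines().
    public static IEnumerable<string> ReadLinesLazily(string path)
    {
        using (StreamReader reader = File.OpenText(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    // Reads up to two lines by calling MoveNext() by hand.
    public static List<string> ReadTwoLines(string path)
    {
        var result = new List<string>();
        using (IEnumerator<string> e = ReadLinesLazily(path).GetEnumerator())
        {
            while (result.Count < 2 && e.MoveNext())
            {
                result.Add(e.Current);
            }
        } // The enumerator is disposed here, which also closes the StreamReader.
        return result;
    }
}
```

The using statement guarantees the cleanup even if the code between MoveNext() calls throws, which is the same guarantee foreach gives you for free.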

    Being able to use EnumerableFileReader in a using statement/disposable pattern would likely be expected of a file reader class. You could have your Dispose method set a boolean flag and then call FileLines(), and add an if statement in the while loop of your FileLines() method that will yield break when the dispose flag is set to true, but cleaning up properly can be tricky if your iterator has more than one or two yield return statements. I would instead suggest that we use one of the tricks we just learned above and just have our Dispose() method call .FirstOrDefault() on the FileLines() method. Keep in mind that every call to FileLines() creates a new, independent enumerator, so this opens and immediately closes a fresh reader rather than closing an enumeration that is already in progress, which still has to be disposed through its own enumerator. The Dispose() method looks like this:


public class EnumerableFileReader : IDisposable
{
[...]

    public void Dispose()
    {
        // Requires "using System.Linq;" and FileLines() returning IEnumerable<string>.
        FileLines().FirstOrDefault();
    }

[...]
}
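
One thing worth verifying before relying on this Dispose() is how it interacts with an enumeration that is already under way. The sketch below uses a hypothetical TrackingReader class (not the class above) whose OpenCount field counts how many iterators are currently inside their try block, standing in for open StreamReaders.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class TrackingReader : IDisposable
{
    // Number of iterators currently suspended inside the try block.
    public int OpenCount;

    // Hypothetical iterator; try/finally stands in for the using block
    // around the StreamReader in FileLines().
    public IEnumerable<int> Items()
    {
        OpenCount++;
        try
        {
            yield return 1;
            yield return 2;
        }
        finally
        {
            OpenCount--;
        }
    }

    // Same trick as in the post: start a new enumeration and let
    // FirstOrDefault() dispose it.
    public void Dispose()
    {
        Items().FirstOrDefault();
    }
}
```

If you start an enumeration with GetEnumerator() and MoveNext(), OpenCount is 1. Calling Dispose() on the TrackingReader leaves it at 1, because FirstOrDefault() opened and closed a second, separate iterator. Only disposing the original enumerator brings it back to 0. So, as far as I can tell, the trick is harmless but is not a substitute for disposing each enumerator, which foreach and the LINQ extensions already do.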