Using Iterator Blocks (C#)

Iterator blocks (more broadly called generators) are a special syntax for producing an enumerator. They allow fine-grained control over iteration, including:

  • Controlling the order of iteration

  • Modifying a collection during iteration

  • Loading a large collection in batches

  • Creating the collection as it is consumed

  • Implementing infinite sequences

An iterator block is implemented by creating a method or property getter that returns IEnumerable (or IEnumerable<T>, IEnumerator, IEnumerator<T>) and produces its elements with yield return statements.

Here’s an iterator that returns even numbers in the range [firstNumber..lastNumber].

IEnumerable<int> EvenSequence(int firstNumber, int lastNumber)
{
    for (int number = firstNumber; number <= lastNumber; number++)
    {
        if (number % 2 == 0)
        {
            yield return number;
        }
    }
}

// usage:
foreach (int number in EvenSequence(5, 18))
{
    Console.Write(number.ToString() + " ");
}
// Output: 6 8 10 12 14 16 18

Behind the scenes the compiler replaces an iterator block with a nested class that implements IEnumerable and IEnumerator. When MoveNext() is called on this class, code is executed until it reaches a yield return, a yield break, or the end of the block. A yield return statement sets the Current property to the yielded value and returns true from MoveNext(). A yield break, or reaching the end of the block, causes MoveNext() to return false. On each call to MoveNext(), execution resumes immediately after the previous yield return statement; the internal state of the iterator block is maintained between calls. (See my previous post on foreach if this did not make sense.)

Note that the Reset() method is not supported when consuming an iterator block; the compiler-generated enumerator throws NotSupportedException.
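To make this concrete, here is a hand-written approximation of the class the compiler generates for EvenSequence. The real generated type has an unspeakable name and some extra bookkeeping fields; the names and state values below are illustrative only.

using System;
using System.Collections;
using System.Collections.Generic;

class EvenSequenceEnumerator : IEnumerable<int>, IEnumerator<int>
{
    private readonly int _first, _last;
    private int _number;   // the local loop variable, lifted to a field
    private int _state;    // 0 = not started, 1 = suspended at a yield, 2 = finished

    public EvenSequenceEnumerator(int first, int last)
    {
        _first = first;
        _last = last;
    }

    public int Current { get; private set; }
    object IEnumerator.Current { get { return Current; } }

    public bool MoveNext()
    {
        if (_state == 0) { _number = _first; _state = 1; } // first call: enter the loop
        else if (_state == 1) { _number++; }               // resume after the last yield
        else return false;                                 // already finished

        for (; _number <= _last; _number++)
        {
            if (_number % 2 == 0)
            {
                Current = _number;   // the yield return: publish the value...
                return true;         // ...and suspend until the next MoveNext()
            }
        }
        _state = 2;
        return false;
    }

    public void Reset() { throw new NotSupportedException(); }
    public void Dispose() { }

    public IEnumerator<int> GetEnumerator() { return this; }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}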

Infinite Sequences

You can leverage an iterator block to write an infinite sequence generator. Here is a Fibonacci sequence generator:

IEnumerable<long> Fibonacci()
{
    yield return 1;
    yield return 2;
    long n = 1, m = 2;
    while (true)
    {
        var next = n + m;
        yield return next;
        n = m;
        m = next;
    }
}

The first time MoveNext() is called this yields 1. The second time it yields 2. The third time all the code through `yield return next;` is executed. The enumerator’s Current property is set to the value of `next` and MoveNext() returns true. Each subsequent call to MoveNext() resumes at `n = m`. Since the loop never exits, you can call MoveNext() an unlimited number of times.

The code below consumes Fibonacci() to calculate the sum of all even Fibonacci numbers less than or equal to 4,000,000.

long sum = 0;
foreach (var n in Fibonacci())
{
    if (n % 2 != 0) continue;
    if (n > 4000000) break;
    sum += n;
}
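
The same computation can be written with LINQ. TakeWhile is what makes consuming the infinite sequence safe here: it stops pulling values as soon as one exceeds the limit.

// requires using System.Linq;
long sum = Fibonacci()
    .TakeWhile(n => n <= 4000000)
    .Where(n => n % 2 == 0)
    .Sum();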

Multiple Yield Return Statements

You can use multiple yield return statements to emulate a collection initializer. I find this useful when I need to conditionally include elements in a collection.

IEnumerable<string> Foo() {
    yield return "First";
    yield return "Second";
    yield return "Third";
}
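
Conditional inclusion then falls out naturally. A hypothetical example (the names here are made up for illustration):

IEnumerable<string> GetReportColumns(bool includeAuditColumns) {
    yield return "Id";
    yield return "Name";
    if (includeAuditColumns) {
        yield return "CreatedOn";
        yield return "ModifiedBy";
    }
}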

Yield Break

yield break exits an iterator block early.

IEnumerable<string> Foo(bool exitEarly) {
    yield return "First";
    yield return "Second";
    if(exitEarly) yield break;
    yield return "Third";
}

Try…Finally

You can't yield return or yield break from inside of a finally block. This is because the finally block doesn't run until iteration leaves the try block, either by finishing normally or by the enumerator being disposed early (which foreach does when you break out of the loop).

IEnumerable<int> Foo() {
    try {
        yield return 1;
        yield return 2;
        yield return 3;
    } finally {
        Console.WriteLine("Finally block!");
    }
}
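
A quick way to see when that finally runs: break out of the loop early. foreach disposes the enumerator even on early exit, and disposing an iterator suspended inside a try block executes its finally.

foreach (var i in Foo()) {
    Console.WriteLine(i);
    if (i == 2) break;
}
// Output:
// 1
// 2
// Finally block!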

What does foreach really do? (.NET)

In this post I’ll explain how the C# compiler interprets the foreach keyword. I’ll start by looking at the common pattern of enumerating a list.

var numbers = new List<int>() { 1, 2, 3, 4, 5 };
foreach(var n in numbers) 
{
    Console.WriteLine(n);
}

List<T> is consumable by foreach because it implements IEnumerable<T>. Let's take a look at simplified versions of what IEnumerable<T> and IEnumerator<T> define.

// Simplified. In the BCL, IEnumerable<T> also inherits the non-generic
// IEnumerable, and IEnumerator<T> inherits IDisposable and the non-generic
// IEnumerator (which is where MoveNext() and Reset() actually live).
interface IEnumerable<T>
{
    IEnumerator<T> GetEnumerator();
}

interface IEnumerator<T>
{
    bool MoveNext();
    T Current { get; }
    void Reset();
}

Each call to GetEnumerator() will return a new instance of an enumerator. An enumerator does the heavy lifting for foreach. When an enumerator is created, it is initialized so that the first call to MoveNext() will advance the current item pointer to the first item. Calling Reset() will also return the enumerator to this state. Each subsequent call to MoveNext() will attempt to move the current pointer to the next item. MoveNext() returns true if Current points to a new item. It returns false if the end of the enumerable was reached.

The compiler utilizes IEnumerator’s members to internally rewrite a foreach loop to code that is roughly equivalent to the following:

var numbers = new List<int>() { 1, 2, 3, 4, 5 };
IEnumerator<int> enumerator = numbers.GetEnumerator();
try {
    while(enumerator.MoveNext()) {
        int n = enumerator.Current;
        Console.WriteLine(n); // The inner scope of the foreach block is embedded here
    }
} finally {
    ((System.IDisposable) enumerator).Dispose();
    // Or, if no implicit conversion to IDisposable exists:
    // System.IDisposable d = enumerator as System.IDisposable;
    // if (d != null) d.Dispose();
}

There are a few things to note about this code:

  • Enumerators are read-only and forward-only.

  • foreach’s implementation does not use anonymous methods. The compiler actually rewrites the code at compile time.

  • Separate foreach statements will always consume their own instance of the enumerator.

  • The enumerator is only disposed if it implements IDisposable. If the enumerator is a sealed type with no implicit conversion to IDisposable, the finally block will be empty.

  • Enumeration does not have exclusive access to the underlying collection. It is inherently not thread-safe. If you need thread-safe enumeration, use a collection from the System.Collections.Concurrent namespace or synchronize access yourself.

  • If the underlying collection is modified (an item is added or removed), the next call to MoveNext() or Reset() will throw InvalidOperationException.

  • Enumerators should not be passed between threads. Some enumerators detect that they are being used from a different thread than the one they were created on and throw an exception.

  • If the underlying collection is an array, the compiler will optimize this code by using a for loop and comparing the index to the array length (see the sketch after this list).

  • When iterating a multidimensional array, the rightmost index is increased first. (But don’t use multidimensional arrays in the first place.)

(The above rules are true for built-in .NET enumerators and iterator blocks. Details can be found in the C# language specification — ECMA-334: 8.8.4)
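
For the array case the rewrite skips the enumerator object entirely; it comes out roughly as:

int[] numbers = { 1, 2, 3, 4, 5 };
// foreach (var n in numbers) { ... } compiles to approximately:
for (int i = 0; i < numbers.Length; i++)
{
    int n = numbers[i];
    Console.WriteLine(n);
}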

foreach can also be used on any class that exposes a public GetEnumerator() method, even if it implements no interface at all. The compiler treats the returned object as an enumerator as long as it has a bool MoveNext() method and a Current property. This pattern dates back to .NET 1.0, before generics, when it was the only way for a collection to provide a strongly typed enumerator and avoid boxing value types. You might see this pattern if you deal with legacy code.
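
A contrived sketch of the pattern (these types are made up for illustration):

class CountDown   // implements no interfaces
{
    public CountDownEnumerator GetEnumerator() { return new CountDownEnumerator(3); }
}

class CountDownEnumerator
{
    private int _current;
    public CountDownEnumerator(int start) { _current = start + 1; }
    public bool MoveNext() { return --_current > 0; }
    public int Current { get { return _current; } }
}

// Compiles fine and prints 3 2 1:
foreach (var n in new CountDown())
    Console.Write(n + " ");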

Enumerating Strings

foreach can be used to enumerate a string as a sequence of chars.

foreach(var c in "foo") {
    Console.Write(c);
    Console.Write(' ');
}
// outputs: f o o 

Watch out for multiple enumeration

When working with a raw IEnumerable (such as the result of some LINQ extension methods), it's possible to do extra work by enumerating it multiple times. If the enumerable is the result of a database query, it will query the database multiple times.

var squares = ints().Select(x => x * x);    // var is IEnumerable<int>
foreach(var s in squares) {
    Console.Write(s);
}
foreach(var s in squares) {
    // This will rerun the select statement.
    Console.Write(s);
}

To prevent this, add .ToList() or .ToArray() after the Select call. This forces the enumeration to occur only once and caches the results in memory. ReSharper will automatically detect multiple enumeration and warn you.
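
With the results materialized up front, both loops above just read from memory:

var squares = ints().Select(x => x * x).ToList();   // the query runs once, here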

In my next post, I’ll discuss iterator blocks–a special syntax for finer-grained control over iteration.

Experiment: Writing a Memory Mapped File to Disk

For this experiment I used SysInternals’ Process Monitor to inspect how opening a file impacts writing a new memory mapped file to disk. I used C#’s FileStream and Win32’s CreateFile method with different options. After creating the file I mapped it to a view and wrote 4GB of ASCII character 137 to the file. I let the program pause for a few seconds and then disposed the file view.
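
The managed setup looked roughly like this (a minimal sketch, not the exact harness; the path is illustrative, and .NET 4's MemoryMappedFile API stands in for the raw CreateFileMapping/MapViewOfFile calls):

using System.IO;
using System.IO.MemoryMappedFiles;

const long Size = 4L * 1024 * 1024 * 1024;   // 4GB
var fs = File.Open(@"C:\temp\mmf.dat", FileMode.Create,
                   FileAccess.ReadWrite, FileShare.ReadWrite);

using (var mmf = MemoryMappedFile.CreateFromFile(
           fs, null, Size, MemoryMappedFileAccess.ReadWrite,
           null, HandleInheritability.None, leaveOpen: false))
using (var view = mmf.CreateViewAccessor(0, Size))
{
    for (long offset = 0; offset < Size; offset++)
        view.Write(offset, (byte)137);        // dirty every page
    System.Threading.Thread.Sleep(5000);      // pause before disposing the view
}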

I was hoping to see Windows immediately flush the file I opened with the WriteThrough and NoBuffering flags. However, in all cases Windows used Fast I/O to write the file to disk in two 2GB chunks when I disposed the view; no earlier writes were observed. I was curious whether Fast I/O was triggered because every single page was marked dirty or because I crossed a threshold of dirty pages. As a follow-up I tried writing 8kB of data at 512kB intervals throughout the file. With a patchwork of writes, Windows no longer used Fast I/O but called WriteFile individually for lengths varying from 32kB to 512kB (with a bias toward the high end). The writes were mostly sequential, with some periodically out of order.

To find the behavior I was originally seeking (immediate writes), I tried creating a file with the WriteThrough and NoBuffering flags and then writing to the FileStream directly. In this case, writing a 4kB buffer triggered a WriteFile call with the WriteThrough flag. Seeking forward in the file caused a synchronous zeroing of the intermediate bytes.

Comments

Windows Dev Center describes Fast I/O as a mechanism parallel to IRPs that bypasses the file system and storage driver stack by copying data directly between the user buffer and the system cache. According to Windows Internals, the FlushFileBuffers function can be used to trigger an immediate disk write when a file is opened with the WriteThrough flag.

The drive used for this experiment was a 240GB Intel SSD with the Microsoft driver (v6.1.7600.16385). Write caching was enabled.

Results


C# FileStream

var fs = File.Open(g, FileMode.Create, FileAccess.ReadWrite, FileShare.ReadWrite);

Events:
CreateFile Desired Access: Generic Read/Write, Disposition: OverwriteIf, Options: Synchronous IO Non-Alert, Non-Directory File, Open No Recall, Attributes: n/a, ShareMode: Read, Write, AllocationSize: 0, OpenResult: Created
SetEndOfFileInformationFile EndOfFile: 4,294,967,296
SetAllocationInformationFile AllocationSize: 4,294,967,296
...
FASTIO_RELEASE_FOR_SECTION_SYNCHRONIZATION
CreateFileMapping SyncType: SyncTypeOther
FASTIO_RELEASE_FOR_SECTION_SYNCHRONIZATION
ReadFile Offset: 0, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
ReadFile Offset: 32,768, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
... 32kB reads ...
ReadFile Offset: 4,294,934,528, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
FASTIO_ACQUIRE_FOR_CC_FLUSH
WriteFile Offset: 0, Length: 2,147,483,648, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
WriteFile Offset: 2,147,483,648, Length: 2,147,483,648, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
FASTIO_RELEASE_FOR_CC_FLUSH
FASTIO_ACQUIRE_FOR_CC_FLUSH
FASTIO_RELEASE_FOR_CC_FLUSH
CloseFile

Write_Through | NoBuffering

CreateFile(g,
    GenericRead | GenericWrite,
    Read,
    IntPtr.Zero,
    OpenAlways,
    Write_Through | NoBuffering,
    IntPtr.Zero);

Events:
CreateFile Desired Access: Generic Read/Write, Disposition: OpenIf, Options: Write Through, No Buffering, Synchronous IO Non-Alert, Non-Directory File, Attributes: n/a, ShareMode: Read, AllocationSize: 0, OpenResult: Created
ReadFile Offset: 0, Length: 0, I/O Flags: Non-cached, Priority: Normal
SetEndOfFileInformationFile EndOfFile: 4,294,967,296
SetAllocationInformationFile AllocationSize: 4,294,967,296
...
CreateFileMapping FILE LOCKED WITH WRITERS SyncType: SyncTypeCreateSection, PageProtection:
FASTIO_RELEASE_FOR_SECTION_SYNCHRONIZATION
CreateFileMapping SyncType: SyncTypeOther
FASTIO_RELEASE_FOR_SECTION_SYNCHRONIZATION
ReadFile Offset: 0, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
ReadFile Offset: 32,768, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
... 32kB reads ...
ReadFile Offset: 4,294,934,528, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
FASTIO_ACQUIRE_FOR_CC_FLUSH
WriteFile Offset: 0, Length: 2,147,483,648, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
WriteFile Offset: 2,147,483,648, Length: 2,147,483,648, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
FASTIO_RELEASE_FOR_CC_FLUSH
FASTIO_ACQUIRE_FOR_CC_FLUSH
FASTIO_RELEASE_FOR_CC_FLUSH
CloseFile

Normal

CreateFile(g,
    GenericRead | GenericWrite,
    Read,
    IntPtr.Zero,
    OpenAlways,
    Normal,
    IntPtr.Zero);

Events:
CreateFile Desired Access: Generic Read/Write, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, AllocationSize: 0, OpenResult: Created
ReadFile Offset: 0, Length: 0, Priority: Normal
SetEndOfFileInformationFile EndOfFile: 4,294,967,296
SetAllocationInformationFile AllocationSize: 4,294,967,296
...
CreateFileMapping FILE LOCKED WITH WRITERS SyncType: SyncTypeCreateSection, PageProtection:
FASTIO_RELEASE_FOR_SECTION_SYNCHRONIZATION
CreateFileMapping SyncType: SyncTypeOther
FASTIO_RELEASE_FOR_SECTION_SYNCHRONIZATION
ReadFile Offset: 0, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
ReadFile Offset: 32,768, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
... 32kB reads ...
ReadFile Offset: 4,294,934,528, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
FASTIO_ACQUIRE_FOR_CC_FLUSH
WriteFile Offset: 0, Length: 2,147,483,648, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
WriteFile Offset: 2,147,483,648, Length: 2,147,483,648, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
FASTIO_RELEASE_FOR_CC_FLUSH
FASTIO_ACQUIRE_FOR_CC_FLUSH
FASTIO_RELEASE_FOR_CC_FLUSH
CloseFile


Normal: Writing 8kB out of every 512kB

CreateFile Desired Access: Generic Read/Write, Disposition: OpenIf, Options: Synchronous IO Non-Alert, Non-Directory File, Attributes: N, ShareMode: Read, AllocationSize: 0, OpenResult: Created
ReadFile Offset: 0, Length: 0, Priority: Normal
SetEndOfFileInformationFile EndOfFile: 4,294,967,296
SetAllocationInformationFile AllocationSize: 4,294,967,296
...
CreateFileMapping FILE LOCKED WITH WRITERS SyncType: SyncTypeCreateSection, PageProtection:
FASTIO_RELEASE_FOR_SECTION_SYNCHRONIZATION
CreateFileMapping SyncType: SyncTypeOther
FASTIO_RELEASE_FOR_SECTION_SYNCHRONIZATION
ReadFile Offset: 0, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
ReadFile Offset: 2,097,152, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
... 32kB reads ...
ReadFile Offset: 2,145,386,496, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
WriteFile Offset: 0, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
WriteFile Offset: 2,097,152, Length: 32,768, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
WriteFile Offset: 32,768, Length: 524,288, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
.... mostly 512kB writes in no particular order, but various other sizes too ...
WriteFile Offset: 2,144,894,976, Length: 491,520, I/O Flags: Non-cached, Paging I/O, Synchronous Paging I/O, Priority: Normal
CloseFile

Project Euler: my solutions

I started working on the Project Euler problems tonight. I just posted my solutions (in C#) to problems 1-7 on GitHub.

I'm sort of regretting doing these in C#. I may switch to a functional language like F# to clean up the syntax. So far most of the problems are pretty trivial to brute force (e.g. it only takes about 5 seconds to calculate the 10,001st prime on my computer). I believe the spirit of the problems is to require optimization, so I might switch to C to focus on benchmarking my solutions. The JIT compilation and garbage collection in C# and F# make them difficult languages to benchmark accurately.

The most interesting problem so far was #4. The goal is to find the largest palindromic number that's the product of two 3-digit numbers. There are only 998,001 products (and nearly half are duplicates), so you could just calculate them all and sort them, but instead I decided to traverse the multiplication matrix in descending order to limit CPU time. To find the pattern I created the matrix in Excel and applied a color scale.

From here it's obvious that the correct traversal order to get the largest products first is diagonal, starting from the bottom left.

The key is to observe that the sum of the column and row indexes is constant along each diagonal. The traversal algorithm then looks like this:

public static IEnumerable<T> Traverse<T>(int rows, int cols, Func<int, int, T> generator)
{
    var stripes = rows + cols - 1;   // number of diagonal stripes
    for (int stripe = 1; stripe < stripes; stripe++)
    {
        for (int col = 1; col <= stripe; col++)
        {
            for (int row = stripe; row > stripe - col; row--)
            {
                // Cells on the same diagonal share a constant row + col sum;
                // col >= row skips the mirrored duplicate products.
                if ((col + row == stripe + 1) && (col >= row))
                {
                    yield return generator(row - 1, col - 1);
                }
            }
        }
    }
}

This only computes the matrix cells that the caller asks for. Quite a neat trick.

Edit: I added some non-LINQ solutions which are much faster than the original solutions. I still prefer writing the original solution in LINQ and then optimizing.

Searching Big Data

Recently Ayende posted a sample interview question about searching large datasets. Here is my first attempt at solving it.

We have a 15TB csv file that contains web log, the entries are sorted by date (since this is how they were entered). Find all the log entries within a given date range. You may not read more than 32 MB.

Since we're dealing with a huge dataset, we obviously can't just scan through it, so I'm going to use a binary search. I adjusted the problem slightly to assume that I'm reading an IIS log file (which I have handy) instead of a CSV file. Each line of the IIS log file is either blank, starts with # to denote a header, or starts with a timestamp in the yyyy-MM-dd HH:mm:ss format.

When I outlined the algorithm I wanted my match-found condition to be startOfSearchRegion == endOfSearchRegion, but since I always adjust the offset to the beginning of a line, the code came out cleaner tracking the last two offsets and comparing them instead of accounting for endOfSearchRegion always ending up one line after startOfSearchRegion. I feel like I should be able to clean up this logic if I reviewed it a bit closer.

This was my first experience with seeking in a file stream, so I hit a few hiccups. I did not realize StreamReader is buffered. Instead of using Seek() I tried setting BaseStream.Position, but that only resulted in reading gibberish. I had to call DiscardBufferedData() after each seek operation, which according to MSDN is a slow operation. Moving backwards in bigger steps could have let me take advantage of the buffering. Perhaps there is a better stream implementation I should be using (or FileStream directly).
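
For reference, the core of the approach looks roughly like this. It's a condensed sketch of the idea rather than my full solution: it assumes a single-byte encoding, skips blank and # header lines, and converges by comparing the last two line-start offsets as described above.

using System;
using System.Globalization;
using System.IO;

class LogSearch
{
    // Binary search for the byte offset of the first line whose
    // timestamp is >= target. Sketch only, not the full solution.
    public static long FindFirst(string path, DateTime target)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var reader = new StreamReader(stream))
        {
            long start = 0, end = stream.Length;
            long offset = -1, previousOffset = -2;
            while (offset != previousOffset)   // stop once the offset stops moving
            {
                previousOffset = offset;
                long middle = start + (end - start) / 2;
                offset = SeekToLineStart(reader, middle);
                string line = reader.ReadLine();
                while (line != null && (line.Length == 0 || line[0] == '#'))
                    line = reader.ReadLine();  // skip blanks and headers
                if (line == null || Timestamp(line) >= target)
                    end = middle;              // answer is at or before this line
                else
                    start = middle;            // answer is after this line
            }
            return offset;
        }
    }

    // Seek to position, then back up to the start of the current line.
    static long SeekToLineStart(StreamReader reader, long position)
    {
        Stream stream = reader.BaseStream;
        while (position > 0)
        {
            stream.Position = position - 1;
            if (stream.ReadByte() == '\n') break;  // 1 byte at a time: simple but slow
            position--;
        }
        stream.Position = position;
        reader.DiscardBufferedData();   // StreamReader buffers; reset it after seeking
        return position;
    }

    static DateTime Timestamp(string line)
    {
        return DateTime.ParseExact(line.Substring(0, 19),
            "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture);
    }
}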

After writing my solution I went back and reviewed Ayende's solution to see how I could improve mine. I found two places where we solved the same sub-problem but I prefer his approach.

First, instead of parsing each line to a DateTime to do a comparison, I could have leveraged the yyyy-MM-dd HH:mm:ss format to do a direct string comparison. Because the format is fixed-width, zero-padded, and ordered from most significant part to least, lexicographic order matches chronological order.

if (dateString.CompareTo(startString) < 0) {
    startOfSearchRegion = offset;
} else {
    endOfSearchRegion = offset;
}

Second, when scanning backwards in the file I should have read a block of bytes into a buffer instead of seeking 1 byte at a time (as I suspected). This would also allow the solution to support multibyte encodings instead of just ASCII. See here for his code.

I thought this was a fun problem. I’ll definitely be looking for similar ones to sharpen my algorithm writing skills.