Searching Big Data

Recently Ayende posted a sample interview question about searching large datasets. Here is my first attempt at solving it.

We have a 15TB csv file that contains web log, the entries are sorted by date(since this is how they were entered). Find all the log entries within a given date range. You may not read more than 32 MB.

Since we’re dealing with a huge dataset obviously we can’t just scan through it. I’m going to use a binary search. I adjusted the problem slightly to assume that I’m reading an IIS log file (which I have handy) instead of a CSV file. Each line of the IIS log file is either blank, starts with # to denote a header, or starts with a timestamp in the yyyy-MM-dd HH:mm:ss format.

When I outlined the algorithm I wanted my match found condition to be startOfSearchRegion == endOfSearchRegion, but since I always adjust the offset to the beginning of the line the code came out cleaner to just track the last two offsets and compare them instead of accounting for endOfSearchRegion always ending up one line after startOfSearchRegion. I feel like I should be able to clean up this logic if I reviewed it a bit closer.

This was my first experience with seeking in a file stream so there were a few hiccups I encountered. I did not realize StreamReader is buffered. Instead of using Seek() I tried using BaseStream.Position, but that only resulted in reading gibberish. I had to call DiscardBufferedData() after each seek operation, which according to MSDN is a slow operation. Moving backwards in bigger steps could have allowed me to take advantage of the buffering. Perhaps there is a better stream implementation I should be using (or FileStream directly).

After writing my solution I went back and reviewed Ayende’s solution to see how I could improve mine. I found two areas where our solutions are similar but I prefer his.

First, instead of parsing to a DateTime to do a comparison, I could have leveraged the yyyy-MM-dd HH:mm:ss format to do a direct string comparison.

if(dateString.CompareTo(startString) < 0){
startOfSearchRegion = offset
} else {
endOfSearchRegion = offset;

Second, when scanning backwards in the file I should have taken advantage of reading a block of bytes into a buffer instead of seeking 1 byte at a time (as I suspected). This would also allow the solution to support multibyte encodings instead of just ASCII. See here for his code.

I thought this was a fun problem. I’ll definitely be looking for similar ones to sharpen my algorithm writing skills.