A lesson in being too clever

For a project I’m working on I need to move a dataset from a remote server to a local server for analysis. The dataset may contain several million rows.

My test dataset is 1.1 million rows with three fields of interest. (The real dataset will have many more fields.) All fields are assumed to be positive.

  • Field 1: an 8-digit integer, repeating or increasing slowly
  • Field 2: an integer, likely between 1 and 20
  • Field 3: a decimal number with two decimal places, likely below 10,000

We move a lot of data around in JSON format, so I started by implementing my transfer in JSON. I knew it would be an unacceptably large file, and it was: 62 MB. Next I gzipped the file to see how much that helped. My mental heuristic for gzipping JSON is that you get about a 90% reduction. This file did a bit better, at roughly 93%, and weighed in at 4.5 MB. Not too bad.
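
For reference, here's a minimal sketch of that first step, assuming the rows come in as a list of dicts; the field names are made up for illustration:

  import gzip
  import json

  # Stand-in rows: each record has the three fields described above
  # (an 8-digit id, a small integer, and a value with two decimal places).
  rows = [
      {"id": 10191378, "code": 4, "amount": 1234.56},
      {"id": 10191385, "code": 12, "amount": 87.10},
  ]

  # Serialize to JSON, then gzip the result to compare the two sizes.
  raw = json.dumps(rows).encode("utf-8")
  compressed = gzip.compress(raw)
  print(len(raw), len(compressed))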

Then I got clever.

Since the decimal field only has two decimal places, I multiplied it by 100 and treated it as an integer. Now, with three integers per row, I packed my data into a byte array (14 MB) and compressed that (3.2 MB). Ha, I beat JSON compression!
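
Roughly what that packing looks like, assuming every value fits in an unsigned 32-bit integer (which holds for the ranges above); the field layout here is illustrative, not my exact code:

  import gzip
  import struct

  # Rows with the decimal field already scaled by 100, so every value
  # is an integer (e.g. 1234.56 becomes 123456).
  rows = [
      (10191378, 4, 123456),
      (10191385, 12, 8710),
  ]

  # Pack each row as three little-endian unsigned 32-bit integers:
  # 12 bytes per row, which lines up with ~14 MB for 1.1 million rows.
  packed = b"".join(struct.pack("<III", *row) for row in rows)
  compressed = gzip.compress(packed)
  print(len(packed), len(compressed))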

Now, I knew that gzip compresses repeated data better, and I knew the first field in the dataset was a slowly increasing number: most records are either the same as the previous one or between 1 and 19 higher. To take advantage of this, I transformed the first field into a delta from the previous row. So if the data was this:

  • 10191378
  • 10191385
  • 10191385
  • 10191392
  • 10191408

After my transform the dataset became:

  • 10191378
  • 7
  • 0
  • 7
  • 16

After compressing this transformed dataset, it was only 1.87 MB. Exciting!
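
The delta transform itself is only a few lines; in sketch form, something like:

  def delta_encode(values):
      # Keep the first value as-is; replace each later value with its
      # difference from the previous value.
      out = []
      prev = None
      for v in values:
          out.append(v if prev is None else v - prev)
          prev = v
      return out

  print(delta_encode([10191378, 10191385, 10191385, 10191392, 10191408]))
  # -> [10191378, 7, 0, 7, 16]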

Then I realized what I should have realized at the start. I have a tabular dataset, and there's a well-known format that already exists for that: CSV.

So I rendered my dataset as CSV (25 MB) and compressed it (3.4 MB). That’s definitely better than JSON and very comparable to the byte array.

Then I applied my delta transform to the CSV and recompressed it. The result: 1.83 MB. That's about 40 KB smaller than my "clever" solution, and CSV is far more adaptable than byte arrays. I wouldn't even need to transform decimals into integers.
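
Putting the CSV version together is about this much work; again a sketch with a made-up field layout, writing the delta-encoded first column and leaving the decimals alone:

  import csv
  import gzip
  import io

  # Illustrative rows: (id, code, amount), with the id column
  # delta-encoded the same way as in the earlier sketch.
  rows = [(10191378, 4, 1234.56), (10191385, 12, 87.10), (10191385, 3, 9.99)]

  buf = io.StringIO()
  writer = csv.writer(buf)
  prev = None
  for rid, code, amount in rows:
      writer.writerow([rid if prev is None else rid - prev, code, amount])
      prev = rid

  # Gzip the CSV text to get the final transfer size.
  compressed = gzip.compress(buf.getvalue().encode("utf-8"))
  print(len(buf.getvalue()), len(compressed))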

So my lesson learned: the widely used file formats are worth using after all.

A small amount of time spent thinking about the constraints of this test dataset cut the compressed file size by almost 50%. But I suspect that on the real dataset, with many more columns, the gains from delta-transforming a single column will be much less significant.