A lesson in being too clever

For a project I’m working on I need to move a dataset from a remote server to a local server for analysis. The dataset may contain several million rows.

My test dataset is 1.1 million rows with three fields of interest. (The real dataset will have many more fields.) All fields are assumed to be positive.

  • Field 1: an 8 digit integer, repeating or increasing slowly
  • Field 2: an integer likely between 1 and 20
  • Field 3: a decimal number with two digits of precision likely below 10,000

We move a lot of data around in JSON format, so I started by implementing my transfer in JSON. I knew it would be an unacceptably large file; it was 62 MB. Next I gzipped the file to see how much that helped. My mental heuristic for gzipping JSON is about a 90% reduction. This large file did better at 94%, weighing in at 4.5 MB. Not too bad.

Then I got clever.

Since the decimal field only has two digits of precision, I multiplied it by 100 and treated it as an integer. Now, with three integer fields, I transformed my data into a byte array (14 MB) and compressed that (3.2 MB). Ha, I beat JSON compression!
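The packing step can be sketched in Python. The post doesn't specify the exact byte layout, so the little-endian, 4-bytes-per-field format below is my assumption (it matches the observed size: 12 bytes × 1.1 M rows ≈ 14 MB):

```python
import gzip
import struct

# Hypothetical rows shaped like the post's three fields: an 8-digit id,
# a small integer, and a two-decimal amount.
rows = [
    (10191378, 3, 1499.99),
    (10191385, 12, 25.00),
    (10191385, 1, 9803.50),
]

# Scale the two-decimal field by 100 so every value is an integer, then
# pack each row as three 4-byte unsigned ints (12 bytes per row).
packed = b"".join(
    struct.pack("<III", f1, f2, round(f3 * 100)) for f1, f2, f3 in rows
)

compressed = gzip.compress(packed)

# The encoding is lossless: unpack the first row and rescale.
f1, f2, cents = struct.unpack_from("<III", packed, 0)
```

An 8-digit id (max 99,999,999) and the scaled amount (max 1,000,000) both fit comfortably in an unsigned 32-bit field, so nothing is truncated.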

I know gzip performs better on repeated data, and I knew the first field in the dataset was a slowly increasing number: most records are the same as the previous one or between 1 and 19 higher. To take advantage of this fact, I transformed the first field into a delta. So if the data was this:

  • 10191378
  • 10191385
  • 10191385
  • 10191392
  • 10191408

After my transform the dataset became:

  • 10191378
  • 7
  • 0
  • 7
  • 16

After compressing this transformed dataset, it was only 1.87 MB. Exciting!
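The transform and its inverse are only a few lines each; a sketch in Python (function names are mine):

```python
def delta_encode(values):
    """Replace each value (after the first) with its difference
    from the previous value."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode by cumulatively summing the deltas."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# The example from the post.
field1 = [10191378, 10191385, 10191385, 10191392, 10191408]
encoded = delta_encode(field1)  # [10191378, 7, 0, 7, 16]
assert delta_decode(encoded) == field1
```

The deltas are drawn from a tiny set of small values, which is exactly the kind of repetition gzip exploits.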

Then I realized what I should have realized at the start. I have a tabular dataset and there’s a well known format that already exists for that: CSV.

So I rendered my dataset as CSV (25 MB) and compressed it (3.4 MB). That’s definitely better than JSON and very comparable to the byte array.

Then I applied my delta transform to the CSV and recompressed it. The result: 1.83 MB. That’s about 40 KB smaller than my “clever” solution. And CSV is far more adaptable than byte arrays. I wouldn’t even need to transform decimals into integers.
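The trend is easy to reproduce on synthetic data. The sketch below generates rows shaped like the described fields (the exact sizes won't match the author's real measurements) and compares gzipped CSV with and without the delta transform on the first column:

```python
import csv
import gzip
import io
import random

random.seed(0)

# Synthetic rows: a slowly increasing 8-digit id, a small integer,
# and a two-decimal amount. Shapes are from the post; values are made up.
rows, f1 = [], 10191378
for _ in range(100_000):
    f1 += random.choice([0, 7, 14])
    rows.append((f1, random.randint(1, 20), round(random.uniform(0, 10_000), 2)))

def to_csv(data):
    buf = io.StringIO()
    csv.writer(buf).writerows(data)
    return buf.getvalue().encode()

# Delta-transform only the first column; keep the first row as-is.
delta_rows = [rows[0]] + [
    (b[0] - a[0], b[1], b[2]) for a, b in zip(rows, rows[1:])
]

plain_gz = gzip.compress(to_csv(rows))
delta_gz = gzip.compress(to_csv(delta_rows))
# delta_gz comes out noticeably smaller than plain_gz.
```

The first column shrinks from eight characters per row to one or two highly repetitive characters, which is where the compression win comes from.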

So my lesson learned is that the widely used file formats are worth using after all.

A small amount of time invested in thinking about the constraints of this test dataset reduced the compressed file size by almost 50%. That said, I suspect that on the real dataset, with more columns, the gains from applying a delta transform to one column will be much less significant.

Scripting database changes

On my team we use SQL Server Database Projects to version our databases. We previously used EF6 for our data layer but recently started using EF Core, which gave me the opportunity to optimize my workflow. Here are the steps we used to take manually to modify the schema.

  1. Open the database project and alter the schema.
  2. Build the project.
  3. Publish the project to a local test database.
  4. Make the corresponding change in the EF Core model.

I decided to script it and add the script as a shortcut in Launchy. Here is my script.

"C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Current\Bin\msbuild.exe" C:\code\prism\DB\PrismDB.sln /property:Configuration=Release

if %errorlevel% neq 0 exit /b %errorlevel%

"C:\Program Files\Microsoft SQL Server\150\DAC\bin\SqlPackage.exe" /Action:Publish /SourceFile:"C:\path\to\sql\project\bin\Release\MyDatabase.dacpac" /TargetServerName:localhost /TargetDatabaseName:MyDatabase /p:AllowIncompatiblePlatform=True /p:BlockOnPossibleDataLoss=False /TargetTimeout:120 /p:ScriptDatabaseOptions=False

if %errorlevel% neq 0 exit /b %errorlevel%

pushd C:\path\to\sql\project

dotnet ef dbcontext scaffold "Server=localhost;Database=MyDatabase;Trusted_Connection=True;" Microsoft.EntityFrameworkCore.SqlServer --output-dir ../Data/Entities --context Entities --force --project Data.Dummy

if %errorlevel% neq 0 exit /b %errorlevel%

My workflow with the script is:

  1. Open the database project and alter the schema.
  2. Press Alt+Spacebar and type “database” to execute my script.
  3. Go back to coding in Visual Studio and everything is up to date.

To improve the error experience, you can create a second script that calls the above script with cmd’s /k switch, which prevents the command shell from closing at the end.

cmd /k "Database Run All Inner.bat"

Custom metric in Application Insights

Tracking custom metrics in Application Insights is easy. I wanted to track how long our cash register takes to print receipts so I could compare performance across hardware, make better recommendations to our sales team, and diagnose customer issues related to printing speed.

You will need a TelemetryClient instance. Use the GetMetric() method to get or create a metric by name. You can use the overloads to provide names for additional custom dimensions. In this case I am tracking the receipt number and the number of images printed on the receipt.

Call TrackValue() to add a new measurement. The TelemetryClient aggregates the metrics over time and reports them to Application Insights. The default interval appears to be 54 seconds.

In my case, aggregation is not doing much since each printed receipt has unique dimensions and a register is not likely to print more than one receipt every 54 seconds.

var metric = _telemetry.GetMetric("PrintReceiptDurationMs", "ReceiptNumber", "ImageCount");
metric.TrackValue(sw.ElapsedMilliseconds, receipt.ReceiptNumber, imageCount.ToString());

In Log Analytics you can now query for the results.

customMetrics
| where name == 'PrintReceiptDurationMs'
| extend receipt_number = tostring(customDimensions.ReceiptNumber)
| extend image_count = todouble(customDimensions.ImageCount)
| project value, receipt_number, image_count
| order by receipt_number desc

Or you could plot a chart.

customMetrics
| where name == 'PrintReceiptDurationMs'
| summarize avg(value) by todouble(customDimensions.ImageCount)
| render barchart

You can query across all metrics if they share common custom dimensions.

customMetrics
| where customDimensions.ReceiptNumber == 'RC-00092261-7'
| project name, value, timestamp
| order by name

Prototype: Generating Vue forms from JSON

I’m fascinated with generated code. I love to write code, but when it comes to repetitive CRUD screens, nothing beats a template. Being able to quickly generate screens builds confidence with clients and gets you right into the meat of the application.

I used to build applications primarily in ASP.NET MVC. Recently I’ve started using Vue and immediately missed having input and form builders. Since I still use a C# Web API on the back end, I had to find a creative way to get the C# model from the server to the client. I did this using a modified JSON Schema. I tried several libraries but was not happy with the extensibility of any of them. This proof of concept uses NJsonSchema.

You’re probably here for the client side. Here’s a demo. The code is in the second and third tabs.

This form requires two objects from the server: a schema object describing the form and an object containing the form data.

The type attribute comes from the C# property type. Where NJsonSchema was too limiting, I added an xtype attribute so I could choose how to render DateTimes and option lists. Select-list options come from a property on the formData object, mapped from optionsFromProperty in the schema.
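To make the two objects concrete, here is a hypothetical pair, shown as Python dicts standing in for the JSON the component would receive. Only type, xtype, and optionsFromProperty come from the post; every field name and value is invented for illustration:

```python
# Hypothetical schema object: describes how to render each input.
schema = {
    "properties": {
        "name": {"type": "string"},
        "birthDate": {"type": "string", "xtype": "date"},
        "favoriteColor": {
            "type": "string",
            "xtype": "select",
            # Points at the form-data property holding the options.
            "optionsFromProperty": "colorOptions",
        },
    }
}

# Hypothetical form-data object: current values plus the option lists.
form_data = {
    "name": "Ada",
    "birthDate": "1990-01-01",
    "favoriteColor": "blue",
    "colorOptions": ["red", "green", "blue"],
}

# The select's options are looked up on the form-data object.
color_field = schema["properties"]["favoriteColor"]
options = form_data[color_field["optionsFromProperty"]]
```

Keeping the options on the data object rather than in the schema means the schema can stay static while the option lists vary per request.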

You can find the (currently) very ugly server side model here:


For simplicity I published the demo as a single component, but I did break it into several components in my own code.

I will probably end up writing my own schema generator so I’m not constrained by the assumptions of existing ones. JSON Schema is designed for validating JSON data, not building UIs, so I’m really stretching the use case here. I would prefer to use the DataAnnotations attributes whenever possible since many tools, like EF, Newtonsoft, and ASP.NET data validation, already generate and interpret them.

I couldn’t generate enum drop downs in this demo because NJsonSchema renders them as $ref properties which I didn’t want to interpret client-side.

It would also be great to have sensible default attributes so you can build a form directly from a plain class or EF table object without manually defining labels and enum/list data types.

In a production build scenario, you could precompile the schema as a dependency json file so only the form data is downloaded at run-time.

Thanks for reading! Let me know what features would be useful to you.

Discovering connections in code via Reflection, part 2

Ayende has a really neat post about using an AST visitor to generate workflow diagrams from code. I used that as inspiration to modify my previous pub/sub documentation generator to output GraphViz syntax. It was a trivial change.

Console.WriteLine($"{messageUsage.Publisher} -> {handler}");

I copy the output into the demo at https://www.planttext.com/ and it generates a diagram of all the messages passed in my application.

In the future, I may import the GraphViz nuget package and generate the diagram inside my own code.

A Start menu alternative

I recently installed Launchy on my machine to automate common actions. Launchy is a program that lets you quickly execute shortcuts via the keyboard. To activate it, press Alt+Spacebar then type your shortcut. It has autocomplete and defaults to the last shortcut you ran.

It indexes your Start Menu and Quick Launch. I created an additional index and added frequently visited Chrome bookmarks as well as some batch and PowerShell scripts that I regularly use.

Some of my shortcuts:

  • JIRA (link to current sprint)
  • Backlog (link to backlog)
  • Pull Requests
  • Prod Insights (my production activity monitor)
  • Script Out DBs (batch file regenerates EF Core files from a SQL DB)
  • Vue Debug (launches a background Vue compiler and watches for changes)
  • ES6 cheat sheet

Discovering connections in code via Reflection

I wrote a .NET application that makes heavy use of the publish/subscribe pattern. In order to help other developers learn about the code base I wrote a unit test that finds all publishers and subscribers and describes how they are connected.

Each published message is a class inheriting from IMessage.

Each subscriber inherits from ISubscribeTo<TheMessageType>.

This code uses an IL reflector (source code here) to find each location a message type is constructed (before it’s published) and type reflection to find all its subscribers. Then it builds a text document describing what methods publish each message type and what subscribes to it.

The output looks like this. One improvement would be to remove the return type from the method signature so it reads more naturally.

{Method} publishes  ==>  {message type}
	-> handled by {subscriber type}

AccountViewModel.Void Execute_NewAccountSelectedCmd()  ==>  AccountSelected
	-> CustomerDetailViewModel
AddCustomerAccountsViewModel.Void Execute_CloseCmd()  ==>  AccountSelected
	-> CustomerDetailViewModel
AppliedTenderViewModel.Void Execute_RemoveTenderCmd()  ==>  RemoveTender
	-> TransactionViewModel
AuthorizationService.User ValidateUser(System.String, System.String)  ==>  LogoutRequested
	-> RegisterViewModel
	-> TransactionViewModel
	-> HardwareService
BasketIdViewModel.Void Execute_ApplyCmd()  ==>  ApplyBasketId
	-> TransactionViewModel

The code:

public void ListOfAllPublishersAndSubscribers()
{
    Console.WriteLine("{Method} publishes  ==>  {message type}");
    Console.WriteLine("\t-> handled by {subscriber type}");
    Console.WriteLine("Discovered via Reflection. Duplicates not removed.");

    var domain = typeof(App).Assembly;
    var pos = typeof(TransactionViewModel).Assembly;
    var assemblies = new List<Assembly>() { domain, pos };

    // Map each message type name to the names of its subscriber types.
    var handlerType = typeof(ISubscribeTo<>);
    var handlersByType = assemblies
        .SelectMany(s => s.GetTypes())
        .SelectMany(s => s.GetInterfaces(), (t, i) => new { Type = t, Interface = i })
        .Where(p => p.Interface.IsGenericType && handlerType.IsAssignableFrom(p.Interface.GetGenericTypeDefinition()))
        .GroupBy(t => t.Interface.GetGenericArguments().First().Name)
        .ToDictionary(g => g.Key, g => g.Select(x => x.Type.Name));

    // Scan every method body for constructions of IMessage types.
    var imessage = typeof(IMessage);
    foreach (var messageUsage in assemblies
        .SelectMany(s => s.GetTypes())
        .Where(type => type.IsClass)
        .SelectMany(cl => cl.GetMethods().OfType<MethodBase>(), (t, mb) => new { t, mb })
        .SelectMany(a => MethodBodyReader.GetInstructions(a.mb), (a, i) => new { Publisher = $"{a.t.Name}.{a.mb.ToString()}", op = i.Operand as ConstructorInfo })
        .Where(a => a.op != null)
        .Where(a => imessage.IsAssignableFrom(a.op.DeclaringType))
        .OrderBy(a => a.Publisher)
        .ThenBy(a => a.op.DeclaringType.Name))
    {
        Console.WriteLine($"{messageUsage.Publisher}  ==>  {messageUsage.op.DeclaringType.Name}");
        if (handlersByType.ContainsKey(messageUsage.op.DeclaringType.Name))
        {
            foreach (var handler in handlersByType[messageUsage.op.DeclaringType.Name])
                Console.WriteLine($"\t-> {handler}");
        }
        else
        {
            Console.WriteLine("\t-> NO HANDLERS");
        }
    }
}

Updating disconnected Entity Framework child collections

One pain point I have with Entity Framework is updating child collections on disconnected entities. The most common scenario I run into is in web APIs: I have a web page that allows a user to edit an entity with a child collection. New children can be added, existing children edited, and existing children deleted, and all the changes are saved to the server at once. When I POST this to my API, I end up writing a lot of boilerplate to figure out what changed. To avoid that, I came up with this method. A sample usage is below. This is for EF 6.1.

We Are Generation Earth

I stumbled across the BBC documentary mini-series Supersized Earth on Netflix and it’s really quite fascinating. Host Dallas Campbell explores how humans have changed the face of the Earth over the past 100 years by visiting some of the largest engineering projects around the world. They are just mind-boggling. In Hong Kong, over 3.5 million people live above the fourteenth floor; that’s like lifting the entire city of Chicago into high-rises. Our open-pit mines dive even deeper into the earth than our cities rise. We have dammed over a third of the world’s river flow capacity. And our cities don’t flood because we can divert rivers through underground caverns with pumps that could drain a swimming pool in a second.

The pace of change is increasing too. In 1936 Hoover Dam was the tallest dam in the world; today it doesn’t even make the top 25. The South-to-North aqueduct under construction in China, designed to relieve water shortages in the north, will be one of the longest rivers in the world, longer than the width of the continental US. China is also leading highway construction: in the last 20 years it has built more highways than exist in the US.

Another fascinating feat: a boat designed to transport the untransportable. Campbell visits the Blue Marlin which is preparing to transport an oil rig across the Pacific Ocean. Because the oil rig cannot be lifted, the Blue Marlin must sink 10 meters underwater to scoop it up.

Overall the documentary is very well produced, with slick animations woven with satellite images and some very impressive views. Campbell keeps it interesting too, undertaking challenges at each stop, like downhill bike racing, cleaning windows on the world’s tallest building, and detonating explosives at a mine. It’s since been removed from Netflix, but you can still see parts of it on YouTube.

Episode 1 (can’t find it), Episode 2, Episode 3

Measure What Matters To Customers

In Measure What Matters To Customers, Ron Baker challenges several common notions held by professional service firms including “costs drive prices” and “productivity can be measured by timesheets”. Too many firms, Baker says, are focused on optimizing production and lowering costs to the detriment of effectively serving their customers.

To be successful in today’s information economy, executives must shift their focus to ensuring the success of customers. In this new model, executives must increase the firm’s intellectual capital, price, and effectiveness. Baker advocates developing firm-wide Key Predictive Indicators: forward-looking predictors of customer success, not backward-looking performance measures. If you are helping your customers be successful, it’s likely you will be as well. KPIs should be generated by hypothesis and periodically tested. If a KPI isn’t actually predicting your firm’s outcomes, go back to the whiteboard.

Baker presents Gordon Bethune’s transformation of Continental Airlines as an example of the new business model. In the ’90s, Continental was a budget airline so cheap nobody wanted to fly on it. It ranked last in all airline performance measures. Every effort had been made to reduce the cost per seat mile traveled. Bethune shifted the focus to customer metrics: on-time arrivals, luggage lost, and complaints received. The airline quickly won more customer satisfaction awards than any other airline in the world, and the stock price increased 25X.

Baker also discusses the rise of the intellectual worker. He regards the timesheet as a remnant of Taylorism. Knowledge workers are not like workers of the industrial revolution. They are paid for their ideas, not hours worked. Setting billable hour quotas is demoralizing. Knowledge workers should be, at least in part, compensated for the results they produce in the form of bonuses or stock options.

Without timesheets, how should services be billed? Simple: set the price of the service relative to its value to the customer. With a price set upfront, the firm can tailor its service’s cost appropriately. Decoupling price from hours worked can lead to innovation within the company. By taking on the risk of a fixed-price contract, the firm gains the ability to earn far more than a margin on labor.*

I recommend this book to every professional services manager. Baker provides insight into where some of our widely held beliefs originated. I’m confident following his advice will help others find a profitable future serving their customers.

*For more on this topic, see his book Implementing Value Pricing.