FC58: O(N) data structure -- slowly growing O(N^2) algorithm

When memory gets larger, garbage collection becomes more and more expensive

In FC56: The deadly Linq.Enumerable.GroupBy you do not need and can't afford (beehiiv.com), we discussed DebugDiag’s memory dump analysis algorithm, which uses Linq.Enumerable.GroupBy. On paper, this should be an O(N) algorithm. But if you time the data injection process, you can easily see that processing gets slower and slower.
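To see this, it is enough to wrap the injection loop with a stopwatch. Here is a minimal sketch (not DebugDiag’s actual code; the inject delegate stands in for whatever per-object work the analysis does) that reports elapsed time per million objects:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class InjectionTimer
{
    // Process items one by one and report how long each batch of
    // 1,000,000 takes, which makes a gradual slowdown easy to spot.
    public static void InjectWithProgress<T>(IEnumerable<T> items, Action<T> inject)
    {
        var watch = Stopwatch.StartNew();
        long count = 0;

        foreach (var item in items)
        {
            inject(item);

            if (++count % 1_000_000 == 0)
            {
                Console.WriteLine($"{count:N0} objects, last million: {watch.ElapsedMilliseconds:N0} ms");
                watch.Restart();
            }
        }
    }
}
```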

Here are the first few million objects:

Time to inject 1 million objects is around 300 ms. Here are the last few million:

Time to inject 1 million objects is around 5,000 ms, roughly 15x more expensive.

The reason behind the gradual slowdown is that memory grows in proportion to the data volume, at around 60 bytes per object.

We have an O(N) data structure design here. As memory gets larger and larger, garbage collection becomes more and more expensive, turning the whole algorithm into a slowly growing O(N²) algorithm.

For comparison, here is the same progress report for our modified algorithm:

Data structure size stays almost flat at only 1.53 MB, because we only need to aggregate statistics per type, and processing time per million objects is around 85 ms. That is the power of O(1) data structure design.
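The modified algorithm is essentially per-type aggregation. A minimal sketch, assuming a string type name as the key and a simple count/size payload (the real key and payload types in DebugDiag are not shown here):

```csharp
using System.Collections.Generic;

// Per-type aggregation: memory is proportional to the number of distinct
// types, not the number of objects, so the data structure stays small.
// The string key and the (Count, TotalSize) payload are illustrative,
// not the actual DebugDiag types.
sealed class TypeStats
{
    public long Count;
    public long TotalSize;
}

sealed class PerTypeAggregator
{
    private readonly Dictionary<string, TypeStats> stats = new();

    public void Inject(string typeName, long objectSize)
    {
        if (!stats.TryGetValue(typeName, out var s))
        {
            s = new TypeStats();
            stats[typeName] = s;
        }

        s.Count++;
        s.TotalSize += objectSize;
    }

    public IReadOnlyDictionary<string, TypeStats> Results => stats;
}
```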

Of course, an O(1) data structure design is only possible for such a simple analysis. For more complex analysis, you need an O(N) data structure. How do you design an efficient O(N) data structure? Here are three metrics to consider:

  1. Average memory usage per node

  2. Object count per node

  3. Object reference count per node

A good design is one that keeps all three small.
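One way to keep all three small (a sketch of the general technique, not the specific design used in this series) is to store nodes as structs in large arrays and link them with integer indices rather than object references:

```csharp
using System;

// Sketch of a compact O(N) design: one small struct per node, stored in
// a single growable array and linked by int indices instead of object
// references. Millions of nodes add only a handful of GC-visible objects
// (the arrays), so all three metrics above stay small.
struct HeapNode
{
    public ulong Address;      // object address in the dump
    public int TypeIndex;      // index into a per-type table
    public int FirstRefIndex;  // index of the first outgoing reference, -1 if none
}

sealed class HeapNodeTable
{
    private HeapNode[] nodes = new HeapNode[1024];
    private int count;

    public int Add(ulong address, int typeIndex)
    {
        if (count == nodes.Length)
        {
            Array.Resize(ref nodes, nodes.Length * 2);
        }

        nodes[count] = new HeapNode { Address = address, TypeIndex = typeIndex, FirstRefIndex = -1 };
        return count++;
    }

    public ref HeapNode this[int index] => ref nodes[index];
}
```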

Here is the main object used in DebugDiag 2.2’s implementation:

ClrObject is 48 bytes, with two pointers inside. There is another pointer pointing to it, so the average memory usage per node is 52 bytes, plus the space taken by free objects; the object count per node is 1, and the object reference count per node is 3. Not an efficient data structure design.
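To make the arithmetic concrete, here is a hypothetical class of roughly that shape (an assumption for illustration; the field names and types are not DebugDiag’s actual ones):

```csharp
using System.Collections.Generic;

// Hypothetical reconstruction, for illustration only -- not DebugDiag's
// actual source. A per-object class of roughly this shape costs an object
// header plus four fields, two of them references, and each instance is
// itself referenced from a containing collection: tens of bytes and
// several GC-tracked references for every object in the dump.
sealed class ClrObjectNode
{
    public ulong Address;            // address of the object in the dump
    public ulong Size;               // size of the object
    public ClrTypeInfo Type;         // reference #1: the object's type
    public List<ulong> References;   // reference #2: outgoing references
}

sealed class ClrTypeInfo
{
    public string Name;
}
```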

Here is a struct found in Visual Studio’s memory analyzer:

This is used to record object references. Each struct is 16 bytes. That is quite costly.
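A 16-byte record is, for example, two 8-byte fields. A sketch, with hypothetical field names:

```csharp
// Hypothetical 16-byte reference record: two 8-byte fields. The field
// names are assumptions, chosen only to show how the size adds up.
struct ObjectReference
{
    public ulong SourceAddress;  // 8 bytes: the referencing object
    public ulong TargetAddress;  // 8 bytes: the referenced object
}
```

Since a typical heap has several outgoing references per object, an edge table at 16 bytes per reference can easily grow larger than the node table itself.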