• Frugal Cafe
  • Posts
  • FC50: Segmented data structure is for incremental growth

FC50: Segmented data structure is for incremental growth

Inefficient Roslyn SegmentedDictionary

I’m about to implement a segmented hash container. But before doing that, I searched for SegmentedDictionary online. Found an implementation in PerfView source code. But its original code is from Roslyn code base, so I found this version: roslyn/src/Dependencies/Collections/SegmentedDictionary`2.cs at 1cabbfeac0719ba84689348cc8db2a7a167ae14b · dotnet/roslyn (github.com)

Let’s check its resizing implementation:

The code is allocating a new segmented array for entries based on new size, copying data over. It then allocates a new segmented array for buckets; and regenerate the buckets.

This is inefficient implementation, because it’s more than doubling memory usage whenever resizing is needed. Memory usage is exactly the same as non-segmented dictionary.

Segmented data structure is for incremental growth, 2x growth is the anti-pattern to avoid with segmented data structure.

But hash container has something special, geometry growth is needed for the bucket array, because we need to use a large prime number and bucket array access is random. The entry array is different, data is added to it one by one.

Even for the bucket array, there is no need to allocate a new array doubling its original size. You just need to add the same capacity.

The entry array does not need resizing at this moment. More segments can be added when new data is going to be added. It should be segmented list instead.

With such incremental resizing approach, resizing is now much cheaper. We could even change the 2x growth policy. For example, we can reduce it to 1.5x. As long as we’re using geometry growth, adding N elements will have O(N) performance.

As such, the only advantage of using such segmented dictionary is avoiding LOH allocations. Overall memory usage is little big higher.

If you have a segmented hash container implementation in your code path, check its resizing implementation.