FC34 Undocumented StringBuilder

Understanding the internal data structure of StringBuilder and its issues

In the 1990s I was a big fan of authors like Andrew Schulman, David Maxey, Matt Pietrek who wrote Undocumented Windows (Undocumented Windows: A Programmers Guide to Reserved Microsoft Windows Api Functions (The Andrew Schulman Programming Series/Book and Disk): Schulman, Andrew, Maxey, David, Pietrek, Matt: 9780201608342: Amazon.com: Books). The more useful thing taught me by those books is how to conduct experiments on my own to investigate how software really works. Such techniques served me well when writing my own book later (https://www.amazon.com/Windows-Graphics-Programming-Hewlett-Packard-Professional/dp/0130869856), and my performance work much later.

We have been discussing StringBuilder in recent posts. It seems not many people know how it’s implemented and what are the possible performance issues you need to pay attention to. Let’s discuss how StringBuilder really works here.

If you create a StringBuilder using its default constructor, a 16-character array will be allocated, serving as its buffer. When more characters are added, the data structure needs to resize to be able to store the data and cater for possible future growth. Instead of having a single buffer and resize every time as in MemoryStream, or List, StringBuilder has a different strategy for resizing. It allocates a new StringBuilder with a large capacity and then forms a linked list of builders. Every StringBuilder in the chain points to the previous builder, while the main builder points to the last build, where insertion is happening.

The capacity for the newly allocated StringBuilder is keep rising but maxed out at 8,000 characters. There is no chance of the buffers inside to be large enough to land in large object heap, if small chunks of data are appended every time.

But there are three ways those buffers could land in LOH:

  1. When you specify a large initial capacity. A single buffer will be allocated. If capacity is larger than 85000 / 2, then the buffer is definitely in LOH.

  2. When a large piece of data is inserted/appended, for example a huge string. A single StringBuilder will be allocated.

  3. When you call StringBuilder.Clear. When a StringBuilder has multiple builders inside, calling Clear method will force it to morph to a single StringBuilder with a single buffer combining the capacity of all builders inside.

Here is code handling the first code path:

Here is the code handling the second code path:

newBlockLength could be a large number.

Here is code handling the third code path:

Here is how to avoid LOH allocations in all three cases:

  1. For constructor call, use Math.Min(42000, capacity) instead.

  2. For appending calls, split large data. We talked about extension methods here: FC32 Loads of StringBuilder Extension Methods (beehiiv.com).

  3. For Clear call, this is only an issue when StringBuilder is not reused properly. But normally StringBuilder cache do not like large builders. So have a few reused large StringBuilder objects.

So LOH allocations in StringBuilder is an avoidable problem.