FC56; The deadly Linq.Enumerable.GroupBy you do not need and can't afford
Proper data structure design needed for large data processing
Our half-billion-object challenge exposed issues in both PerfView and DebugDiag. PerfView trims the data to only 128 million objects. DebugDiag (2.2) crashed in a Linq.GroupBy expression. Here is the code:
The input is the set of objects generated by managed heap enumeration, more than 500 million of them here. Linq.GroupBy is called to aggregate the data by type. But 500 million of those objects are boxed integers, so the group for that one type is huge, which causes an "Array dimensions exceeded supported range" exception. Linq.GroupBy basically builds a multi-value dictionary: every key maps to an array holding all the elements in its group.
But if you read the code carefully, it does not really need to hold on to millions of objects. It only needs the object count and the sum of object sizes for each type.
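The problematic pattern has roughly the following shape. This is a minimal sketch, not DebugDiag's actual code: `HeapObject` and `GroupByStats` are hypothetical names introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one heap object; not a DebugDiag or ClrMD type.
public readonly struct HeapObject
{
    public HeapObject(string typeName, ulong size) { TypeName = typeName; Size = size; }
    public string TypeName { get; }
    public ulong Size { get; }
}

public static class GroupByStats
{
    // GroupBy buffers every element into its group before yielding the first group,
    // even though only the count and the size total are read afterwards.
    public static Dictionary<string, (long Count, ulong TotalSize)> Summarize(
        IEnumerable<HeapObject> objects)
    {
        return objects
            .GroupBy(o => o.TypeName)
            .ToDictionary(
                g => g.Key,
                g => (Count: g.LongCount(),
                      TotalSize: g.Aggregate(0UL, (sum, o) => sum + o.Size)));
    }
}
```

With a few hundred types this looks harmless, but a single type with 500 million instances forces one group to hold 500 million elements at once.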
I captured a dump from the crash and generated a simple stat:
Out of the 500 million objects in total, 128 × 1024 × 1024 have been processed. The array holding them in the GroupBy implementation is now 1 GB + 24 bytes in size. The next resize would throw an OOM exception, although there seems to be a flag to enable large array allocation (in .NET Framework this is the gcAllowVeryLargeObjects runtime setting).
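The arithmetic behind those numbers can be checked with a few lines. This is a sketch under stated assumptions: a 64-bit process with 8-byte references, a 24-byte array header (taken from the "+ 24 bytes" in the dump), and a backing array that doubles on each resize. `GroupArrayMath` is a hypothetical helper name.

```csharp
using System;

public static class GroupArrayMath
{
    private const long ReferenceSize = 8;   // assumes a 64-bit process
    private const long ArrayHeader = 24;    // assumed from the size seen in the dump

    // Size in bytes of a reference array with the given element count.
    public static long ArrayBytes(long elements) => elements * ReferenceSize + ArrayHeader;

    // True if the array would exceed the default 2 GB limit on a single object.
    public static bool ExceedsTwoGigabytes(long elements) =>
        ArrayBytes(elements) > 2L * 1024 * 1024 * 1024;
}
```

For 128 × 1024 × 1024 elements, `ArrayBytes` gives 1,073,741,848 bytes, which is 1 GB + 24 bytes. Doubling to 256 × 1024 × 1024 elements gives 2,147,483,672 bytes, just over the 2 GB limit, so the resize fails.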
We can change the code to:
This code processes the half-billion-object dump in 45 seconds with only 380 KB of total allocation. That is the difference between Linq.Enumerable's total disregard for data structure design and a data structure designed for the job.
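A minimal sketch of this kind of streaming aggregation follows. It is an illustration of the idea rather than the exact code from the repo: `TypeStat` and `StreamingStats` are hypothetical names, and the input is modeled as plain (type name, size) tuples instead of ClrMD objects.

```csharp
using System;
using System.Collections.Generic;

// One small accumulator per type; no heap object is retained.
public sealed class TypeStat
{
    public long Count;
    public ulong TotalSize;
}

public static class StreamingStats
{
    public static Dictionary<string, TypeStat> Summarize(
        IEnumerable<(string TypeName, ulong Size)> objects)
    {
        var stats = new Dictionary<string, TypeStat>(StringComparer.Ordinal);

        foreach (var (typeName, size) in objects)
        {
            // Allocate only the first time a type is seen.
            if (!stats.TryGetValue(typeName, out TypeStat stat))
            {
                stat = new TypeStat();
                stats.Add(typeName, stat);
            }

            stat.Count++;
            stat.TotalSize += size;
        }

        return stats;
    }
}
```

Memory use here grows with the number of distinct types, not with the number of objects, because each object is folded into its accumulator and then dropped.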
Here I’m using ClrMD 0.9.2 to match the version used by DebugDiag 2.2. If memory serves me right, I fixed the issue in its official source code repo. The next version should be much faster, but I can’t get it to work on my machine.
If you want to play with dump analysis, here is the source code: FrugalCafe/HeapWalker/Program.cs at master · ProfFrugal/FrugalCafe (github.com)