HashSet's expansion mechanism wastes far more time and space than you think

Keywords: Programming Redis

One: Background

1. Storytelling

Ever since this purely in-memory project was rolled out to big customers, I have become very sensitive to memory and CPU. Load a little data and memory swings up and down by several GB, which makes me uneasy, and I always want to grab a few dumps and open them in WinDbg to find out what caused it: is it my code or a colleague's? Many longtime readers of my blog keep leaving comments asking me to publish a WinDbg series or videos. There is no way around it: wander the rivers and lakes long enough and sooner or later you have to learn a few moves, even if only for show. 😄😄😄 Enough chatter, let's look at how HashSet expands.

Two: HashSet's expansion mechanism

1. Where to look

The best way to learn how expansion works is to read HashSet's underlying source code, and the most direct entry point is the HashSet.Add method.

Following the call chain, initialization ends up in the Initialize method, which contains this curious line: int prime = HashHelpers.GetPrime(capacity);. Taken literally, it fetches a prime number. Ha, that's interesting. What is a prime number? Simply put, it is an integer greater than 1 that is divisible only by 1 and itself. That made me curious, so let's see how the primes are used.

Under the hood, HashSet predefines 72 primes to speed things up. The largest one is 7199369 (about 7.19 million). In other words, once the requested size exceeds that, the next prime has to be computed on the fly with the IsPrime method, as follows:


public static bool IsPrime(int candidate)
{
	if ((candidate & 1) != 0)
	{
		int num = (int)Math.Sqrt(candidate);
		for (int i = 3; i <= num; i += 2)
		{
			if (candidate % i == 0)
			{
				return false;
			}
		}
		return true;
	}
	return candidate == 2;
}
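The lookup described above can be sketched as follows. This is only my simplified reconstruction, not the library source: the class name PrimeSketch is made up, the table holds just the last eight of the 72 predefined primes, and the fallback loop accepts the first odd prime it finds, whereas the real GetPrime applies an extra filter to candidates.

```csharp
using System;

public static class PrimeSketch
{
    // Assumption: only the last eight predefined primes; the real table has 72 entries.
    static readonly int[] Primes =
    {
        2009191, 2411033, 2893249, 3471899, 4166287, 4999559, 5999471, 7199369
    };

    // Same trial-division test as the IsPrime method shown above.
    public static bool IsPrime(int candidate)
    {
        if ((candidate & 1) != 0)
        {
            int limit = (int)Math.Sqrt(candidate);
            for (int i = 3; i <= limit; i += 2)
            {
                if (candidate % i == 0) return false;
            }
            return true;
        }
        return candidate == 2;
    }

    // Table first; beyond the table, test odd candidates one by one.
    public static int GetPrime(int min)
    {
        foreach (int prime in Primes)
        {
            if (prime >= min) return prime;
        }
        for (int i = min | 1; i < int.MaxValue; i += 2)
        {
            if (IsPrime(i)) return i;
        }
        return min;
    }

    public static void Main()
    {
        Console.WriteLine(GetPrime(5786498)); // answered from the table
        Console.WriteLine(GetPrime(7199370)); // past the table, computed with IsPrime
    }
}
```

The point of the sketch is the cost difference: a request inside the table is a short array scan, while a request past 7199369 pays for a square-root-bounded division loop on every candidate.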

Having walked through the whole flow, you should now see that on the first Add the default capacity is 3, the smallest of the 72 predefined primes. Readers of my earlier articles know that List starts at a default size of 4 and then grows by a simple, crude * 2, as the following code shows.


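private void EnsureCapacity(int min)
{
	if (_items.Length < min)
	{
		// Start at 4, then double on every expansion
		int num = (_items.Length == 0) ? 4 : (_items.Length * 2);
		// Clamp to the maximum array length, but never go below the requested minimum
		if ((uint)num > 2146435071u) num = 2146435071;
		if (num < min) num = min;
		Capacity = num;
	}
}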

2. Exploring HashSet's second expansion

Once the HashSet holds 3 elements, a second expansion is obviously due. List handles this in its EnsureCapacity method, so let's take a closer look at how HashSet does it.


public static int ExpandPrime(int oldSize)
{
	int num = 2 * oldSize;
	if ((uint)num > 2146435069u && 2146435069 > oldSize)
	{
		return 2146435069;
	}
	return GetPrime(num);
}

The actual expansion is done by the ExpandPrime method. It first multiplies the old size by 2 and then takes the smallest predefined prime that is not less than the result: 3 * 2 = 6, so the answer is 7, and 7 becomes the new size of the HashSet. If you insist on seeing it in action, a few lines of test code are enough to confirm it.
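One way to watch the capacity move is sketched below. It assumes a runtime where HashSet<T>.EnsureCapacity exists (.NET Core 2.1 or later, as far as I know); calling it with 0 never grows the set and simply reports the current capacity. The class and method names are mine. If the runtime follows the source shown above, the capacity should start at 3 and jump to 7 on the fourth Add, but I would treat the printed numbers as something to check rather than a guarantee.

```csharp
using System;
using System.Collections.Generic;

public static class GrowthDemo
{
    // Records the capacity reported after each Add.
    public static List<int> CapacitiesAfterEachAdd(int adds)
    {
        var set = new HashSet<int>();
        var capacities = new List<int>();
        for (int i = 0; i < adds; i++)
        {
            set.Add(i);
            // EnsureCapacity(0) requests nothing, so it only returns the current capacity.
            capacities.Add(set.EnsureCapacity(0));
        }
        return capacities;
    }

    public static void Main()
    {
        var capacities = CapacitiesAfterEachAdd(40);
        int previous = 0;
        for (int i = 0; i < capacities.Count; i++)
        {
            // Print only the Adds on which the capacity changed.
            if (capacities[i] != previous)
            {
                Console.WriteLine($"count = {i + 1}, capacity = {capacities[i]}");
                previous = capacities[i];
            }
        }
    }
}
```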

2. Do you smell the risk?

<1>Time Risks

For demonstration purposes, here are the last few of the 72 predefined primes.


public static readonly int[] primes = new int[72]
{
	// ... the first 64 entries are omitted here
	2009191,
	2411033,
	2893249,
	3471899,
	4166287,
	4999559,
	5999471,
	7199369
};

That is, when the HashSet holds 2893249 elements, the next Add triggers an expansion: 2893249 * 2 = 5786498, and the smallest predefined prime not below that is 5999471. The capacity jumps from about 2.89 million to about 6.00 million, a gain of 5999471 - 2893249 = 3106222 slots, which more than doubles it. Scary? Let's write some code to verify.


        static void Main(string[] args)
        {
            var hashSet = new HashSet<int>(Enumerable.Range(0, 2893249));

            hashSet.Add(int.MaxValue);

            Console.Read();
        }

0:000> !clrstack -l

000000B8F4DBE500 00007ffaf00132ae ConsoleApplication3.Program.Main(System.String[]) [C:\4\ConsoleApp1\ConsoleApp1\Program.cs @ 16]
    LOCALS:
        0x000000B8F4DBE538 = 0x0000020e0b8fcc08
0:000> !DumpObj /d 0000020e0b8fcc08
Name:        System.Collections.Generic.HashSet`1[[System.Int32, System.Private.CoreLib]]
Size:        64(0x40) bytes
File:        C:\Program Files\dotnet\shared\Microsoft.NETCore.App\5.0.0-preview.5.20278.1\System.Collections.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffaf0096d10  4000017        8       System.Int32[]  0 instance 0000020e2025e9f8 _buckets
00007ffaf00f7ad0  4000018       10 ...ivate.CoreLib]][]  0 instance 0000020e2bea1020 _slots
00007ffaeffdf828  4000019       28         System.Int32  1 instance          2893250 _count
0:000> !DumpObj /d 0000020e2025e9f8
Name:        System.Int32[]
Size:        23997908(0x16e2dd4) bytes
Array:       Rank 1, Number of elements 5999471, Type Int32 (Print Array)
Fields:
None


Most importantly, this expansion is done in one shot rather than incrementally, the way Redis does it, so the time cost deserves attention.
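To get a feel for that time cost, the single Add that triggers the resize can be timed against an ordinary Add. This is a rough sketch, not a benchmark: the helper TimeAdd is my own name, one Stopwatch sample is noisy, and whether the first Add really resizes depends on the runtime filling the set to exactly 2893249 capacity as in the dump above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class ResizeTiming
{
    // Times a single Add call and returns the elapsed milliseconds.
    public static double TimeAdd(HashSet<int> set, int value)
    {
        var stopwatch = Stopwatch.StartNew();
        set.Add(value);
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        var set = new HashSet<int>(Enumerable.Range(0, 2893249));

        // Assumption: the set is full at this point, so this Add has to resize first.
        double resizing = TimeAdd(set, int.MaxValue);
        // After the resize there is plenty of room, so this Add is an ordinary one.
        double ordinary = TimeAdd(set, int.MaxValue - 1);

        Console.WriteLine($"Add that resizes: {resizing:F3} ms");
        Console.WriteLine($"Ordinary Add:     {ordinary:F3} ms");
    }
}
```

The resizing Add allocates two new arrays and rehashes every existing element before it can insert anything, which is why it is worth comparing against the second call.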

<2>Spatial Risks

What about space? Let's compare how much memory a HashSet at the 2.89 million capacity and one at the 6.00 million capacity take up, which is the part I am most sensitive to.


        static void Main(string[] args)
        {
            var hashSet1 = new HashSet<int>(Enumerable.Range(0, 2893249));

            var hashSet2 = new HashSet<int>(Enumerable.Range(0, 2893249));
            hashSet2.Add(int.MaxValue);

            Console.Read();
        }

0:000> !clrstack -l
OS Thread Id: 0x4a44 (0)
000000B1B4FEE460 00007ffaf00032ea ConsoleApplication3.Program.Main(System.String[]) [C:\4\ConsoleApp1\ConsoleApp1\Program.cs @ 18]
    LOCALS:
        0x000000B1B4FEE4B8 = 0x000001d13363cc08
        0x000000B1B4FEE4B0 = 0x000001d13363d648

0:000> !objsize 0x000001d13363cc08
sizeof(000001D13363CC08) = 46292104 (0x2c25c88) bytes (System.Collections.Generic.HashSet`1[[System.Int32, System.Private.CoreLib]])
0:000> !objsize 0x000001d13363d648
sizeof(000001D13363D648) = 95991656 (0x5b8b768) bytes (System.Collections.Generic.HashSet`1[[System.Int32, System.Private.CoreLib]])

You can see that hashSet1 takes up 46292104 / 1024 / 1024 = 44.1 MB and hashSet2 takes up 95991656 / 1024 / 1024 = 91.5 MB, so 91.5 - 44.1 = 47.4 MB is wasted.
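These figures can be cross-checked with a back-of-the-envelope estimate. The sketch below assumes 16 bytes per unit of capacity for HashSet<int>: 4 bytes for the int bucket and 12 bytes for the Slot (hashCode, next and value, 4 bytes each), which matches the array sizes in the dump above once the small object headers are ignored. The names SizeEstimate and EstimateBytes are mine.

```csharp
using System;

public static class SizeEstimate
{
    const int BucketBytes = 4;  // one int per bucket
    const int SlotBytes = 12;   // hashCode + next + value for HashSet<int>

    // Approximate array payload for a given capacity, ignoring object headers.
    public static long EstimateBytes(int capacity)
    {
        return (long)capacity * (BucketBytes + SlotBytes);
    }

    public static double ToMegabytes(long bytes)
    {
        return bytes / 1024.0 / 1024.0;
    }

    public static void Main()
    {
        long before = EstimateBytes(2893249);
        long after = EstimateBytes(5999471);

        Console.WriteLine($"before: {before} bytes, {ToMegabytes(before):F1} MB");
        Console.WriteLine($"after:  {after} bytes, {ToMegabytes(after):F1} MB");
        Console.WriteLine($"growth: {after - before} bytes, {ToMegabytes(after - before):F1} MB");
    }
}
```

The estimates come out at 46291984 and 95991536 bytes, each within about a hundred bytes of what !objsize reported, so nearly all of the space is in the two arrays.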

If you really think only 47.4 MB is wasted, you are badly mistaken. Don't forget that an expansion allocates new arrays at the new size and copies the old contents over, and the old arrays keep occupying space on the heap until the GC reclaims them. Does that make sense?

To verify this, use WinDbg to look at the two arrays on the managed heap, Slot[] m_slots and int[] m_buckets. I modified the code as follows:


    static void Main(string[] args)
    {
        var hashSet2 = new HashSet<int>(Enumerable.Range(0, 2893249));
        hashSet2.Add(int.MaxValue);
        Console.Read();
    }


0:011> !dumpheap -stat
00007ffaf84f7ad0        3    123455868 System.Collections.Generic.HashSet`1+Slot[[System.Int32, System.Private.CoreLib]][]

Let's start with Slot[]. The output above shows three Slot[] arrays on the managed heap. Interesting. Why three? That is a little puzzling, so let's list the addresses of the three Slot[] arrays and look at them one by one.


0:011> !DumpHeap /d -mt 00007ffaf84f7ad0
         Address               MT     Size
0000016c91308048 00007ffaf84f7ad0 16743180     
0000016c928524b0 00007ffaf84f7ad0 34719012     
0000016ce9e61020 00007ffaf84f7ad0 71993676  

0:011> !gcroot 0000016c91308048
Found 0 unique roots (run '!gcroot -all' to see all roots).
0:011> !gcroot 0000016c928524b0
Found 0 unique roots (run '!gcroot -all' to see all roots).
0:011> !gcroot 0000016ce9e61020
Thread 2b0c:
    0000006AFAB7E5F0 00007FFAF84132AE ConsoleApplication3.Program.Main(System.String[]) [C:\4\ConsoleApp1\ConsoleApp1\Program.cs @ 15]
        rbp-18: 0000006afab7e618
            ->  0000016C8000CC08 System.Collections.Generic.HashSet`1[[System.Int32, System.Private.CoreLib]]
            ->  0000016CE9E61020 System.Collections.Generic.HashSet`1+Slot[[System.Int32, System.Private.CoreLib]][]

As shown above, I used !gcroot to look for the roots of these three addresses. Two of them have none, and the last one is the new array of size 5999471, which is still referenced by the HashSet. Next, use !do to dump the three objects.


0:011> !do 0000016c91308048
Name:        System.Collections.Generic.HashSet`1+Slot[[System.Int32, System.Private.CoreLib]][]
Size:        16743180(0xff7b0c) bytes
Array:       Rank 1, Number of elements 1395263, Type VALUETYPE (Print Array)
Fields:
None

0:011> !do 0000016c928524b0
Name:        System.Collections.Generic.HashSet`1+Slot[[System.Int32, System.Private.CoreLib]][]
Size:        34719012(0x211c524) bytes
Array:       Rank 1, Number of elements 2893249, Type VALUETYPE (Print Array)
Fields:
None

0:011> !do 0000016ce9e61020
Name:        System.Collections.Generic.HashSet`1+Slot[[System.Int32, System.Private.CoreLib]][]
Size:        71993676(0x44a894c) bytes
Array:       Rank 1, Number of elements 5999471, Type VALUETYPE (Print Array)
Fields:
None

As the Rank 1, Number of elements lines show, the managed heap holds not only the pre-expansion array of size 2893249 but also the one before it, of size 1395263. Counting 16 bytes per element (a 12-byte Slot plus a 4-byte bucket, and assuming the old int[] bucket arrays are still around as well), the three generations come to roughly 21.3 MB + 44.1 MB + 91.5 MB = 156.9 MB. Wow. That means there is about 156.9 - 91.5 = 65.4 MB of uncollected garbage on the managed heap, and together with the 47.4 MB of idle space from before, the total waste is about 112.8 MB. I hope I have not made a mistake.
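If you do not have WinDbg at hand, the leftover arrays can also be glimpsed from inside the process. The sketch below is only an approximation of the analysis above: it compares GC.GetTotalMemory before and after a forced full collection, so the difference includes every kind of garbage, not just the old Slot[] and int[] arrays, and it can be close to zero if the GC already ran on its own. GarbageProbe and BuildAndGrow are names I made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GarbageProbe
{
    // Builds the set used in this article and adds one more element to force a resize.
    public static HashSet<int> BuildAndGrow(int count)
    {
        var set = new HashSet<int>(Enumerable.Range(0, count));
        set.Add(int.MaxValue);
        return set;
    }

    public static void Main()
    {
        var set = BuildAndGrow(2893249);

        // Managed heap size as it is now, including anything not yet collected.
        long beforeCollect = GC.GetTotalMemory(forceFullCollection: false);
        // Managed heap size after a full collection has removed unreachable objects.
        long afterCollect = GC.GetTotalMemory(forceFullCollection: true);

        Console.WriteLine($"before collect: {beforeCollect / 1024.0 / 1024.0:F1} MB");
        Console.WriteLine($"after collect:  {afterCollect / 1024.0 / 1024.0:F1} MB");
        Console.WriteLine($"reclaimed:      {(beforeCollect - afterCollect) / 1024.0 / 1024.0:F1} MB");

        // Keep the set reachable until the measurements are done.
        GC.KeepAlive(set);
    }
}
```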

3. Is there a solution?

With List you can control the size through Capacity, but unfortunately HashSet has no similar option. There is only a rather awkward trimming method, TrimExcess, which shrinks the capacity to the smallest prime that is not less than the current element count, as the following code shows:


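public void TrimExcess()
{
	// Smallest prime that can hold the current elements
	int prime = HashHelpers.GetPrime(m_count);
	Slot[] array = new Slot[prime];
	int[] array2 = new int[prime];
	int num = 0;
	// Copy the live slots into the new arrays and rebuild the bucket chains
	for (int i = 0; i < m_lastIndex; i++)
	{
		if (m_slots[i].hashCode >= 0)
		{
			array[num] = m_slots[i];
			int num2 = array[num].hashCode % prime;
			array[num].next = array2[num2] - 1;
			array2[num2] = num + 1;
			num++;
		}
	}
	// Swap in the trimmed arrays; the old ones become garbage
	m_lastIndex = num;
	m_slots = array;
	m_buckets = array2;
	m_freeList = -1;
}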

In this case the capacity comes down to 3471899, but with only about 2.89 million elements in the set that still leaves roughly 580,000 slots unused.
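The effect of TrimExcess can be checked with a few lines. As before, this sketch assumes a runtime that has HashSet<T>.EnsureCapacity (.NET Core 2.1 or later, to my knowledge) and uses EnsureCapacity(0) purely to read the capacity; TrimDemo and TrimCapacity are my own names. On a runtime that uses the prime table shown in this article I would expect the capacity after trimming to be 3471899 for 2893250 elements, but please verify it on your own version.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TrimDemo
{
    // Returns the capacity reported before and after TrimExcess for a set of `count` elements.
    public static (int Before, int After) TrimCapacity(int count)
    {
        var set = new HashSet<int>(Enumerable.Range(0, count));
        int before = set.EnsureCapacity(0); // reads the capacity, requests nothing
        set.TrimExcess();
        int after = set.EnsureCapacity(0);
        return (before, after);
    }

    public static void Main()
    {
        var (before, after) = TrimCapacity(2893250);
        Console.WriteLine($"capacity before TrimExcess: {before}");
        Console.WriteLine($"capacity after TrimExcess:  {after}");
        Console.WriteLine($"slots still unused:         {after - 2893250}");
    }
}
```

Keep in mind that TrimExcess itself allocates a fresh pair of arrays and rehashes everything, so it trades one more round of garbage for the smaller footprint.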

Three: Summary

HashSet's hidden time and space overhead is much larger than you might think, and it is not small in absolute terms either, because underneath it uses two arrays, m_slots and m_buckets, and each Slot carries three fields: struct Slot {int hashCode; internal int next; internal T value;}. So once you understand the mechanism, use it with care.

Posted by kulin on Tue, 16 Jun 2020 18:09:34 -0700