Changing the 50% limit for cudaHostAlloc pinned memory on Windows 10/11

On Windows 10/11 one can allocate only about 50% of RAM as pinned memory using cudaHostAlloc(). It does not matter whether the allocation is done in one large block or in small chunks. The limit is also not influenced by ordinary memory allocated with malloc(): one can first malloc() even more than the available RAM (Windows will then swap it out), and the limit of 50% of total RAM does not change.
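For illustration, here is a minimal sketch of how the chunked case can be probed (the chunk size and the safety cap are arbitrary choices of mine, not part of our actual tests); the loop stops at roughly half of physical RAM no matter what chunk size is used:

// Sketch: pin memory in chunks until cudaHostAlloc() refuses, then report the total
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const size_t chunk = 256ULL * 1024 * 1024;           // 256 MB per chunk (arbitrary)
    const size_t cap   = 512ULL * 1024 * 1024 * 1024;    // safety cap, in case there were no limit
    std::vector<void*> chunks;
    size_t total = 0;

    while (total < cap) {
        void* p = nullptr;
        if (cudaHostAlloc(&p, chunk, cudaHostAllocMapped) != cudaSuccess)
            break;                                       // allocation refused: limit reached
        chunks.push_back(p);
        total += chunk;
    }
    printf("Pinned successfully: %.2f GB\n", total / (1024.0 * 1024.0 * 1024.0));

    for (void* p : chunks)
        cudaFreeHost(p);
    return 0;
}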

There are a lot of posts on the web, but still no answer. Does anyone know whether this 50% is a pure CUDA-internal limitation, or whether it is somehow related to what is called “Windows GPU shared memory” (which has nothing to do with CUDA kernel shared memory)? More details can be found in my StackOverflow post.

So how can one change this limit, whether by driver, CUDA, or Windows tuning?

If you want to know what is going on, use a system trace utility, or run targeted tests directly using the appropriate OS APIs. This will tell you with certainty whether CUDA imposes limits that are not imposed by the OS.

To my knowledge, cudaHostAlloc() feeds straight through to operating system APIs. I am 95% sure that the limitations observed are due to the OS imposing limits. Please note that in general pinning memory is anathema to any OS based on the concept of virtualized memory, and that is pretty much all of the common ones outside of specialized RTOSes.

Frankly, I would consider pinning more than 50% of the system memory a form of aggravated operating system abuse. I cannot claim to have experience with any possible use case under the sun, but I cannot think, off the top of my head, of a use case that would require pinning more than 50% of system memory.

The usage model envisioned by OS folks for pinned memory APIs is to allow for the creation of several reasonably-sized buffers (say, each up to 32 MB or so) for communication between the host system and devices.

But then I see folks asking on Stackoverflow why AllocateCommonBuffer() in Windows 10 will not let them allocate a buffer larger than 2 GB. I am surprised that it is even possible to allocate 50% of RAM to pinned allocations, because typically fragmentation of physical memory happens rather quickly when a system is in use and prevents that from happening. From what I read on the internet, it seems that Microsoft has made some improvements to the management of non-paged memory in Windows 10/11 to make it more “fragmentation resistant”.

Dear @njuffa. Thanks a lot for your answer.

Generally I completely agree with you about the usage of pinned memory in common situations. But there are some specific applications where it is a must, for example scientific computation on large data volumes where RANDOM access is required by the algorithms, so a kernel must be able to reach any piece of the data at ANY time. Such data volumes usually start at 100 GB and up. For such applications, specially built powerful PCs are shipped, and the amount of memory is carefully chosen to allow such large fixed allocations while the rest of the system keeps working.

On Linux, managed memory with over-subscription and page-on-demand works just fine. On Windows the only choice is to use the slower (in the case of repeated accesses) pinned mapped memory. That was fine with Windows 7, but with Windows 10/11 and newer NVIDIA drivers the situation is as I have described. The only solution is either to double the RAM, which is costly at volumes of hundreds of GB, or to drop Windows support, which is impossible.
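For context, by “pinned mapped memory” I mean the zero-copy pattern along these lines (a minimal sketch, not our actual code; the kernel and sizes are placeholders):

// Sketch: mapped pinned host memory that a kernel can read directly over the bus
#include <cuda_runtime.h>

__global__ void touch(const float* data, int n, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = data[i] * 2.0f;   // every access goes across PCIe
}

int main()
{
    const int n = 1 << 20;
    float *h_data = nullptr, *d_data = nullptr, *d_out = nullptr;

    cudaSetDeviceFlags(cudaDeviceMapHost);            // allow mapping of pinned host memory
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_data, h_data, 0);     // device-side view of the same memory
    cudaMalloc(&d_out, n * sizeof(float));

    touch<<<(n + 255) / 256, 256>>>(d_data, n, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    cudaFreeHost(h_data);
    return 0;
}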

We have done a lot of tests; the result is always the same, so it is definitely not fragmentation. I also doubt that it is OS related: there is no problem allocating much more than 50% with VirtualAlloc and then locking it. So it seems NVIDIA related. I believe this is what you meant by “targeted OS API tests”; those work as they should.

It would be nice if you could point to a specific “system trace utility”, but I really doubt it will show whether CUDA (or the driver) is the limiting factor. Direct system calls work, as I mentioned above. What should we expect to see there?

There have been quite a few similar questions on the internet in the years since Windows 10 was released, and nobody knows the answer. It has become a rule of thumb on the forums that CUDA allows only a 50% allocation. The issue can easily be tested (3 lines of code). It is a pity that nobody from NVIDIA can help to understand and resolve the situation.

We greatly appreciate your help and rapid answer. It would be great if someone could help further.

I have done a quick test just to check myself. Indeed, I can easily allocate 24 GB of non-pageable memory through direct WinAPI calls; CUDA fails under the same conditions.

Here is a short piece of code:

// Try to reserve 24 GB of pinned memory on a machine with 32 GB of RAM
#include <windows.h>
#include <cuda_runtime.h>

int main()
{
    const size_t giga = 1024ULL * 1024 * 1024;
    const size_t size = 24 * giga;

    // CUDA test
    char* p1{ nullptr };
    cudaHostAlloc(&p1, size, cudaHostAllocMapped);   // p1 stays nullptr if the allocation is refused

    // WinAPI test
    void* p2{ nullptr };
    auto process = GetCurrentProcess();
    SIZE_T min, max;
    GetProcessWorkingSetSize(process, &min, &max);
    if (SetProcessWorkingSetSize(process, size + 65536 + min, size + 65536 + max)) {
        p2 = VirtualAlloc(nullptr, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        if (p2) {
            if (!VirtualLock(p2, size)) {
                VirtualFree(p2, 0, MEM_RELEASE);     // dwSize must be 0 when using MEM_RELEASE
                p2 = nullptr;
            }
        }
    }

    // At this point p1 == 0 and p2 != 0; SetProcessWorkingSetSize does not help CUDA
    // (and is not needed for it)
    return 0;
}

So it seems not to be system related, or at least not related to the allocation functions. It could be related to WDDM >= 2.0, to CUDA, or to the driver. The WDDM angle was the subject of my question on StackOverflow. I have no idea how to check any of them from our side.

What would you propose, based on your experience?

Many thanks in advance.

I am not sure what the test is supposed to demonstrate. To my knowledge, the Windows API for allocating pinned memory suitable for DMA access is AllocateCommonBuffer(). I am not aware that it has ever been possible to allocate pinned memory up to the size of physical system memory, regardless of whether Windows 7, Windows 10, or Windows 11 are used.

FWIW, I have seen plenty of “scientific computation with large data volumes” accelerated by CUDA and none required pinning huge amounts of system memory.

If you think limits imposed by CUDA APIs are in error, you can always file a bug with NVIDIA. If it is not considered an error, just unnecessarily limiting, file a feature request (use the bug reporting form and prefix the synopsis with "RFE: " to mark it as an enhancement request rather than a functional bug).

I am really not up-to-date on this, thus the vague reference. Microsoft provides an overview of available tracing utilities here. I see a section “DTrace on Windows” which piques my interest having worked with Solaris before. I would expect all these tracing tools to have various limitations (e.g. restrictions to particular versions of Windows), so I would suggest reading the documentation carefully before diving in.

Then I’m not sure what CUDA means by the word “pinned”; I cannot find much information on this. The problem that is always stated is that “pinned” memory is not pageable, that it is bad to allocate a lot of it, and so on. Everywhere on the web where we have found something about this topic, it was treated as in my example above, i.e. VirtualAlloc/VirtualLock. I believe AllocateCommonBuffer is something different and is designed for small DMA buffers, but I’m not an expert.

There are plenty of different scientific tasks, and some of them are quite specific and need this. It works under Linux, and it worked under Windows 7. It has been broken since Windows 10, i.e. since WDDM 2 and the updated drivers. I’m quite sure it is connected to how WDDM 2 handles the memory and how the driver behaves with it. My real feeling is that it is related to GPU Shared Memory in WDDM 2. It would be nice to hear from an expert on this.

I have just reviewed my old search on this topic and found your comment that you were able to allocate 6 out of 8 GB of pinned memory on Windows 7. The post just below yours says that this is the difference between Windows 7 and Windows 10.

https://forums.developer.nvidia.com/t/max-amount-of-host-pinned-memory-available-for-allocation/56053/7

Then there is a discussion about fragmentation and large blocks. Again, in our case we are unable to allocate more than 50% of RAM even in small(er) chunks; the total amount is simply capped.

The thread is dated 2017.

We do not consider it a bug. We and other people just want to hear an exact answer, like “you can do this in CUDA or this in Windows 10 to change the limit”, or “it is hardcoded for such-and-such a reason and cannot be changed”. Then we can work around it by other means.

Thanks for your help.

Leaving aside any complications from the potential use of an IOMMU, a pinned memory allocation is a physically contiguous chunk of system memory, suitable for data transfer by DMA.

The CUDA driver maintains a pinned memory buffer internally. Its size appears to be on the order of low single-digit MBs as best as can be established without access to the internals. Any data transfers between GPU and pageable system memory go through this buffer, broken into multiple chunks if need be. Such copy operations involve two stages: (1) DMA transfer from/to GPU from pinned buffer (2) Copy between pinned system memory buffer and application-owned system memory. By using cudaHostAlloc(), a CUDA application can allocate its own pinned buffer, suitable for DMA transfers, which makes the copies less expensive and allows cudaMemcpyAsync() to be fully asynchronous so copies and compute can overlap.
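As a generic illustration of that point (a sketch, not the driver’s internal mechanism): with a cudaHostAlloc()’d buffer, cudaMemcpyAsync() on a non-default stream is a single DMA transfer that can overlap with other work, which is not guaranteed with pageable memory:

// Sketch: pinned host buffer enabling a truly asynchronous host-to-device copy
#include <cuda_runtime.h>

int main()
{
    const size_t n = 64ULL * 1024 * 1024;
    float *h_buf = nullptr, *d_buf = nullptr;
    cudaStream_t stream;

    cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocDefault);   // pinned, DMA-able
    cudaMalloc(&d_buf, n * sizeof(float));
    cudaStreamCreate(&stream);

    // Pinned source: the copy is queued and the call returns immediately.
    // With pageable memory it would be staged through the driver's internal
    // pinned buffer and could not fully overlap with compute.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, stream);

    // ... kernels in other streams could run concurrently with the copy here ...

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}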

I do not know with certainty which Windows API calls CUDA uses under the hood to implement cudaHostAlloc() and I am not a Windows operating system specialist. I am a generalist who has worked with Solaris, Linux, Windows, and a bit of MacOS. The AllocateCommonBuffer() API seemed like the closest match to the functionality provided by cudaHostAlloc(), but maybe that is not what is being used and my bringing it up is a red herring.

This is the reason I suggested using a system trace utility to find out for sure what OS APIs get invoked by cudaHostAlloc(). That would allow you to find out what software component imposes the limit, and then “bark up the correct tree” to address it.

The introduction of WDDM 2.x in Windows 10 happened too long ago for me to recall what the salient changes from WDDM 1.x, as used by Windows 7, were. The point of WDDM (as compared to the previous Windows XP driver model) is to give Windows full control over the GPU, in particular also over GPU memory allocation. Maybe there was some further tightening of controls. Obviously, as an operating system, Windows has always had full control over system memory.

Windows 10/11 has many changes compared to Windows 7. You would want to consult with a Windows specialist as to which of these might restrict pinned allocations more in Windows 10/11 than Windows 7.

The limit is entirely managed by Windows; a typical limit is 50% of system memory, but the limit is not 50% of system memory in all cases. Sorry, I don’t have further details to share. I don’t have or know of any mechanism to modify or override the observed limit, and the NVIDIA driver doesn’t control or set the limit.

I don’t have any permission to explain exactly how the driver pins memory, or confirm any of the speculation in this thread.

Because of additional WSL2->WDDM machinery that I don’t have permission to describe, the pinning limits on WSL2 are likely to be lower than what you would observe strictly on the Windows side.

You’re welcome to file a bug either to (1) request additional documentation in this area, or (2) to “increase the limit”. Based on the information available to me, and for the reasons already indicated, bugs along the lines of (2) are unlikely to be actionable.

Using the program from my 2017 post, I can confirm anecdotally that the maximum amount of memory pinnable in a single chunk with cudaHostAlloc() under Windows 10 is slightly less than 50%. On a Windows 10 system with 32 GB of system memory, the largest size allocation that succeeds is

    size_t siz = 1024ULL * 1024 * 15402;

So 15.04 GB out of 32 GB present in the system, or 47%. I did not perform this test on a freshly booted system, but a system that has been lightly used since the last reboot a few weeks ago, when it went down in a power outage.
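For anyone who wants to reproduce this (this is not the 2017 program, just a sketch of the same idea), the largest pinnable single chunk can be bracketed with a simple bisection over the requested size:

// Sketch: bisect the largest single cudaHostAlloc() that succeeds
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t lo = 0;
    size_t hi = 64ULL * 1024 * 1024 * 1024;   // upper bound, adjust to the machine

    while (hi - lo > (1ULL << 20)) {          // stop at 1 MB resolution
        size_t mid = lo + (hi - lo) / 2;
        void* p = nullptr;
        if (cudaHostAlloc(&p, mid, cudaHostAllocDefault) == cudaSuccess) {
            cudaFreeHost(p);
            lo = mid;                         // mid bytes could be pinned
        } else {
            cudaGetLastError();               // clear the (non-sticky) error
            hi = mid;                         // mid bytes could not be pinned
        }
    }
    printf("Largest pinnable single chunk: ~%.2f GB\n", lo / (1024.0 * 1024.0 * 1024.0));
    return 0;
}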

I have done another test: VirtualAlloc/VirtualLock and then cudaHostRegister() on that memory. So the memory is allocated, pinned, and only afterwards registered with CUDA. Below ~50% of memory the test passes; just above that it starts to fail (you are right, it is not exactly 50%). So it seems that the problem is not in the memory allocation itself, nor in the pinning, but in the WDDM<->CUDA interop.
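A sketch of that kind of test (not our exact code, just the same sequence of calls; error handling trimmed):

// Sketch: pin with WinAPI first, then hand the already-locked range to CUDA
#include <windows.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t size = 24ULL * 1024 * 1024 * 1024;   // adjust to probe the limit

    // Grow the working set so VirtualLock can pin this much (as in the earlier test)
    SIZE_T wsMin = 0, wsMax = 0;
    HANDLE process = GetCurrentProcess();
    GetProcessWorkingSetSize(process, &wsMin, &wsMax);
    SetProcessWorkingSetSize(process, size + 65536 + wsMin, size + 65536 + wsMax);

    void* p = VirtualAlloc(nullptr, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (p && VirtualLock(p, size)) {
        // The memory is already non-pageable here; now ask CUDA to register it
        cudaError_t err = cudaHostRegister(p, size, cudaHostRegisterMapped);
        printf("cudaHostRegister: %s\n", cudaGetErrorString(err));
        if (err == cudaSuccess) cudaHostUnregister(p);
        VirtualUnlock(p, size);
    }
    if (p) VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}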

Anyway, I would like to thank everybody a lot. Now it is clear that the limit is fixed, so we will look for other options.