In the past I have allocated about 28-30 GB of pinned host memory via cudaHostAlloc() without issue.
I tried a larger data set and attempted a single allocation of just over 32 GB of pinned memory, which resulted in an ‘out of memory’ error. This was in addition to another 17 GB allocation, so the total was in the range of 49 GB, well below the 128 GB of system memory available.
Googled the issue but found no useful information (so far).
My questions are:
Is there a way to query the maximum size allowed for a single pinned host allocation? What about the maximum total that can be allocated?
Is this issue more likely due to a single pinned allocation that is too large, or to the aggregate total of all pinned allocations?
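In the absence of a documented query, one way to establish the current limit empirically is to bisect over cudaHostAlloc() attempts. A rough sketch (the 64 GB upper bound and the 256 MB granularity are arbitrary choices, not documented limits):

// Bisect over cudaHostAlloc() attempts to find the largest single pinned
// allocation that currently succeeds. Bound and granularity are arbitrary.
#include <cstdio>
#include <cuda_runtime.h>

int main (void)
{
    size_t lo = 0;                                /* known-good size */
    size_t hi = 64ULL * 1024 * 1024 * 1024;       /* assumed upper bound */
    const size_t step = 256ULL * 1024 * 1024;     /* 256 MB granularity */

    while (hi - lo > step) {
        size_t mid = lo + (hi - lo) / 2;
        void *p = 0;
        if (cudaHostAlloc (&p, mid, cudaHostAllocDefault) == cudaSuccess) {
            cudaFreeHost (p);
            lo = mid;                             /* mid worked, try larger */
        } else {
            cudaGetLastError ();                  /* reset the error state */
            hi = mid;                             /* mid failed, try smaller */
        }
    }
    printf ("largest single pinned allocation: %.2f GB\n",
            lo / (1024.0 * 1024.0 * 1024.0));
    return 0;
}

Note that the result only reflects the state of the system at the time of the probe; it can change with load and fragmentation.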
Similar questions have been asked many times, and I have never seen a satisfactory answer for any OS supported by CUDA. cudaHostAlloc() is just a thin wrapper around OS API calls, and as CUDA programmers we are at the mercy of the OS as to whether the allocation will succeed. I do not know which OS API call is ultimately invoked on Windows; it might be something like MmAllocateContiguousMemory(): “allocates a range of contiguous, nonpaged physical memory and maps it to the system address space”.
I have searched MSDN for relevant information before, and have never been able to find even a 10,000-foot explanation of which factors play into the limits on the size of such allocations. Microsoft only warns that the allocation could be expensive due to fragmentation.
A corollary would seem to be that fragmentation can also limit the size of the allocation, and that leads me to speculate that the limit encountered might be different at different times, depending on the amount of fragmentation currently encountered. So an interesting experiment might be to check maximum usable allocation size on a system with an up-time of several weeks that has been used heavily versus the same system freshly booted. If fragmentation is the main issue that limits size, one could also speculate that multiple smaller allocations are more likely to succeed than one giant allocation.
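To test that speculation, a minimal sketch along these lines should suffice: allocate fixed-size pinned chunks until the first failure and see how far it gets (the 2 GB chunk size is an arbitrary choice):

// Allocate pinned memory in fixed-size chunks until the first failure, to see
// whether many smaller allocations get further than one giant allocation.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main (void)
{
    const size_t chunk = 2ULL * 1024 * 1024 * 1024;   /* arbitrary 2 GB chunks */
    std::vector<void*> chunks;

    for (;;) {
        void *p = 0;
        if (cudaHostAlloc (&p, chunk, cudaHostAllocDefault) != cudaSuccess) {
            cudaGetLastError ();                       /* reset error and stop */
            break;
        }
        chunks.push_back (p);
    }
    printf ("pinned %zu GB total in %zu chunks before the first failure\n",
            (chunks.size () * chunk) >> 30, chunks.size ());

    for (size_t i = 0; i < chunks.size (); i++) cudaFreeHost (chunks[i]);
    return 0;
}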
Just to follow up: breaking the larger allocation into smaller chunks did not work, so there seems to be a limit on the total amount allocated rather than on the size of a single allocation.
Maybe you can spot something interesting from the output. I don’t have experience with this tool, but “nonpaged pool” and “driver locked” look like potentially relevant memory usage types. You can look at the entire memory map in detail.
I re-wrote the code and reduced the amount of pinned host memory to about 18 GB total.
For my particular system with Windows 10 it seems the sum total cannot exceed 32 GB, and in general large allocations over 9 GB are slow.
I split one large 17 GB pinned host allocation into two 8.5 GB allocations and the total allocation time was cut almost by half.
So the bottom line is that there are definitely limits to pinned contiguous host memory allocations, and smaller allocations are faster.
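For anyone who wants to reproduce the comparison, a minimal sketch of the kind of timing involved is below; the sizes are illustrative and plain wall-clock timing of cudaHostAlloc() is sufficient.

// Time a single pinned allocation of the requested size with a wall-clock
// timer. A negative return value indicates the allocation failed.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

static double time_pinned_alloc (size_t bytes)
{
    void *p = 0;
    auto t0 = std::chrono::steady_clock::now ();
    cudaError_t stat = cudaHostAlloc (&p, bytes, cudaHostAllocDefault);
    auto t1 = std::chrono::steady_clock::now ();
    if (stat != cudaSuccess) { cudaGetLastError (); return -1.0; }
    cudaFreeHost (p);
    return std::chrono::duration<double>(t1 - t0).count ();
}

int main (void)
{
    const size_t GB = 1024ULL * 1024 * 1024;
    double t_big   = time_pinned_alloc (17 * GB);        /* one 17 GB block   */
    double t_half1 = time_pinned_alloc (17 * GB / 2);     /* two 8.5 GB blocks */
    double t_half2 = time_pinned_alloc (17 * GB / 2);
    printf ("single 17 GB allocation: %.2f s\n", t_big);
    printf ("two 8.5 GB allocations : %.2f s\n", t_half1 + t_half2);
    return 0;
}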
Will test this on my Ubuntu box next, and I expect that will probably be less problematic.
Wasted some money on a 128 GB system thinking that I could use a good chunk of that host DRAM as pinned memory, but at least now I know better.
Being able to lock 25% of 128 GB should be better than being able to lock 25% of 64 GB, I assume. And you are still doing better than the CUDA user from a forum thread last year who could only pin 4 GB on a Linux system with 128 GB of system memory.
The frustrating thing is that the operating system vendors don’t seem to document the limits, nor do they document what configuration settings (e.g. registry keys under Windows) might be able to increase the available space. I am not surprised that the OS-imposed limits are fairly low. Modern operating systems are built around, and optimized for, the use of a virtualized memory space. Using lots of pinned memory runs counter to that and could have a noticeable negative impact on OS performance.
How significant is the performance difference you are seeing with transfers from/to the GPU using pageable versus pinned memory? By my calculations the difference in throughput should only be about 20%, so unless you are close to maxing out the throughput of the PCIe link, it shouldn’t have much of an impact on app performance.
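To quantify it, a simple comparison of host-to-device copies from a malloc’d buffer versus a cudaHostAlloc’d buffer is enough. A minimal sketch (the 1 GB transfer size is arbitrary, and error checking is omitted for brevity):

// Compare host-to-device copy throughput from pageable vs. pinned host memory.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main (void)
{
    const size_t bytes = 1024ULL * 1024 * 1024;   /* arbitrary 1 GB buffer */
    void *d_buf, *h_pageable, *h_pinned;
    cudaMalloc (&d_buf, bytes);
    h_pageable = malloc (bytes);
    cudaHostAlloc (&h_pinned, bytes, cudaHostAllocDefault);
    memset (h_pageable, 1, bytes);                /* touch the pages */
    memset (h_pinned, 1, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate (&start);
    cudaEventCreate (&stop);
    cudaMemcpy (d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);  /* warm-up */

    const void *src[2]  = { h_pageable, h_pinned };
    const char *name[2] = { "pageable", "pinned" };
    for (int i = 0; i < 2; i++) {
        cudaEventRecord (start);
        cudaMemcpy (d_buf, src[i], bytes, cudaMemcpyHostToDevice);
        cudaEventRecord (stop);
        cudaEventSynchronize (stop);
        float ms = 0.0f;
        cudaEventElapsedTime (&ms, start, stop);
        printf ("%-8s H2D: %.1f GB/s\n", name[i],
                (bytes / (1024.0 * 1024.0 * 1024.0)) / (ms / 1000.0));
    }

    cudaFreeHost (h_pinned);
    free (h_pageable);
    cudaFree (d_buf);
    return EXIT_SUCCESS;
}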
For what it is worth, my attempt to allocate 6 GB of pinned memory with cudaHostAlloc() on a Windows 7 machine with only 8 GB of physical memory was successful. Either there is a fixed absolute size limit rather than a percentage-based one, or the limit differs between Windows 7 and Windows 10. My test app:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main (void)
{
    unsigned char *p;
    cudaError_t stat;
    size_t siz = 1024ULL * 1024 * 6200;   /* 6200 MiB, about 6 GB */

    stat = cudaHostAlloc ((void **)&p, siz, cudaHostAllocDefault);
    if (stat == cudaSuccess) {
        volatile unsigned long long count = 1ULL << 34;
        printf ("allocation successful\n"); fflush (stdout);
        do {                    /* busy-wait so the allocation stays resident */
            count--;
        } while (count);
        cudaFreeHost (p);
    } else {
        printf ("allocation failed: %s\n", cudaGetErrorString (stat));
    }
    return EXIT_SUCCESS;
}
When we switched to Windows 10 we had to double the amount of physical RAM to get the same pinned allocation we could get with Windows 7 (same motherboard, same GPUs, same main CUDA application).
Moreover, the maximum amount of pinnable memory depends on the number of installed GPU cards. We are now testing a system with 8 GPU cards (RTX 3090). With 128 GB of physical RAM the largest pinnable buffer we can get is about 6 GB, and the achievable size decreases with each new allocation.
If I remove 4 of the GPU cards, the first allocation can be about 11 GB.
This means that pinned (non-pageable) memory is almost unusable on Windows 10.
I even tried Windows 10 server, but I got the same sad results.
A quick experiment with a Windows 10 system currently under heavy load shows that I can allocate 7.1 GB of pinned host memory as a first allocation from a total of 32 GB of system memory.
I suspect that operating system folks would point out that allocating huge physically contiguous buffers is anathema to the address space virtualization that modern operating systems are designed around.
In practical terms, it might be a good idea to simply use regular pageable memory and focus on deploying systems with high system memory throughput. For an HPC system, four channels of DDR4-2666 should be considered the minimum. Current platforms offer as many as eight channels of DDR4-3200, as best I know.
Side remark: I would consider 128 GB of system memory underprovisioning for a system with eight RTX 3090s and even underprovisioning for a system with four RTX 3090s. I don’t know your use case, but in general one would want to shoot for system memory that is 2x to 4x the total GPU memory. It is usually not a good idea to skimp on system memory for “fat” nodes.