Is there a limit to cudaMallocHost memory allocation? Mapping of buffer object failed while using c

I have an application where each PE (CPU thread) allocates its own instance of a GPU context. Each thread calls cudaMallocHost for sizes from memPoolBoundaries[19] = [pow(2,8) : pow(2,26)] bytes when the application is initialized. Each PE has multiple buffer allocations for each of these sizes. If I run with a smaller number of PEs, say 8, it works, but when I increase the number of PEs to 64 it throws the error “mapping of buffer object failed”.

All PEs execute this in the init phase:

for(int i = 0; i < 19; i++){
    int bufSize = CpvAccess(gpuManager).memPoolBoundaries[i]; // size of memory required (256, 512, 1024, …)
    int numBuffers = nbuffers[i]; // number of allocations of size bufSize
    pools[i].size = bufSize;
    pools[i].head = NULL;
    Header *hd = pools[i].head;
    Header *previous = NULL;
    for(int j = 0; j < numBuffers; j++){
        cudaChk(cudaMallocHost((void **)&hd, sizeof(Header) + bufSize));
It fails in cudaMallocHost with the error “mapping of buffer object failed”.

I would really like to understand how to make this scale to 100s or 1000s of nodes on Cray machines.

This is outside my area of expertise, but from what I understand, cudaMallocHost() maps in a relatively straightforward fashion to a relevant OS function, I think mmap on Linux. So the size of such allocations is limited by the OS. CUDA has no up-front knowledge of the available memory; it simply passes through the return status of the OS call, suitably translated.
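A quick way to see how much pinned memory the init loop in the question asks the OS for is to add up the pool sizes. The sketch below is only an estimate under stated assumptions: pinnedBytesPerPE is a hypothetical helper, the per-pool buffer counts stand in for the poster's nbuffers array (whose values are not given), and headerBytes stands in for sizeof(Header), which is also not shown.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: total bytes one PE requests in its init loop.
// Pool i holds numBuffers[i] allocations of (headerBytes + 2^(minExp + i)) bytes,
// mirroring the 2^8 .. 2^26 pool boundaries described in the question.
// The buffer counts and header size are assumptions supplied by the caller.
std::size_t pinnedBytesPerPE(const std::vector<int>& numBuffers,
                             std::size_t headerBytes,
                             int minExp = 8) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < numBuffers.size(); ++i) {
        // Size of one buffer in pool i: 256, 512, 1024, ...
        std::size_t bufSize = std::size_t{1} << (minExp + static_cast<int>(i));
        total += static_cast<std::size_t>(numBuffers[i]) * (headerBytes + bufSize);
    }
    return total;
}
```

Calling pinnedBytesPerPE(std::vector<int>(19, 1), 0), that is, one buffer per pool with the header ignored, gives 134,217,472 bytes (just under 128 MiB) per PE. If all 64 PEs run on one node, that is already about 8 GiB of pinned memory, and larger buffer counts scale the total linearly.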

If I remember correctly, cudaMallocHost requires tracking structures to be allocated in GPU memory so the GPU knows how to forward accesses to the CPU’s address space for that memory. So if you are right on the edge of running out of GPU memory, a big cudaMallocHost can push you over the edge.

I forget how much memory this takes, but you can figure it out by allocating a huge amount of memory with cudaMallocHost, and measuring the delta in device memory usage.

And as njuffa mentions, the OS can run out of pinned memory, and there are usually ways to increase the limit (e.g. ulimit).
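To see what limit ulimit would be adjusting, a process can query its own locked-memory limit with the POSIX getrlimit call. This is a minimal sketch for Linux; formatLimit and printLockedMemoryLimit are hypothetical helper names, and whether the driver's pinned allocations are actually counted against RLIMIT_MEMLOCK is an assumption here, not something confirmed in this thread.

```cpp
#include <sys/resource.h>
#include <cstdio>
#include <string>

// Formats one rlimit value, mapping RLIM_INFINITY to "unlimited".
std::string formatLimit(rlim_t v) {
    return v == RLIM_INFINITY ? std::string("unlimited")
                              : std::to_string(v) + " bytes";
}

// Prints the soft and hard locked-memory limits of the calling process.
// This is the limit `ulimit -l` shows (the shell reports it in KiB,
// this prints bytes). Returns false if getrlimit fails.
bool printLockedMemoryLimit() {
    rlimit lim{};
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) {
        std::perror("getrlimit");
        return false;
    }
    std::printf("RLIMIT_MEMLOCK soft: %s, hard: %s\n",
                formatLimit(lim.rlim_cur).c_str(),
                formatLimit(lim.rlim_max).c_str());
    return true;
}
```

Calling printLockedMemoryLimit() from each PE during init, before the first cudaMallocHost, makes it easy to compare the limit on a node where 8 PEs work against one where 64 fail.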