mlock versus cudaHostAlloc

I’m trying to allocate large amounts of pinned memory in 1GB chunks.
The machine is running Linux and has two NUMA nodes; the node I’m running on has 250GB of RAM.

When I try to allocate the 251st buffer (just beyond the RAM of the local NUMA node), mlock
spills over to the other NUMA node and keeps allocating, while with cudaHostAlloc the code
hangs and then crashes.

Any idea why this happens with cudaHostAlloc?

    const unsigned int MAX_PINNED_POINTERS = 100 * 10;
    const size_t PINNED_SIZE = 1024UL * 1024 * 1024;  // 1GB per buffer
    char *cuda_pinned[MAX_PINNED_POINTERS];
    char *mlock_pinned[MAX_PINNED_POINTERS];

    std::cout << "Allocating buffers with malloc/mlock" << std::endl;
    for (unsigned int i = 0; i < MAX_PINNED_POINTERS; i++) {
        std::cout << "Allocating buffer [" << i << "] with malloc/mlock" << std::endl;
        mlock_pinned[i] = (char *)malloc(PINNED_SIZE);
        int res = mlock(mlock_pinned[i], PINNED_SIZE);
        if (res != 0)
            std::cout << "Error locking buffer with mlock" << std::endl;
    }

    std::cout << "Allocating buffers with cudaHostAlloc" << std::endl;
    for (unsigned int i = 0; i < MAX_PINNED_POINTERS; i++) {
        std::cout << "Allocating buffer [" << i << "] via cudaHostAlloc" << std::endl;
        cudaError_t err = cudaHostAlloc((void **)&cuda_pinned[i], PINNED_SIZE, cudaHostAllocDefault);
        if (err != cudaSuccess)
            std::cout << "cudaHostAlloc failed: " << cudaGetErrorString(err) << std::endl;
    }

Any idea?
After some research it seems that cudaHostAlloc can NOT cross NUMA node boundaries: if the CPU thread is running on NUMA node #1, then once all memory assigned to node #1 (half of the total memory in a two-node configuration) has been exhausted by cudaHostAlloc, the program grinds to a halt and crashes.
However, allocating pageable memory and then using cudaHostRegister does NOT show this behaviour, and I am able to pin memory (verified via nvprof) from the entire RAM of both NUMA nodes.

Any idea why? Is this an issue/by-design with cudaHostAlloc that cudaHostRegister somehow lifts?
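For reference, the malloc + cudaHostRegister workaround I described can be sketched like this (a minimal sketch, not my exact test code; the 1GB size and error handling are illustrative):

```cpp
#include <cuda_runtime.h>
#include <cstdlib>
#include <iostream>

int main() {
    const size_t PINNED_SIZE = 1024UL * 1024 * 1024;  // 1GB

    // Allocate ordinary pageable memory; the kernel is free to place
    // these pages on either NUMA node.
    char *buf = (char *)malloc(PINNED_SIZE);
    if (!buf) return 1;

    // Pin the existing pages in place. Unlike cudaHostAlloc, this does
    // not allocate new memory, so the pages keep whatever NUMA placement
    // the kernel already gave them.
    cudaError_t err = cudaHostRegister(buf, PINNED_SIZE, cudaHostRegisterDefault);
    if (err != cudaSuccess)
        std::cout << "cudaHostRegister failed: " << cudaGetErrorString(err) << std::endl;

    // ... use buf as pinned memory, e.g. with cudaMemcpyAsync ...

    cudaHostUnregister(buf);
    free(buf);
    return 0;
}
```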


I have a dual-socket SMC server with 128G of memory, 64G per socket.

I ran your test but limited it to 100G total allocation (i.e., 100 loop iterations), and I only ran the cudaHostAlloc loop. It had no trouble allocating up to 100G. CentOS 7, CUDA 10.1.243

I also tried an Ubuntu 16.04 dual-socket server with 160G, no problem with the loop count set to 150, CUDA 10.0.130

I also tried another Ubuntu dual-socket server with 384G, no problem with the loop count set to 350, CUDA 10.1.243

I see no indication that cudaHostAlloc cannot create pinned allocations that are resident in the memory attached to the “other” CPU socket.

The way your code is written, it is set up to allocate 1000G (100 * 10 buffers of 1G each), but you seem to be indicating you only have ~500G in your server. (Yes, I understand you can oversubscribe your RAM with plain malloc. I don’t know the behavior with malloc/mlock; it may still be possible to oversubscribe, but you cannot oversubscribe cudaHostAlloc in the same fashion.) Also, I did not run the mlock loop. It may be that if you attempt to allocate 1000G via malloc/mlock, subsequent attempts to use cudaHostAlloc behave strangely while that 1000G is still allocated via malloc/mlock.

Also, I assume this is fairly obvious, but it should be evident that cudaHostAlloc apparently behaves differently from mlock:

Hi Robert,
Thanks a lot for the detailed answer. I was indeed able to run this test on a different machine and it went through. I’ll have to check why the other machines failed (CUDA version/configuration/…) and report back when I know the answer.

The stackoverflow link you’ve supplied was a real help.

On a general note - what you guys are doing here, helping others via this forum through the years is amazing. This is another thing that differentiates nvidia from the others… :)


It seems I found the problem. The issue was Linux’s vm.zone_reclaim_mode parameter.

See here:

Setting it to zero solved the problem on the problematic machines.
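For anyone hitting the same issue, a sketch of the check and the fix (these are the standard sysctl locations; changing the value requires root):

```shell
# Check the current setting (non-zero means the kernel tries to reclaim
# pages on the local NUMA node before falling back to the remote node).
cat /proc/sys/vm/zone_reclaim_mode

# Disable zone reclaim for the running system (requires root).
sysctl -w vm.zone_reclaim_mode=0

# To persist across reboots, add this line to /etc/sysctl.conf:
#   vm.zone_reclaim_mode = 0
```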