cudaHostRegister(): strange/unexpected behaviour under Windows 10

Hi to all,

I have a problem using cudaHostRegister() on a Windows 10 64-bit system.
On the same PC, just switching to Windows 7, we never experience this problem.

Let me explain.

Our application allocates many memory buffers and pins them using cudaHostRegister().

Under Windows 7 we can pin as much memory as we want; the limit is just the available physical memory.
Under Windows 10, once the pinned memory reaches about half of the physical memory, cudaHostRegister() fails with an “out of memory” error.

The PC has:

  • a Supermicro X10DRG-Q motherboard with two NUMA nodes
  • 2 Xeon E5-2690 CPUs
  • 64 GB of RAM (32 GB installed per node)
  • 2 RTX 2080 Ti GPUs (each one connected to the PCIe bus handled directly by one CPU)
  • an NVIDIA GT 710 as the video card

Both Windows 10 and Windows 7 are 64-bit, with the latest CUDA 10 and the latest NVIDIA drivers.

To reproduce the problem, in a loop:

  • allocate a memory buffer (for instance of 0.5 GB)
  • pin it using cudaHostRegister()

Under Windows 7 I can pin as much as I want (I just stop at 55 GB to avoid locking up the system).
Under Windows 10, cudaHostRegister() fails once the total pinned memory is around half of the system memory: with 64 GB it stops around 28.5 GB; with 32 GB (after removing some RAM) it stops at 14.5 GB.

Has anyone experienced a similar problem?

I noticed another difference between Windows 7 and Windows 10.

If I look at the display properties, under Windows 7 the shared system memory for the video is 3 GB, while under Windows 10 it is 32 GB. Anyway, I don’t think this is the problem; it seems related to the maximum amount of non-pageable memory.

Here is my code.

#include <windows.h>
#include <stdio.h>
#include <tchar.h>
#include <exception>
#include <cuda_runtime.h>
#include <vector>
#include <conio.h>


#pragma comment(lib, "cudart.lib")
int numDevices = 0;

struct MYBUF {
	BYTE *pBuf;
	size_t size;
	bool		pinned;

	MYBUF()
	{
		pBuf = NULL;
		size = 0;
		pinned = false;
	}
};

std::vector<MYBUF> myBufs;

int main(_In_ int _Argc, _In_reads_(_Argc) _Pre_z_ char ** _Argv, _In_z_ char ** _Env)
{

	cudaError_t ret = cudaGetDeviceCount(&numDevices);

	if (ret != cudaSuccess || numDevices < 1)
	{
		_tprintf(TEXT("No CUDA devices detected. ret = %d, devices = %d\n"), (int)ret, numDevices);

		return -1;
	}
	else
	{
		_tprintf(TEXT("Detected devices: %d\n"), numDevices);
	}

	
	const size_t ONE_KB = 1024;
	const size_t ONE_MB = 1024 * ONE_KB;
	const size_t ONE_GB = 1024 * ONE_MB;

	
	bool ok = true;

	size_t total_size=0;
	while ( ok )
	{
		MYBUF	buf;
		buf.size = 512 * ONE_MB;

		try
		{
			_tprintf(TEXT("Allocate and pin buf %03u...."), (unsigned)(myBufs.size() + 1));
			buf.pBuf = new BYTE[buf.size];

			cudaError_t ret = cudaHostRegister(buf.pBuf, buf.size, 0);

			if (ret != cudaSuccess)
			{
				printf("Pinning of %.3f GB failed. ret = %d (%s).\n", (float)buf.size / ONE_GB, ret, cudaGetErrorString(ret));
				ok = false;
			}
			else
			{
				total_size += buf.size;
				_tprintf(TEXT("Pinning ok. Total size: %.3fGB\n"), (float)total_size/ONE_GB);
				buf.pinned = true;
			}

			myBufs.push_back(buf);

			if (total_size >= 55 * ONE_GB)
			{
				_tprintf(TEXT("Stop\n"));
				ok = false;
			}
		}
		catch (const std::bad_alloc &)
		{
			_tprintf(TEXT("Unable to allocate %.1f GB.\n"), (float)buf.size / ONE_GB);
			buf.pBuf = NULL;
			ok = false;
		}

		
	}

	_tprintf(TEXT("Press any key to continue\n"));
	_getch();

	for (size_t i = 0; i < myBufs.size(); i++)
	{
		if ( myBufs[i].pinned )
		{
			_tprintf(TEXT("Unpin buf %03u..."), (unsigned)(i + 1));
			cudaError_t ret = cudaHostUnregister((void *)myBufs[i].pBuf);
			if (ret != cudaSuccess)
			{
				_tprintf(TEXT("Failed. Ret: %d."), ret);
			}
			else
			{
				_tprintf(TEXT("Ok."));
			}
		}
		else
		{
			_tprintf(TEXT("Buf %03u is not pinned."), (unsigned)(i + 1));
		}
		if ( myBufs[i].pBuf != NULL )
		{
			_tprintf(TEXT("     Deallocate buffer memory\n"));
			delete [] myBufs[i].pBuf;
		}
	}
	return 0;
}

This is the output (just the allocation part):

Allocate and pin buf 001....Pinning ok. Total size: 0.500GB
Allocate and pin buf 002....Pinning ok. Total size: 1.000GB
Allocate and pin buf 003....Pinning ok. Total size: 1.500GB

[...]

Allocate and pin buf 046....Pinning ok. Total size: 23.000GB
Allocate and pin buf 047....Pinning ok. Total size: 23.500GB
Allocate and pin buf 048....Pinning ok. Total size: 24.000GB
Allocate and pin buf 049....Pinning ok. Total size: 24.500GB
Allocate and pin buf 050....Pinning ok. Total size: 25.000GB
Allocate and pin buf 051....Pinning ok. Total size: 25.500GB
Allocate and pin buf 052....Pinning ok. Total size: 26.000GB
Allocate and pin buf 053....Pinning ok. Total size: 26.500GB
Allocate and pin buf 054....Pinning ok. Total size: 27.000GB
Allocate and pin buf 055....Pinning ok. Total size: 27.500GB
Allocate and pin buf 056....Pinning ok. Total size: 28.000GB
Allocate and pin buf 057....Pinning of 0.500 GB failed. ret = 2 (out of memory).

I made some tests on other PCs.

  • My laptop (an i7 with 16 GB of RAM and an integrated NVIDIA GPU) and another PC (an i7 with 32 GB of RAM and a GTX 1080) both show the same problem.

When the total pinned memory reaches half of the system memory, cudaHostRegister() fails.

The same happens if I simply use cudaHostAlloc().
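For reference, the cudaHostAlloc() variant of the test differs only in how the buffer is obtained; a minimal sketch (it requires a CUDA-capable device, so it is not runnable everywhere, and it deliberately leaks the buffers until process exit, as in the original test):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Same experiment as the code above, but letting the CUDA runtime allocate
// pinned memory directly instead of new[] + cudaHostRegister().
int main()
{
    const size_t ONE_MB = 1024 * 1024;
    const size_t chunk = 512 * ONE_MB;
    size_t total = 0;

    for (;;)
    {
        void *p = NULL;
        cudaError_t ret = cudaHostAlloc(&p, chunk, cudaHostAllocDefault);
        if (ret != cudaSuccess)
        {
            printf("cudaHostAlloc failed after %.3f GB: %s\n",
                   (double)total / (1024.0 * ONE_MB), cudaGetErrorString(ret));
            break;
        }
        total += chunk;
        printf("Allocated %.3f GB pinned\n", (double)total / (1024.0 * ONE_MB));
    }
    return 0;
}
```

On my machines this loop stops at the same ~50% threshold as the cudaHostRegister() version.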

It seems to be a bug or an unwanted feature of Windows 10.

A similar question came up a while ago:

https://devtalk.nvidia.com/default/topic/1027428/max-amount-of-host-pinned-memory-available-for-allocation/

Generally speaking, modern operating systems are built around the concept of virtual memory. Physically pinning memory runs counter to that. It may lead to performance issues if too much memory is pinned, and operating systems may therefore purposefully limit the amount of memory that can be pinned.

Unfortunately the operating system vendors don’t usually carefully document what limits on pinned allocations they have built into their software. Maybe you can find out more about the limits imposed by various Windows versions specifically in a Microsoft developer forum.

Thanks

I saw that post too.

Anyway, our application is a real-time measuring system in which all the RAM must be available to our application.

If other user applications suffer because of this, it does not matter.

Under Windows 7 we can pin as much as we need; with Windows 10 we would have to double the physical RAM to get the same amount of pinned memory!

Anyway, I will now try to play with some registry keys.

I am experiencing exactly the same behavior under Windows 10: it fails when reaching half of the physical memory in my system (64 GB). I have not tested under Windows 7, though.

It seems that this limit is set by CUDA, not by the OS. When I manually allocate and lock my buffers (using VirtualAlloc/VirtualLock), I can reach the total amount of physical memory in my system.
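A sketch of what I mean by locking manually (Windows-only, so not portable; the 60 GB target is an assumption for a 64 GB machine, adjust it for your RAM size). Note that VirtualLock() is itself constrained by the process working-set quota, so the working set has to be raised first with SetProcessWorkingSetSize():

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    const SIZE_T ONE_GB = 1024ull * 1024 * 1024;
    const SIZE_T chunk = 512ull * 1024 * 1024;
    SIZE_T total = 0;

    // VirtualLock is limited by the working-set quota, so raise it first.
    // 60 GB is an arbitrary target for a 64 GB machine.
    SIZE_T target = 60ull * ONE_GB;
    if (!SetProcessWorkingSetSize(GetCurrentProcess(), target, target))
    {
        printf("SetProcessWorkingSetSize failed: %lu\n", GetLastError());
        return 1;
    }

    for (;;)
    {
        void *p = VirtualAlloc(NULL, chunk, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        if (p == NULL)
            break;
        if (!VirtualLock(p, chunk))
        {
            printf("VirtualLock failed after %.1f GB (error %lu)\n",
                   (double)total / ONE_GB, GetLastError());
            break;
        }
        total += chunk;
        printf("Locked %.1f GB\n", (double)total / ONE_GB);
    }
    return 0;
}
```

With this approach I get well past the 50% mark, which is why I suspect the 50% rule is enforced by the CUDA driver rather than by Windows.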

From all the tests I did, the rule set by CUDA looks like this:

((Working set of process) + (RAM used by pinned memory)) <= (physical RAM in system)

Since the pinned memory is itself part of the working set of the process, we cannot pin more than 50% of the available RAM. In my tests I manually increased the working set of my process, and the pinned allocations then failed before reaching 50% of the RAM, in accordance with the above rule.

CUDA developers have been speculating about this subject for many years now. It would be nice if this were documented in the CUDA documentation; at the very least, NVIDIA should say whether the limit comes from the OS or from CUDA itself.