Hi, I have an issue where cudaMallocManaged() is not allocating device memory; it always seems to use zero-copy system memory. I have also tried setting the environment variable
CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
but it doesn't help.
My system spec:
Windows 10
Quadro P4000
CUDA 9.2
i7-7700k
Note that I only have one P4000 GPU in my system, and cudaGetDeviceCount() also returns 1.
Here is the code I use to test whether it uses GPU memory:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
class Giant
{
public:
int mData[0xffff];
};
int main()
{
using namespace std;
std::size_t free, total;
cudaMemGetInfo(&free, &total);
Giant* p0;
cudaMallocManaged(&p0, sizeof(Giant) * 1024);
std::size_t rest;
cudaMemGetInfo(&rest, &total);
if (rest < free)
{
cout << "Using GPU memory\n";
}
else
{
cout << "Using system memory\n";
}
cudaFree(p0);
return 0;
}
It prints
Using system memory
However, on my other machine with Windows 7 and a Xeon E5-2630 v3 CPU (other configuration the same), cudaMallocManaged() allocates GPU memory with no issue, and the above test code prints "Using GPU memory".
I would rewrite the test and figure out the maximum allocation size; that's probably the issue. Offhand, it looks like you are allocating a single 256 MB chunk of memory. IIRC, memory management in Windows 7 was pretty simple: it split the available vidmem in half and used one half for CUDA. Windows 10 probably allows more memory to be used (I would be curious if it does).
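For example, a probe along these lines (a rough sketch, untested; the size ladder is an arbitrary choice) would show at what allocation size, if any, cudaMemGetInfo() reports a drop in free device memory:

#include "cuda_runtime.h"
#include <cstddef>
#include <cstdio>

int main()
{
    // Double the managed allocation size until it fails, and check after
    // each step whether free device memory dropped. Note: with on-demand
    // page migration, free memory may not drop until the memory is first
    // touched; on Windows it should be committed at allocation time.
    for (std::size_t mb = 1; mb <= 4096; mb *= 2)
    {
        std::size_t before, after, total;
        cudaMemGetInfo(&before, &total);

        void* p = nullptr;
        if (cudaMallocManaged(&p, mb << 20) != cudaSuccess)
        {
            std::printf("%zu MB: allocation failed\n", mb);
            break;
        }
        cudaMemGetInfo(&after, &total);
        std::printf("%zu MB: %s\n", mb,
                    after < before ? "device memory" : "system memory");
        cudaFree(p);
    }
    return 0;
}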
On the Windows 10 machine, printing the free/total values from cudaMemGetInfo() before and after the allocation gives:
Free: 7129556582 Total: 8589934592
Free: 7129556582 Total: 8589934592
Using system memory
Even when I just allocate a single integer, "Free" doesn't change after the cudaMallocManaged() call, whereas on Windows 7 "Free" actually drops. And because cudaMallocManaged() is not using device memory, kernels that access memory allocated by cudaMallocManaged() run very slowly.
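Here is the minimal single-integer version of the check (a small sketch; even a 4-byte managed allocation should consume at least one page of device memory if it is device-backed):

#include "cuda_runtime.h"
#include <cstdio>

int main()
{
    size_t before, after, total;
    cudaMemGetInfo(&before, &total);

    int* p = nullptr;
    cudaMallocManaged(&p, sizeof(int));  // smallest possible managed allocation
    cudaMemGetInfo(&after, &total);

    std::printf("Free before: %zu\nFree after:  %zu\n", before, after);
    std::printf(after < before ? "device free memory dropped\n"
                               : "device free memory unchanged\n");
    cudaFree(p);
    return 0;
}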
For my Quadro card, as far as I know, the only way to make cudaMallocManaged() allocate device memory on Windows 10 is to switch to the TCC driver. On Windows 7 this is not necessary. I wonder whether this is a feature or a bug in Windows 10 or in CUDA.
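To check from code which driver model is active, cudaDeviceProp has a tccDriver field (1 under TCC, 0 under WDDM); a small sketch, assuming the P4000 is device 0. The switch itself is done with nvidia-smi -g 0 -dm 1, which needs administrator rights and, if I remember correctly, a reboot.

#include "cuda_runtime.h"
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assumes the P4000 is device 0
    std::printf("%s is using the %s driver\n",
                prop.name, prop.tccDriver ? "TCC" : "WDDM");
    return 0;
}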
“On Windows, the physical storage is always created in ‘zero-copy’ or host memory. All GPUs will reference the data at reduced bandwidth over the PCIe bus. In these circumstances, use of the environment variable CUDA_VISIBLE_DEVICES is recommended to restrict CUDA to only use those GPUs that have peer-to-peer support. Alternatively, users can also set CUDA_MANAGED_FORCE_DEVICE_ALLOC to a non-zero value to force the driver to always use device memory for physical storage. When this environment variable is set to a non-zero value, all devices used in that process that support managed memory have to be peer-to-peer compatible with each other.”
So it should work… but it probably doesn't on Windows 10, because of the memory manager W10 uses. I would have to investigate a bit more, but I suspect Windows 10 uses a memory manager shared between CUDA and other Windows clients, because giving up half your available memory to run CUDA clients would just suck for those who do other things with their computers.
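For what it's worth, one way to rule out the variable simply not reaching the driver is to set it from inside the process before the first CUDA call; a sketch (assuming the runtime picks the variable up at context creation; _putenv is the Windows CRT call):

#include <cstdlib>
#include <cstdio>
#include "cuda_runtime.h"

int main()
{
    // Set before any CUDA call, on the assumption that the runtime reads
    // environment variables when the context is created.
    _putenv("CUDA_MANAGED_FORCE_DEVICE_ALLOC=1");

    size_t before, after, total;
    cudaMemGetInfo(&before, &total);  // first CUDA call initializes the runtime

    int* p = nullptr;
    cudaMallocManaged(&p, 1 << 20);  // 1 MB managed allocation
    cudaMemGetInfo(&after, &total);

    std::printf(after < before ? "Using GPU memory\n" : "Using system memory\n");
    cudaFree(p);
    return 0;
}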