Hi, I have an issue where cudaMallocManaged() is not allocating device memory; it always seems to use zero-copy system memory. I have also tried setting the environment variable
CUDA_MANAGED_FORCE_DEVICE_ALLOC=1
but it doesn't help.
My system spec:
Windows 10
Quadro P4000
CUDA 9.2
i7-7700k
Note that I only have one P4000 GPU in my system, and cudaGetDeviceCount() also returns 1.
Here is the code I use to test whether it uses GPU memory:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
class Giant
{
public:
int mData[0xffff];
};
int main()
{
using namespace std;
std::size_t free, total;
cudaMemGetInfo(&free, &total);
Giant* p0;
cudaMallocManaged(&p0, sizeof(Giant) * 1024);
std::size_t rest;
cudaMemGetInfo(&rest, &total);
if (rest < free)
{
cout << "Using GPU memory\n";
}
else
{
cout << "Using system memory\n";
}
cudaFree(p0);
return 0;
}
It prints
Using system memory
However, on my other machine with Windows 7 and a Xeon E5-2630 v3 CPU (other configuration the same), cudaMallocManaged() allocates GPU memory with no issue, and the above test code prints "Using GPU memory".
I would rewrite the test and figure out the maximum allocation size; that's probably the issue. Offhand, it looks like you are allocating a single 256 MB chunk of memory. IIRC, memory management in Windows 7 was pretty simple: it split the available vidmem in half and used one half for CUDA. Windows 10 probably allows more memory to be used (I would be curious if it does).
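For example, a probe along these lines (a rough sketch, untested; the size ladder is an arbitrary choice) would show at what allocation size, if any, cudaMemGetInfo() reports a drop in free device memory:

#include "cuda_runtime.h"
#include <cstddef>
#include <cstdio>

int main()
{
    // Double the managed allocation size until it fails, and check after
    // each step whether free device memory dropped. Note: with on-demand
    // page migration, free memory may not drop until the memory is first
    // touched; on Windows it should be committed at allocation time.
    for (std::size_t mb = 1; mb <= 4096; mb *= 2)
    {
        std::size_t before, after, total;
        cudaMemGetInfo(&before, &total);

        void* p = nullptr;
        if (cudaMallocManaged(&p, mb << 20) != cudaSuccess)
        {
            std::printf("%zu MB: allocation failed\n", mb);
            break;
        }
        cudaMemGetInfo(&after, &total);
        std::printf("%zu MB: %s\n", mb,
                    after < before ? "device memory" : "system memory");
        cudaFree(p);
    }
    return 0;
}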
On the Windows 10 machine, printing the free/total values from cudaMemGetInfo() before and after the allocation gives:
Free: 7129556582 Total: 8589934592
Free: 7129556582 Total: 8589934592
Using system memory
Even when I just allocate a single integer, "Free" doesn't change after the cudaMallocManaged() call, whereas on Windows 7 "Free" actually drops. And because cudaMallocManaged() is not using device memory, kernels that access memory allocated by cudaMallocManaged() run very slowly.
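Here is the minimal single-integer version of the check (a small sketch; even a 4-byte managed allocation should consume at least one page of device memory if it is device-backed):

#include "cuda_runtime.h"
#include <cstdio>

int main()
{
    size_t before, after, total;
    cudaMemGetInfo(&before, &total);

    int* p = nullptr;
    cudaMallocManaged(&p, sizeof(int));  // smallest possible managed allocation
    cudaMemGetInfo(&after, &total);

    std::printf("Free before: %zu\nFree after:  %zu\n", before, after);
    std::printf(after < before ? "device free memory dropped\n"
                               : "device free memory unchanged\n");
    cudaFree(p);
    return 0;
}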
For my Quadro card, as far as I know, the only way to make cudaMallocManaged() allocate device memory on Windows 10 is to switch to the TCC driver. On Windows 7 this is not necessary. I wonder whether this is a feature or a bug in Windows 10 or in CUDA.
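To check from code which driver model is active, cudaDeviceProp has a tccDriver field (1 under TCC, 0 under WDDM); a small sketch, assuming the P4000 is device 0. The switch itself is done with nvidia-smi -g 0 -dm 1, which needs administrator rights and, if I remember correctly, a reboot.

#include "cuda_runtime.h"
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // assumes the P4000 is device 0
    std::printf("%s is using the %s driver\n",
                prop.name, prop.tccDriver ? "TCC" : "WDDM");
    return 0;
}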
“On Windows, the physical storage is always created in ‘zero-copy’ or host memory. All GPUs will reference the data at reduced bandwidth over the PCIe bus. In these circumstances, use of the environment variable CUDA_VISIBLE_DEVICES is recommended to restrict CUDA to only use those GPUs that have peer-to-peer support. Alternatively, users can also set CUDA_MANAGED_FORCE_DEVICE_ALLOC to a non-zero value to force the driver to always use device memory for physical storage. When this environment variable is set to a non-zero value, all devices used in that process that support managed memory have to be peer-to-peer compatible with each other.”
So it should work… but it probably doesn't on Windows 10, because of the memory manager W10 uses. I would have to investigate a bit more, but I suspect Windows 10 uses a memory manager shared between CUDA and other Windows clients, because giving up half your available memory to run CUDA clients would just suck for those who do other things with their computers.
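For what it's worth, one way to rule out the variable simply not reaching the driver is to set it from inside the process before the first CUDA call; a sketch (assuming the runtime picks the variable up at context creation; _putenv is the Windows CRT call):

#include <cstdlib>
#include <cstdio>
#include "cuda_runtime.h"

int main()
{
    // Set before any CUDA call, on the assumption that the runtime reads
    // environment variables when the context is created.
    _putenv("CUDA_MANAGED_FORCE_DEVICE_ALLOC=1");

    size_t before, after, total;
    cudaMemGetInfo(&before, &total);  // first CUDA call initializes the runtime

    int* p = nullptr;
    cudaMallocManaged(&p, 1 << 20);  // 1 MB managed allocation
    cudaMemGetInfo(&after, &total);

    std::printf(after < before ? "Using GPU memory\n" : "Using system memory\n");
    cudaFree(p);
    return 0;
}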