cudaMallocManaged and CUDA 8.0

Hi,

I’m trying to make my code run quickly on CUDA 8.0 after the change in behaviour for managed memory (see the main answer to http://stackoverflow.com/questions/39782746/why-is-nvidia-pascal-gpus-slow-on-running-cuda-kernels-when-using-cudamallocmana for background).

I have code that uses the handy Managed class for writing code that is portable between C++ and CUDA:

class Managed {
public:
    void *operator new(size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        cudaDeviceSynchronize();
        return ptr;
    }

    void operator delete(void *ptr) {
        cudaDeviceSynchronize();
        cudaFree(ptr);
    }
};

(see https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/).

To make my managed code fast, I need the GPU to prefetch the managed memory. I added a method to my class that derives from Managed:

__host__ void SynchronizeGPUMemory()
{
    void* ptr = (void*) this;
    size_t byteSize = sizeof(MyManagedClass);
    // Prefetch this object's managed memory to device 0
    cudaMemPrefetchAsync(ptr, byteSize, 0);
}

and call this method before I launch the kernel that utilises the managed memory object.

However, I get an error when debugging the kernel:

CUDART error: cudaMemPrefetchAsync returned cudaErrorInvalidDevice

Has anyone seen this, or have an idea of what I'm doing wrong? I was also wondering whether passing 0 as the third parameter of cudaMemPrefetchAsync (the destination device) and letting the stream argument default is okay (I'm not using streams).

Many thanks

Check whether cudaDevAttrConcurrentManagedAccess is 0; that is the reason I can't use this call, and I get the same error. But my problem goes deeper, and you may have the same one.

I also use cudaMallocManaged in a C++ class. But my code doesn't really use unified memory; it falls back to zero-copy memory. This only happens with Pascal GPUs. It is still fast, but compared to a Maxwell GPU it is 50-100x slower, and I can't see any page faults in NSight. All my kernels are limited by the CPU-GPU data transfer.

Hi,

ConcurrentManagedAccess: 0 is an NVIDIA bug on Windows systems which occurs with the Pascal architecture.

I have known this for a few days, but could not post it here because I was on vacation without an internet connection.

For details see the comments of: https://devblogs.nvidia.com/parallelforall/unified-memory-cuda-beginners/
where Mark Harris from NVIDIA confirms the bug. It should be corrected with CUDA 9. He also says it should be reported to Microsoft to help the cause, but I have not found a suitable Microsoft bug report page so far.

If you are running a Windows system, this is likely your cause as well.

Almost a year later and this problem/bug has not been fixed. Any update on this?

The update is that the feature has been removed from Windows as of CUDA 9.0. It is only supported on Linux.

Got it. Thanks txbob.