Problems with Unified Memory Under Pascal

Inspired by the advantages of unified memory on Pascal, I switched from Maxwell to Pascal.

But I have a problem: none of my systems and programs use unified memory on Pascal when I allocate memory with cudaMallocManaged. They fall back to zero-copy memory. Did I miss something?

Running: Win_64bit i7-i930 with one GTX1080, Win7_64bit i7-4790 with a GTX1080, and Win10_64bit i5-6600k with a GTX1060. All with VS2013. Nsight also doesn’t show any unified memory allocation, page faulting, or data migration.

It is the same program code I used under Maxwell without any problems. Setting CUDA_MANAGED_FORCE_DEVICE_ALLOC = 1 under CMake does not help either.

Additional info: cudaDevAttrConcurrentManagedAccess is 0.
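For anyone who wants to reproduce the check above, here is a minimal sketch that queries the managed-memory attributes through the runtime API. It assumes the GPU of interest is device 0. It also prints whether CUDA_MANAGED_FORCE_DEVICE_ALLOC is visible in the environment of the running process; as far as I understand, that variable is read when the program runs, so setting it only inside the build system may have no effect.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: the GPU under test is device 0

    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int managed = 0;     // non-zero if the device can allocate managed memory
    int concurrent = 0;  // non-zero if CPU and GPU may access managed memory concurrently
    cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, device);
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);

    printf("device %d: %s (compute capability %d.%d)\n",
           device, prop.name, prop.major, prop.minor);
    printf("cudaDevAttrManagedMemory           = %d\n", managed);
    printf("cudaDevAttrConcurrentManagedAccess = %d\n", concurrent);

    // Check whether the variable actually reached this process.
    const char *force = std::getenv("CUDA_MANAGED_FORCE_DEVICE_ALLOC");
    printf("CUDA_MANAGED_FORCE_DEVICE_ALLOC    = %s\n", force ? force : "(not set)");
    return 0;
}
```

If the second attribute prints 0, the device is reporting that it does not support concurrent CPU/GPU access to managed memory in this configuration, which would fit with Nsight showing no page faults.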

A hint about what I am doing wrong would be nice. Thanks.

If you have multiple GPUs in the system, and those GPUs are not attached to the same PCIE root complex, then managed allocations become ZC allocations instead. This particular behavior is documented in the programming guide.

Do you have multiple GPUs?

If you want to work around this (force CUDA to only have 1 GPU in view) use the CUDA_VISIBLE_DEVICES environment variable, which is documented in the programming guide.

That is my problem. I don’t have multiple GPUs, but my system behaves as if I did. I read that part of the documentation twice.

But thanks, you inspired me to try setting CUDA_VISIBLE_DEVICES and check how many devices the CUDA compiler thinks it sees. It could help me find out what goes wrong.

Possibly somebody knows or has an idea how the nvcc compiler finds the GPUs and what could go wrong.
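A quick way to see how many devices are in view is to ask the CUDA runtime when the program runs, since device enumeration happens there rather than in nvcc at compile time. The sketch below lists each visible device with its PCI location and echoes CUDA_VISIBLE_DEVICES; running it with and without that variable set should show whether a second device is being picked up.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Show what the process sees for the variable, if anything.
    const char *visible = std::getenv("CUDA_VISIBLE_DEVICES");
    printf("CUDA_VISIBLE_DEVICES = %s\n", visible ? visible : "(not set)");

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("visible CUDA devices: %d\n", count);

    // List every device the runtime enumerates, with its PCI address.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        printf("  [%d] %s, PCI domain %d, bus %d, device %d\n",
               i, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```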

My last guess is that something is wrong with my C++ cudaMallocManaged implementation, which follows the "Unified Memory with C++" part of

https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

I used this to simplify my code and get a kind of auto pointer with constructors, destructors, and copy functions. I will try a clean project without this.
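For reference, the pattern in question looks roughly like the sketch below: a base class that overloads operator new and operator delete so that derived objects live in managed memory. This is written in the spirit of the linked post, not copied from it, and the Counter type and increment kernel are hypothetical names used only for illustration.

```cuda
#include <cstdio>
#include <cstddef>
#include <new>
#include <cuda_runtime.h>

// Base class: objects created with `new` are placed in managed memory.
class Managed {
public:
    void *operator new(std::size_t len) {
        void *ptr = nullptr;
        if (cudaMallocManaged(&ptr, len) != cudaSuccess) throw std::bad_alloc();
        return ptr;
    }
    void operator delete(void *ptr) {
        cudaDeviceSynchronize();  // make sure no kernel still uses the object
        cudaFree(ptr);
    }
};

// Hypothetical example type, for illustration only.
struct Counter : public Managed {
    int value;
};

__global__ void increment(Counter *c) { c->value += 1; }

int main() {
    Counter *c = new Counter;  // allocated through cudaMallocManaged
    c->value = 41;

    increment<<<1, 1>>>(c);
    cudaError_t err = cudaDeviceSynchronize();  // wait before the host reads the object
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("value = %d\n", c->value);
    delete c;
    return 0;
}
```

A wrapper like this only changes where the allocation call is made, so by itself it should not decide whether the allocation ends up migrating or zero-copy. Testing with a plain cudaMallocManaged call is still a reasonable way to rule it out.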

Okay, I have now debugged the CUDA Samples for Unified Memory after forcing the compiler to compile for arch=61. I see the same behavior there: no page faults are triggered in Nsight. Also, calling the kernels repeatedly, so that the pages would be migrated to the GPU if unified memory worked, shows no better kernel time. The time stays constant, so it must be zero-copy memory.
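The repeated-launch test described above can be sketched as follows. It is a minimal example, assuming a single GPU and a hypothetical scale kernel: the buffer is first touched on the host, then the same kernel is timed several times with CUDA events. If pages were migrated to the GPU on first access, one would expect the first run to take noticeably longer than the later ones; roughly constant times would match what is reported here.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel: touches every element once.
__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t n = (size_t)1 << 24;  // 16M floats, 64 MB
    float *data = nullptr;
    cudaError_t err = cudaMallocManaged(&data, n * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Touch the buffer on the host first.
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);

    // Time the same kernel several times on the same buffer.
    for (int run = 0; run < 5; ++run) {
        cudaEventRecord(start);
        scale<<<blocks, threads>>>(data, n, 1.0f);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("run %d: %.3f ms\n", run, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(data);
    return 0;
}
```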

Or is a C++11 compiler necessary?