Every time I run the NVIDIA Visual Profiler (v8.0), the following warning appears on the console:
==3348== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory
The corresponding memory profiling information is missing. This happens when I use the profiler to analyze my own code or any CUDA sample application. But there seems to be only one visible GPU device on my PC, a GeForce GTX 1070, and “GeForce GTX 1070” is the only root node shown on the profiler screen. Invoking cudaGetDeviceCount also returns 1. But if there is only ONE CUDA-visible GPU device, what is the “PAIR of devices without peer-to-peer support” mentioned in the warning? What is wrong with the profiler, or possibly with my system configuration? How can it be fixed?
I guess you have two different GPUs in your computer, right?
Well, no. That’s what is confusing me. In the original message I explained that cudaGetDeviceCount returns 1, so there is only one CUDA-visible GPU. On the other hand, there may be another GPU built into the MSI motherboard, but it does not seem to be CUDA-enabled.
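For what it’s worth, you can check what the CUDA runtime itself sees with a small program like the sketch below (error checking omitted for brevity). If it also reports a single device, the warning’s “pair of devices” would have to come from something the runtime does not expose as a CUDA device:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA-visible devices: %d\n", count);

    // List every device the runtime can see.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  device %d: %s (cc %d.%d)\n", i, prop.name, prop.major, prop.minor);
    }

    // Check peer-to-peer support for every ordered pair of devices.
    for (int i = 0; i < count; ++i)
        for (int j = 0; j < count; ++j)
            if (i != j) {
                int canAccess = 0;
                cudaDeviceCanAccessPeer(&canAccess, i, j);
                printf("  %d -> %d peer access: %s\n", i, j, canAccess ? "yes" : "no");
            }
    return 0;
}
```

With one device the peer-access loop prints nothing, which is exactly why the warning is puzzling on a single-GPU system.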
I am getting this warning when using nvprof: Warning: Unified Memory Profiling is not supported on devices of compute capability less than 3.0
However, it still shows profiling results, which I doubt are correct.
I am new to CUDA programming, so I am just looking into the sample codes.
Trying 3 different scenarios in the 1D stencil sample code, I get profiling times ordered as:
( kernel_1<<<1,N>>> ) < ( kernel_1<<<4,N/4>>> ) < ( kernel_2<<<1,N>>> )
N is multiple of 32.
kernel_2 is using shared memory.
According to theory, the order I got is not correct; it should be exactly the opposite.
Is there any way I can get the correct result?
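One thing worth ruling out is measurement noise: the first launch of a kernel pays one-time costs (context setup, module load), so timing each configuration with CUDA events after a warm-up launch tends to be more reliable than comparing single profiler numbers. A minimal sketch, with a placeholder stencil kernel standing in for the sample’s kernel_1 (names and sizes here are assumptions, not the sample’s actual code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

#define N 512
#define RADIUS 3

// Placeholder 1D stencil; substitute the sample's kernel_1 here.
__global__ void kernel_1(const int *in, int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N) return;
    int sum = 0;
    for (int off = -RADIUS; off <= RADIUS; ++off) {
        int j = i + off;
        if (j >= 0 && j < N) sum += in[j];
    }
    out[i] = sum;
}

// Time one launch configuration with CUDA events, after a warm-up run.
static float timeLaunch(int grid, int block, const int *in, int *out) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    kernel_1<<<grid, block>>>(in, out);   // warm-up absorbs one-time costs
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    kernel_1<<<grid, block>>>(in, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    int *in, *out;
    cudaMalloc(&in, N * sizeof(int));
    cudaMalloc(&out, N * sizeof(int));

    printf("<<<1,N>>>:   %.4f ms\n", timeLaunch(1, N, in, out));
    printf("<<<4,N/4>>>: %.4f ms\n", timeLaunch(4, N / 4, in, out));

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Note that for an N this small both launches are dominated by launch overhead, so the ordering can easily flip from run to run.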
Could you share the sample you used? Under normal circumstances, shared memory is faster than global memory, but the GPU has its own caches: global memory operations are served by the cache, and can sometimes even be faster than shared memory if the cache is large enough.
Above is the sample code I was talking about.
Your GPU cache explanation may be the reason for kernel_2, but I still feel that
( kernel_1<<<1,N>>> ) < ( kernel_1<<<4,N/4>>> ) is not correct.
Shouldn’t it be exactly the opposite?
I also have this problem, but what I care about is not the unified memory profiling itself; it is that, per the warning, the “system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower.” I have tested cudaMalloc() and cudaHostAlloc() to allocate 3 int arrays in a CUDA sample, and they cost 120+ ms. I think that’s unacceptable. I don’t know if this is related to “peer mappings are not available”.
My machine specs:
System: Windows 10 x64
CPU: i7-7700HQ
GPU: GTX 1060 6 GB
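One possible explanation worth checking before blaming zero-copy fallback: the first CUDA runtime call in a process triggers lazy context creation, which by itself can take on the order of 100 ms and gets billed to whichever allocation happens first. A sketch that separates the two costs by forcing initialization with cudaFree(0) before timing the allocations (error checking omitted; sizes are arbitrary):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Wall-clock time of a callable, in milliseconds.
template <typename F>
static double msecs(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // Force lazy context creation so it is not billed to the allocations below.
    double init = msecs([] { cudaFree(0); });

    int *d = nullptr, *h = nullptr;
    double devAlloc  = msecs([&] { cudaMalloc(&d, 3 * 1024 * sizeof(int)); });
    double hostAlloc = msecs([&] { cudaHostAlloc(&h, 3 * 1024 * sizeof(int),
                                                 cudaHostAllocDefault); });

    printf("context init:  %.2f ms\n", init);
    printf("cudaMalloc:    %.2f ms\n", devAlloc);
    printf("cudaHostAlloc: %.2f ms\n", hostAlloc);

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

If most of the 120+ ms shows up in the init line, the cost is one-time context setup rather than anything related to peer mappings or zero-copy memory.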