Every time I run the NVIDIA Visual Profiler (version 8.0), the following warning appears on the console:
==3348== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: Programming Guide :: CUDA Toolkit Documentation
The corresponding memory profiling information is missing. This happens whether I profile my own code or any CUDA sample application. But there appears to be only one visible GPU device on my PC, a GeForce GTX 1070, and “[0] GeForce GTX 1070” is the only root node shown on the profiler screen. Calling cudaGetDeviceCount also returns 1. If there is only ONE CUDA-visible GPU device, what is the “PAIR of devices without peer-to-peer support” mentioned in the warning? What is wrong with the profiler, or possibly with my system configuration, and how can it be fixed?
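For reference, one way to double-check which devices the CUDA runtime actually sees is a minimal enumeration program like the sketch below (assumes the CUDA toolkit is installed; this is just a diagnostic sketch, not part of the original post):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("CUDA-visible devices: %d\n", count);

    // Print name and compute capability of each visible device.
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("[%d] %s (compute capability %d.%d)\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```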
It’s just a UVM warning and won’t block anything. P2P is only supported between multiple GPUs of the same architecture. I guess you have two different GPUs in your computer, right?
Well, no. That’s what is confusing me. As I explained in the original message, cudaGetDeviceCount returns 1, so there is only one CUDA-visible GPU. There may be some other GPU built into the MSI motherboard, but it does not appear to be CUDA-enabled.
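If more than one device were visible, peer-to-peer support could be queried explicitly for every pair with cudaDeviceCanAccessPeer. A sketch (diagnostic only; with a single visible device the loop body never runs, which is what makes the warning so puzzling here):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    // Query every ordered pair of visible devices for P2P access support.
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("device %d -> device %d: P2P %s\n",
                   i, j, canAccess ? "supported" : "not supported");
        }
    }
    return 0;
}
```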
Hi, I am also having the same problem on Windows 10 64-bit with a GTX 1070. Has this bug been resolved yet? I also have only one GPU installed in my computer.
I am getting this warning when using nvprof: Warning: Unified Memory Profiling is not supported on devices of compute capability less than 3.0
However, it is showing profiling results that I doubt are correct.
I am new to CUDA programming, so I am just looking into the sample codes.
In the 1D stencil sample code, trying 3 different launch configurations, I get execution times ordered as:
( kernel_1<<<1,N>>> ) < ( kernel_1<<<4,N/4>>> ) < ( kernel_2<<<1,N>>> )
N is a multiple of 32.
kernel_2 uses shared memory.
According to theory, the order I got is not correct; it should be exactly the opposite.
Is there any way I can get the correct result?
Could you share the sample you use? Under normal circumstances shared memory is faster than global memory, but the GPU has its own caches; global memory accesses are served by the cache and can sometimes even be faster than shared memory if the cache is large enough.
__global__ void stencil(int *a, int *b, int Nn){
    //int i = threadIdx.x + blockIdx.x * blockDim.x;
    __shared__ int sa[128];
    int i = threadIdx.x;
    sa[i] = a[i];
    __syncthreads();
    if (i < Nn)
        b[i] = sa[i] + sa[i+1] + sa[i+2];
}

int main(){
    int host_a[N], host_b[N];
    int *dev_a, *dev_b;
    int count;
    int size = N * sizeof(int);
Above is the sample code I was talking about.
Your GPU cache explanation may account for kernel_2, but I still feel that
( kernel_1<<<1,N>>> ) < ( kernel_1<<<4,N/4>>> ) is not correct.
Shouldn’t it be exactly the opposite?
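One detail worth noting about the posted kernel, which may be part of the explanation: it computes i from threadIdx.x only, so with a <<<4, N/4>>> launch every block would process the same first N/4 elements rather than a distinct quarter of the array. A sketch of how the stencil might look with a global index and a shared-memory halo (my own hedged reconstruction, not the original sample; the tile size 128 is assumed to match blockDim.x):

```cpp
// Stencil with a global thread index so that a multi-block launch
// such as <<<4, N/4>>> covers distinct elements of the array.
__global__ void stencil(const int *a, int *b, int Nn) {
    __shared__ int sa[128 + 2];  // +2 halo elements for the 3-point stencil
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    if (i < Nn) {
        sa[threadIdx.x] = a[i];
        // The first two threads also load the block's halo elements.
        if (threadIdx.x < 2 && i + blockDim.x < Nn)
            sa[blockDim.x + threadIdx.x] = a[i + blockDim.x];
    }
    __syncthreads();

    if (i + 2 < Nn)
        b[i] = sa[threadIdx.x] + sa[threadIdx.x + 1] + sa[threadIdx.x + 2];
}
```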
I also have this problem, but what I care about is not the unified memory profiling itself; it’s the part of the warning that says the “system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower.” I have tested cudaMalloc() and cudaHostAlloc() to allocate 3 int arrays in a CUDA sample, and they cost 120+ ms, which I think is unacceptable. I don’t know if this is related to “peer mappings are not available”.
My machine specs:
System: Windows 10 x64
CPU: i7-7700HQ
GPU: GTX 1060 6GB
RAM: 32 GB
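The 120+ ms may be dominated by one-time CUDA context creation on the first runtime call rather than by the allocations themselves. A sketch that separates the two by issuing a cheap initializing call (cudaFree(0)) before timing the allocations (an assumption to verify on your machine, not a definitive diagnosis; the 3 * 1024 array size is a placeholder):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Time a single call with a host-side wall clock, in milliseconds.
template <typename F>
static double ms(F f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // First runtime call: triggers context creation, often the big cost.
    printf("cudaFree(0) init:  %.2f ms\n", ms([] { cudaFree(0); }));

    int *d = nullptr, *h = nullptr;
    printf("cudaMalloc:        %.2f ms\n",
           ms([&] { cudaMalloc(&d, 3 * 1024 * sizeof(int)); }));
    printf("cudaHostAlloc:     %.2f ms\n",
           ms([&] { cudaHostAlloc(&h, 3 * 1024 * sizeof(int),
                                  cudaHostAllocDefault); }));
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

If the first line absorbs most of the time, the allocations themselves are likely fine.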