"Unified Memory Profiling is not supported ..." warning 3348

Every time I run NVIDIA Visual Profiler (version 8.0), the following warning appears on the console:

==3348== Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory

The corresponding memory profiling information is missing. This happens when I use the profiler to analyze my own code or any CUDA sample application. However, there seems to be only one visible GPU device on my PC, a GeForce GTX 1070, and “[0] GeForce GTX 1070” is the only root node shown on the profiler screen. Calling cudaGetDeviceCount also returns 1. If there is only ONE CUDA-visible GPU device, what is the “PAIR of devices without peer-to-peer support” mentioned in the warning? What is wrong with the profiler, or possibly with my system configuration? How can it be fixed?

Thanks and Regards.

Hi anatoli,

It’s just a UVM warning and won’t block anything. P2P is only supported between multiple GPUs that have the same architecture. I guess you have two different GPUs in your computer, right?

Best Regards
Harry
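
To check what the CUDA runtime itself reports about devices and peer-to-peer support, a short program along the following lines may help. It is a minimal sketch that uses only standard runtime calls (cudaGetDeviceCount, cudaGetDeviceProperties, cudaDeviceCanAccessPeer). On a machine with a single CUDA-visible GPU the peer loop has no pairs to test, so it prints nothing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Ask the runtime how many CUDA-capable devices it can see.
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA-visible devices: %d\n", count);

    // List each device with its compute capability.
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("[%d] %s (compute capability %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }

    // Query peer access for every ordered pair of distinct devices.
    for (int i = 0; i < count; ++i) {
        for (int j = 0; j < count; ++j) {
            if (i == j) continue;
            int can_access = 0;
            cudaDeviceCanAccessPeer(&can_access, i, j);
            printf("peer access %d -> %d: %s\n", i, j, can_access ? "yes" : "no");
        }
    }
    return 0;
}
```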

Hi Harry,

“I guess you have two different GPUs in your computer, right?”

Well, no. That’s what is confusing me. In the original message I explained that cudaGetDeviceCount returns 1, so there is only one CUDA-visible GPU. There may be another GPU built into the MSI motherboard, but it does not seem to be CUDA-enabled.

Thanks,
Anatoli

Hi, anatoli

We have an internal bug: ‘Pascal does not support UVM profiling on Windows’.
Other profiling features should work well.

This explains the issue you encountered.
We’ll try to fix this ASAP.
Thanks for reporting it.

Hi veraj,

Thanks for clearing that up for me.

Hi, I am also having the same problem on Windows 10 64 bit. I am using a GTX 1070. Is this bug resolved yet? I also have only 1 GPU installed on my computer.

Thanks!

Hi, libertast

Sorry, the issue still exists for now.
We hope it will be fixed in the next release.

In the meantime, you can use --unified-memory-profiling off to work around this issue.
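
For anyone who wants to try the workaround, here is a small managed-memory program to profile. It is only a sketch: the file name um_test.cu and the kernel name increment are made up for illustration, and the nvprof command in the comment simply applies the option quoted above.

```cuda
// Build:   nvcc -o um_test um_test.cu        (um_test is a hypothetical name)
// Profile: nvprof --unified-memory-profiling off ./um_test
#include <cstdio>
#include <cuda_runtime.h>

// Adds 1 to every element of a managed (unified memory) array.
__global__ void increment(int *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *data = nullptr;

    // Allocate unified memory that both host and device can access.
    cudaError_t err = cudaMallocManaged(&data, n * sizeof(int));
    if (err != cudaSuccess) {
        printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < n; ++i) data[i] = i;   // initialize on the host

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                   // wait before touching data on the host

    printf("data[0] = %d, data[n-1] = %d\n", data[0], data[n - 1]);
    cudaFree(data);
    return 0;
}
```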

I am getting this warning when using nvprof:
Warning: Unified Memory Profiling is not supported on devices of compute capability less than 3.0
However, it still shows profiling results, which I doubt are correct.
I am new to CUDA programming, so I am just looking into sample code.
In the 1D stencil sample code, trying three different scenarios, I get the following ordering of profiling numbers:
( kernel_1<<<1,N>>> ) < ( kernel_1<<<4,N/4>>> ) < ( kernel_2<<<1,N>>> )

N is a multiple of 32.
kernel_2 uses shared memory.

According to theory, the order I got is not correct; it should be exactly the opposite.
Is there any way I could get a correct result?

Hi, ratan0612

The warning is fine. It just means unified memory profiling will not be done.

So what exactly is the problem? Do you think the result is not correct?

Can you explain in more detail?

Hi Veraj,
Yes, I think the profiling numbers, which show:
( kernel_1<<<1,N>>> ) < ( kernel_1<<<4,N/4>>> ) < ( kernel_2<<<1,N>>> )

are not correct.

Here, N = 128.

I think kernel_2 should be the fastest, since it uses shared memory.

Please correct me if my understanding is wrong.

Hello,

Could you share the sample you used? Under normal circumstances, shared memory is faster than global memory, but the GPU has its own cache. Global memory operations are optimized by the cache, and can sometimes be even faster than shared memory if the cache is large enough.

Best Regards
Harry

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "cuda_profiler_api.h"
#include <stdio.h>
#define N 128

__global__ void stencil(int *a, int *b, int Nn){
//int i=threadIdx.x + blockIdx.x * blockDim.x;
__shared__ int sa[128];
int i=threadIdx.x;
sa[i] = a[i];
__syncthreads();

if(i<Nn)
	b[i] = sa[i] + sa[i+1] + sa[i+2];

}

int main(){
int host_a[N], host_b[N];
int *dev_a, *dev_b;
int count;
int size = N * sizeof(int);

for(int i=0;i<N;i++)
{
	host_a[i]=i;
}

//GPU Memory Allocation
cudaMalloc( &dev_a, size);
cudaMalloc( &dev_b, size);


//Host to Device Memory copy (copy inputs)
cudaMemcpy( dev_a, host_a, size, cudaMemcpyHostToDevice);
cudaMemset( dev_b, 0, size);

//Kernel call (computation)
stencil<<<1,N>>>(dev_a, dev_b, N-2);

//Device to Host Memory copy (copy results)
cudaMemcpy( host_b, dev_b, size, cudaMemcpyDeviceToHost);

for(int i=0;i<N-2;i++)
{
	printf("%3d: %3d +%3d +%3d = %4d\n",i, host_a[i], host_a[i+1], host_a[i+2], host_b[i]);
}

getchar();
return 0;

}

Above is the sample code I was talking about.
Your GPU cache explanation may be the reason for kernel_2, but I still feel that
( kernel_1<<<1,N>>> ) < ( kernel_1<<<4,N/4>>> ) is not correct.
Shouldn’t it be exactly the opposite?
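
One way to compare the three launch configurations more reliably is to time them with CUDA events, after a warm-up launch and averaged over many repetitions. The sketch below assumes kernel_1 is a global-memory stencil like the one here called stencil_global (the original kernel_1 was not posted, so that part is a guess), and the helper time_kernel is hypothetical. With N = 128 each kernel does very little work, so the averages are likely to be dominated by launch overhead rather than by memory access, and small differences between them may not mean much.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 128

// Assumed form of kernel_1: every thread reads its three inputs from global memory.
__global__ void stencil_global(const int *a, int *b, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) b[i] = a[i] + a[i + 1] + a[i + 2];
}

// kernel_2 from the post: stage the input in shared memory (single block of N threads).
__global__ void stencil_shared(const int *a, int *b, int n) {
    __shared__ int sa[N];
    int i = threadIdx.x;
    sa[i] = a[i];
    __syncthreads();
    if (i < n) b[i] = sa[i] + sa[i + 1] + sa[i + 2];
}

typedef void (*StencilKernel)(const int *, int *, int);

// Hypothetical helper: returns the average time per launch in microseconds.
static float time_kernel(StencilKernel k, int blocks, int threads,
                         const int *a, int *b, int n, int reps) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    k<<<blocks, threads>>>(a, b, n);   // warm-up launch, not timed
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) k<<<blocks, threads>>>(a, b, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms * 1000.0f / reps;
}

int main() {
    int host_a[N];
    for (int i = 0; i < N; ++i) host_a[i] = i;

    int *dev_a = nullptr, *dev_b = nullptr;
    cudaMalloc(&dev_a, N * sizeof(int));
    cudaMalloc(&dev_b, N * sizeof(int));
    cudaMemcpy(dev_a, host_a, N * sizeof(int), cudaMemcpyHostToDevice);

    const int reps = 1000;
    printf("global <<<1,N>>>   : %.3f us\n",
           time_kernel(stencil_global, 1, N, dev_a, dev_b, N - 2, reps));
    printf("global <<<4,N/4>>> : %.3f us\n",
           time_kernel(stencil_global, 4, N / 4, dev_a, dev_b, N - 2, reps));
    printf("shared <<<1,N>>>   : %.3f us\n",
           time_kernel(stencil_shared, 1, N, dev_a, dev_b, N - 2, reps));

    cudaFree(dev_a);
    cudaFree(dev_b);
    return 0;
}
```

Increasing N to a much larger problem size (with a correspondingly tiled shared-memory kernel) would be needed before differences in memory access could be expected to show up clearly.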

I also have this problem, but what I care about is not the unified memory profiling. I am concerned about what the warning describes: “system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower.” I tested cudaMalloc() and cudaHostAlloc() to allocate 3 int arrays in the CUDA sample code, and they take 120+ ms. I think that’s unacceptable. I don’t know if this is related to “peer mappings are not available”.

My machine specs:
system: windows 10 x64
cpu: i7 7700 hq
gpu: gtx 1060 6g
ram: 32G
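
One possible explanation for a slow first allocation, not confirmed for this case, is that the CUDA runtime creates its context lazily, so the first runtime call that needs the device also pays the initialization cost. The sketch below separates the two by forcing initialization with cudaFree(0) before timing cudaMalloc. The helper ms_since and the allocation size are made up for illustration.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: milliseconds elapsed since t0 on the host clock.
static double ms_since(std::chrono::steady_clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    // The first runtime call that needs the device triggers context creation.
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);
    printf("first call (context init): %.3f ms\n", ms_since(t0));

    // With the context already created, time an allocation on its own.
    int *d = nullptr;
    t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaMalloc(&d, 3 * 1024 * sizeof(int));
    printf("cudaMalloc after init:     %.3f ms (%s)\n",
           ms_since(t0), cudaGetErrorString(err));

    cudaFree(d);
    return 0;
}
```

If most of the 120+ ms shows up in the first line, the cost is one-time initialization rather than the allocation itself.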

Hi, luan.dian

Would you please share your source code and explain your issue in more detail?

I have had the same problem.
My machine specs:
system: windows 10 x64
cpu: i7 7700 hq
gpu: gtx 1060 6g
ram: 16G

But today, after updating to the latest driver, the warning disappeared and the kernel started running about 1000 times faster.