NVProf error on samples

Hello,

I’m trying to use nvprof on the CUDA samples.
When I run it, I get this error:

nvprof ./vectorAdd
[Vector addition of 50000 elements]
==5909== NVPROF is profiling process 5909, command: ./vectorAdd
==5909== Profiling application: ./vectorAdd
==5909== Profiling result:
No kernels were profiled.

==5909== API calls:
No API activities were profiled.
==5909== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139

Is there an installation problem?
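A note on the last line of that log: a status of 139 is conventionally read as 128 + 11, i.e. the process was killed by signal 11. The sketch below only does that arithmetic and asks the shell for the signal name; treating nvprof's "signal 139" this way is an assumption based on the usual shell convention, not something the log itself states.

```shell
# Exit statuses above 128 conventionally mean "killed by signal (status - 128)".
status=139
sig=$((status - 128))
echo "signal number: $sig"

# Ask the shell for the name of that signal (SEGV on Linux).
kill -l "$sig"
```

If that reading is right, the application (or the profiler's injected library) crashed with a segmentation fault, which fits a toolkit/driver mismatch better than a missing cudaProfilerStop() call.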

I’m still stuck with my nvprof problem.
Here is the deviceQuery output for more information:

./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060 6GB"
  CUDA Driver Version / Runtime Version          8.0 / 7.5
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6071 MBytes (6365773824 bytes)
MapSMtoCores for SM 6.1 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 6.1 is undefined.  Default to use 128 Cores/SM
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1709 MHz (1.71 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 1060 6GB
Result = PASS

I tried the same command on my other computer, and nvprof works properly on the samples there.
I don’t understand why it doesn’t work on the first one.

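Switch to CUDA 8. Your GTX 1060 is a compute capability 6.1 device, and the "MapSMtoCores for SM 6.1 is undefined" lines show that the 7.5 runtime doesn’t know about it.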

Sorry for the delay between posts.
I just installed CUDA 8.0, but I seem to get the same kind of error:

/usr/local/cuda-8.0/bin/nvprof ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

==6320== NVPROF is profiling process 6320, command: ./deviceQuery
Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060 6GB"
  CUDA Driver Version / Runtime Version          8.0 / 7.5
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6069 MBytes (6363873280 bytes)
MapSMtoCores for SM 6.1 is undefined.  Default to use 128 Cores/SM
MapSMtoCores for SM 6.1 is undefined.  Default to use 128 Cores/SM
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1709 MHz (1.71 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 1060 6GB
Result = PASS
======== Error: unified memory profiling failed.

When I try it in Nsight, I get this error:

======== Error: Encountered invalid option : --openacc-profiling
======== Use "nvprof --help" to get more information.

Have you tried some other sample applications? I don’t think deviceQuery contains any kernels that could be profiled; it just queries the CUDA runtime/driver for configuration data. Also, heed the warning seen in your original posting: make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.

What operating system are you using? Is it on the list of supported operating systems for the CUDA version you are using?

I just tried the vectorAdd example and got this:

/usr/local/cuda-8.0/bin/nvprof ./vectorAdd
[Vector addition of 50000 elements]
==1992== NVPROF is profiling process 1992, command: ./vectorAdd
======== Error: unified memory profiling failed.

So if I understand you correctly, I have to add cudaProfilerStop() before the end of the vectorAdd example?

I use Ubuntu 16.04.

Best case it will result in a profile, worst case it will do nothing. There seems to be no downside.

I haven’t had any issues with the CUDA 8 profiler yet, but then I am on Windows, not Ubuntu (even when on Linux, I try to stay as far away from Ubuntu as possible :-)

You seem to have some sort of mixed config on your machine:

CUDA Driver Version / Runtime Version          8.0 / 7.5

You should be using a CUDA 8 driver, CUDA 8 runtime, and nvprof from the same CUDA 8 install that you got the runtime from.
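One way to spot a mixed install like this is to check which copies of the tools a bare command name resolves to. The following is only a sketch: `report_tool` is a hypothetical helper written for this post, and `/usr/local` is assumed to be the install prefix because that is where the paths in this thread point.

```shell
# Print where a tool resolves from, or note that it is not on PATH.
# (report_tool is a hypothetical helper, not part of the CUDA toolkit.)
report_tool() {
    if tool_path=$(command -v "$1" 2>/dev/null); then
        printf '%s -> %s\n' "$1" "$tool_path"
    else
        printf '%s -> not found on PATH\n' "$1"
    fi
}

report_tool nvcc
report_tool nvprof

# Toolkit release as reported by whichever nvcc is first on PATH, if any.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version | grep release
fi

# List every toolkit directory under the assumed default prefix.
ls -d /usr/local/cuda* 2>/dev/null || echo "no /usr/local/cuda* directories"
```

Since deviceQuery reports "CUDART static linking", the runtime version it prints comes from the toolkit that built the sample, so a sample built with the 7.5 nvcc will keep reporting 7.5 until it is rebuilt.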

How do I update the runtime?

Follow the instructions in the Linux install guide to install CUDA properly.

Here is the new deviceQuery info:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

==2681== NVPROF is profiling process 2681, command: ./deviceQuery
Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1060 6GB"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6069 MBytes (6363873280 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1709 MHz (1.71 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1060 6GB
Result = PASS

The RunTime and the Driver seem OK. But I still have errors:

[Vector addition of 50000 elements]
==2896== NVPROF is profiling process 2896, command: ./vectorAdd
==2896== Profiling application: ./vectorAdd
==2896== Profiling result:
No kernels were profiled.

==2896== API calls:
No API activities were profiled.
==2896== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139

And when I try vectorAdd_nvrtc with cuProfilerStop() added, I get this:

./vectorAdd_nvrtc: error while loading shared libraries: libnvrtc.so.8.0: cannot open shared object file: No such file or directory

Then I exported the location of the libnvrtc library:

export LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib64

and got this:

nvprof ./vectorAdd_nvrtc 
==4579== NVPROF is profiling process 4579, command: ./vectorAdd_nvrtc
> Using CUDA Device [0]: GeForce GTX 1060 6GB
> GPU Device has SM 6.1 compute capability
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
==4579== Profiling application: ./vectorAdd_nvrtc
======== Error: Unable to import nvprof generated profile data.

And with the full path to nvprof in CUDA 8.0, I get this:

/usr/local/cuda-8.0/bin/nvprof ./vectorAdd_nvrtc 
==4669== NVPROF is profiling process 4669, command: ./vectorAdd_nvrtc
> Using CUDA Device [0]: GeForce GTX 1060 6GB
> GPU Device has SM 6.1 compute capability
======== Error: unified memory profiling failed.

You didn’t follow the instructions in the Linux install guide correctly.

It is necessary to set both the PATH and the LD_LIBRARY_PATH environment variables correctly.

http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#mandatory-post

The fact that you get a different result when you do:

nvprof …

and

/usr/local/cuda-8.0/bin/nvprof …

proves that your PATH environment variable is not set correctly for CUDA 8.0 (and that you have multiple CUDA versions installed on your machine, which is confusing you)
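For reference, the post-install step from the guide amounts to something like the sketch below. It assumes the toolkit is in /usr/local/cuda-8.0, as in the commands above; `CUDA_HOME` is just a convenience variable chosen for this example, not something the guide requires.

```shell
# Assumes the toolkit lives in /usr/local/cuda-8.0, as in this thread.
# CUDA_HOME is an illustrative variable name.
CUDA_HOME=/usr/local/cuda-8.0

# Prepend the CUDA 8.0 directories so they win over any older install.
export PATH="$CUDA_HOME/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# A bare "nvprof" should now resolve to the CUDA 8.0 copy.
command -v nvprof || echo "nvprof not found; check that $CUDA_HOME/bin exists"
```

Putting the two export lines in ~/.bashrc makes them apply to new shells. Prepending, rather than overwriting LD_LIBRARY_PATH, keeps any other library directories you already rely on.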

I’m getting the unified memory profiling error. For example, when I run it on the matrixMul sample, I get:

$ nvprof ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
==4956== NVPROF is profiling process 4956, command: ./matrixMul
GPU Device 0: "GeForce GTX 1080 Ti" with compute capability 6.1

MatrixA(320,320), MatrixB(640,320)
======== Error: unified memory profiling failed.

My device query looks good:

$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1

Runtime is fine and I have no issues doing debug build and run in Nsight. I also feel confident that my paths are set properly.

$ which nvprof
/usr/local/cuda-8.0/bin/nvprof
$ echo $PATH
/usr/local/cuda-8.0/bin

I had read somewhere that running with sudo would fix it but I just get:

$ sudo nvprof ./matrixMul
sudo: nvprof: command not found

Finally, as a test, I changed ownership of nvprof from root to my user and tried that with no success.

So, you’re running the profiler in the directory where the matrixMul sample code is?

Do you have write access to that directory?

If you don’t, that is the problem. Copy the executable to a directory where you have write access, then run the profiler on it there.

nvprof won’t run correctly as root user unless your root user has an appropriate PATH definition. That is the reason for the command not found error.
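Both points can be checked from the shell before launching the profiler. This is a sketch, not an official procedure: `can_write_here` is a hypothetical helper, and the remark about sudo assumes a typical configuration in which sudo replaces PATH with its own default.

```shell
# Succeeds only if the given directory (default: the current one) exists and is writable.
# (can_write_here is a hypothetical helper written for this example.)
can_write_here() {
    [ -d "${1:-.}" ] && [ -w "${1:-.}" ]
}

if can_write_here; then
    echo "current directory is writable"
else
    echo "current directory is not writable; copy the executable elsewhere first"
fi

# Under sudo, PATH is usually reset, so resolve nvprof first and pass the full path.
nvprof_bin=$(command -v nvprof || true)
if [ -n "$nvprof_bin" ]; then
    echo "would run: sudo $nvprof_bin ./matrixMul"
fi
```

The last step only prints the command instead of running it, so the snippet is safe to paste as-is.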

Yes, I’m running nvprof in the dir where matrixMul is.
Also, I do have write access to the dir and all files (for user); all dirs/files for all NVIDIA sample code are in my home dir. I hadn’t thought of ownership but this is an example of what my permissions look like for dir matrixMul.

drwxr-xr-x

Group and Other don’t have write but User does.

I don’t have any trouble running nvprof on matrixMul as long as nvprof is launched in a directory that I have write access to:

$ nvprof /usr/local/cuda/samples/bin/x86_64/linux/release/matrixMul
[Matrix Multiply Using CUDA] - Starting...
==28488== NVPROF is profiling process 28488, command: /usr/local/cuda/samples/bin/x86_64/linux/release/matrixMul
GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 1231.66 GFlop/s, Time= 0.106 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==28488== Profiling application: /usr/local/cuda/samples/bin/x86_64/linux/release/matrixMul
==28488== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 99.48%  31.566ms       301  104.87us  103.75us  106.02us  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
  0.32%  102.15us         2  51.075us  34.274us  67.876us  [CUDA memcpy HtoD]
  0.20%  62.532us         1  62.532us  62.532us  62.532us  [CUDA memcpy DtoH]

==28488== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 88.48%  309.24ms         3  103.08ms  2.8910us  309.23ms  cudaMalloc
  8.59%  30.026ms         1  30.026ms  30.026ms  30.026ms  cudaEventSynchronize
  1.10%  3.8606ms       364  10.606us     244ns  460.14us  cuDeviceGetAttribute
  0.75%  2.6364ms         4  659.10us  643.85us  678.59us  cuDeviceTotalMem
  0.40%  1.3857ms       301  4.6030us  4.2080us  26.853us  cudaLaunch
  0.25%  875.08us         1  875.08us  875.08us  875.08us  cudaGetDeviceProperties
  0.17%  604.28us         3  201.43us  62.547us  415.22us  cudaMemcpy
  0.08%  274.53us         4  68.631us  63.785us  80.507us  cuDeviceGetName
  0.06%  220.92us      1505     146ns     132ns     607ns  cudaSetupArgument
  0.05%  159.77us         3  53.255us  3.9990us  144.14us  cudaFree
  0.04%  126.06us         1  126.06us  126.06us  126.06us  cudaDeviceSynchronize
  0.02%  62.585us       301     207ns     195ns  1.2970us  cudaConfigureCall
  0.00%  8.0260us         1  8.0260us  8.0260us  8.0260us  cudaGetDevice
  0.00%  5.6270us         2  2.8130us  2.0500us  3.5770us  cudaEventRecord
  0.00%  5.3660us        12     447ns     258ns     978ns  cuDeviceGet
  0.00%  2.8090us         3     936ns     285ns  1.9910us  cuDeviceGetCount
  0.00%  2.5730us         2  1.2860us     569ns  2.0040us  cudaEventCreate
  0.00%  2.0500us         1  2.0500us  2.0500us  2.0500us  cudaEventElapsedTime
$

CUDA 8.0.61 Ubuntu 14.04 Pascal Titan X

Strange. I even copied nvprof to the matrixMul dir. I have the following permissions:

$ ls -l
total 11688
-rw-r--r-- 1       9058 Jul  2 14:28 Makefile
-rwxrwxr-x 1     624288 Jul  5 22:39 matrixMul
-rw-r--r-- 1      13407 Jul  2 14:28 matrixMul.cu
-rw-rw-r-- 1      71472 Jul  5 22:37 matrixMul.o
-rw-r--r-- 1       2414 Jul  2 14:28 NsightEclipse.xml
-rwxr-xr-x 1   11227288 Jul  5 23:09 nvprof
-rw-r--r-- 1        580 Jul  2 14:28 readme.txt

I’ve removed the User/Group fields, but I did verify that I own them.

Then, when I try and run it locally:

$ ./nvprof ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
==6653== NVPROF is profiling process 6653, command: ./matrixMul
GPU Device 0: "GeForce GTX 1080 Ti" with compute capability 6.1

MatrixA(320,320), MatrixB(640,320)
======== Error: unified memory profiling failed.

No, you don’t want to copy nvprof anywhere.

It needs to stay where the CUDA installer put it, the PATH needs to be set up properly, and it needs to be invoked in a directory where you have write privileges.

I don’t know what is wrong, exactly.

Yes, I just copied as a test.

nvprof is in

$ which nvprof
/usr/local/cuda-8.0/bin/nvprof

My PATH is set correctly; I have export lines in my ~/.bashrc, resulting in

$ echo $PATH
/usr/local/cuda-8.0/bin:[...]

And I believe I have no write/ownership issues with the sample dirs.

Now, just to clarify, I don’t need ownership and write-access to the bin dir that contains nvprof, correct? The dir where I do have write privileges and where I attempt to run it are the sample code dirs.

I did try turning the switch off for unified memory profiling and that allowed nvprof to proceed. Now, I’m just wondering why the unified mem profiling doesn’t work.

$ nvprof --unified-memory-profiling off ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
==2963== NVPROF is profiling process 2963, command: ./matrixMul
GPU Device 0: "GeForce GTX 1080 Ti" with compute capability 6.1

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
[...]

I’ll play around with it some more and report back if I succeed. Thanks for the help, txbob.

Ubuntu 16.04.2 LTS | CUDA 8 | GTX 1080 Ti

I’m having the same exact problem as intra. Ubuntu 16, GTX 1080ti. My nvprof doesn’t work but it does when I turn off unified profiling. Same error messages.

Also, for some reason, even when I use cudaMalloc with manual memory copies, it’s still as slow as cudaMallocManaged. I’m just adding two matrices together (say, A and B) and storing the result in A, the exact same code as from here: https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/

That code just has two 1-million length vectors, and adds them together.

Here’s the profiler for the cudaMalloc + cudaMemcpy (sorry, I don’t know how to paste the code in a nice way like the other comment):

==2908== NVPROF is profiling process 2908, command: ./good_parallel
Max error: 0
==2908== Profiling application: ./good_parallel
==2908== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 61.67%  999.63us         2  499.82us  496.18us  503.45us  [CUDA memcpy HtoD]
 36.20%  586.87us         1  586.87us  586.87us  586.87us  [CUDA memcpy DtoH]
  2.13%  34.497us         1  34.497us  34.497us  34.497us  add(int, float*, float*)

==2908== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.56%  222.70ms         2  111.35ms  415.33us  222.29ms  cudaMalloc
  0.88%  1.9970ms         3  665.67us  551.49us  809.86us  cudaMemcpy
  0.20%  451.83us        90  5.0200us     192ns  214.61us  cuDeviceGetAttribute
  0.16%  354.55us         1  354.55us  354.55us  354.55us  cuDeviceTotalMem
  0.12%  264.22us         2  132.11us  89.582us  174.64us  cudaFree
  0.04%  88.328us         1  88.328us  88.328us  88.328us  cudaDeviceSynchronize
  0.02%  48.071us         1  48.071us  48.071us  48.071us  cuDeviceGetName
  0.02%  37.264us         1  37.264us  37.264us  37.264us  cudaLaunch
  0.00%  9.0670us         3  3.0220us     148ns  8.2100us  cudaSetupArgument
  0.00%  3.0510us         2  1.5250us     846ns  2.2050us  cuDeviceGetCount
  0.00%  1.9160us         1  1.9160us  1.9160us  1.9160us  cudaConfigureCall
  0.00%     727ns         2     363ns     354ns     373ns  cuDeviceGet

And here’s the one for cudaMallocManaged:

==3360== Profiling application: ./unified
==3360== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  1.7025ms         1  1.7025ms  1.7025ms  1.7025ms  add(int, float*, float*)

==3360== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.57%  229.41ms         2  114.71ms  75.929us  229.34ms  cudaMallocManaged
  0.74%  1.7121ms         1  1.7121ms  1.7121ms  1.7121ms  cudaDeviceSynchronize
  0.28%  645.53us         1  645.53us  645.53us  645.53us  cuDeviceTotalMem
  0.21%  496.54us        90  5.5170us     297ns  212.46us  cuDeviceGetAttribute
  0.14%  319.68us         2  159.84us  158.89us  160.80us  cudaFree
  0.04%  87.306us         1  87.306us  87.306us  87.306us  cudaLaunch
  0.02%  49.088us         1  49.088us  49.088us  49.088us  cuDeviceGetName
  0.00%  5.7640us         3  1.9210us     164ns  5.1360us  cudaSetupArgument
  0.00%  2.6810us         2  1.3400us     386ns  2.2950us  cuDeviceGetCount
  0.00%  1.6650us         1  1.6650us  1.6650us  1.6650us  cudaConfigureCall
  0.00%  1.0040us         2     502ns     323ns     681ns  cuDeviceGet

Both are running the same exact code except for the change in the malloc’ing.

Another thing is that, even though the cudaMalloc version reports only microseconds in the kernel, it’s still slower than the serial code. I’m assuming this has to do with the transfer speed to the device? At what point is it supposed to speed up? Even increasing the size of the array seems to only increase the run time. Is there something this author is doing wrong?

edit: it might be because of how little computation is being done. If in the kernel I add a long inner loop, the parallel part starts to outweigh the data transfer speed, so I’m going to assume the data transfer speed is lagging.
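The numbers in the two profiles seem to back this up. As a rough sketch, the helper below (`transfer_ratio`, a hypothetical name) divides the total copy time by the kernel time, using the figures from the cudaMalloc profile above.

```shell
# Ratio of total copy time to kernel time; all three arguments in the same unit.
# (transfer_ratio is a hypothetical helper written for this example.)
transfer_ratio() {
    awk -v h="$1" -v d="$2" -v k="$3" 'BEGIN { printf "%.1f\n", (h + d) / k }'
}

# HtoD total, DtoH total, and kernel time in microseconds, from the cudaMalloc profile.
transfer_ratio 999.63 586.87 34.497
```

The copies take roughly 46 times as long as the add kernel itself. Copies plus kernel come to about 1.62 ms, which is close to the 1.7025 ms kernel time in the cudaMallocManaged run. One plausible reading, and it is an assumption rather than something the profiles state, is that in the managed case the data movement happens inside the kernel's measured time, so both versions pay about the same transfer cost and only more arithmetic per element would change the balance.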