Profiler error on Jetson TX1

I get an error when profiling any application on the Jetson TX1:

inf5063-g15@tegra-6:~/NVIDIA_CUDA-8.0_Samples/bin/aarch64/linux/release$ nvprof ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
==18746== NVPROF is profiling process 18746, command: ./matrixMul
==18746== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
GPU Device 0: "NVIDIA Tegra X1" with compute capability 5.3

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
==18746== Error: Internal profiling error 3755:999.
done
======== Error: CUDA profiling error.

I am using the latest JetPack (L4T R28.1), with nvprof version 8.0.84 (21).

Running the application without the profiler works fine.

Hi, haaknonks

This is a basic profiling feature and should be supported.

In the meantime, please try "nvprof --unified-memory-profiling off" and see if it works.
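For reference, the full invocation would look like the sketch below. It only assembles and echoes the command line (so it can be checked without a TX1); on the board itself you would run the `nvprof` line directly from the sample's release directory:

```shell
# Hypothetical invocation: turn off unified-memory profiling, which the
# warning in the log says is not supported on this platform.
CMD="nvprof --unified-memory-profiling off ./matrixMul"
echo "$CMD"   # on the TX1, run this command instead of echoing it
```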

If it still does not work, please do the following to check:

  1. Install the CUDA 8.0.84 toolkit on your host side via JetPack 3.1.
  2. export VIPER_DEBUG=1
  3. Execute "nvvp" to launch the Visual Profiler.
  4. Remote-connect to your Jetson TX1 and select any SDK sample, e.g. "vectorAdd".
  5. Generate the timeline with the default session settings.
  6. Check whether any error info is printed in the Linux console or in the Console tab of nvvp.
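Steps 2 and 3 above can be sketched as a shell session. This is a minimal sketch: VIPER_DEBUG is the debug flag named in step 2, and the nvvp launch is commented out so the snippet also runs on a machine without the CUDA toolkit installed:

```shell
# Step 2: enable verbose profiler debug logging for this shell session.
export VIPER_DEBUG=1
echo "VIPER_DEBUG=$VIPER_DEBUG"   # confirm the variable is exported

# Step 3: launch the Visual Profiler from this same shell, so its debug
# output lands in this console (requires the CUDA toolkit on the host).
# nvvp &
```

Launching nvvp from the shell that exported VIPER_DEBUG matters: the variable is only inherited by child processes of that shell, so starting nvvp from a desktop launcher would not pick it up.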

Thanks!

Hi, haaknonks

I tried this on a Jetson TX1 with the latest JetPack 3.1 and everything works normally, so there must be some setup issue on your side.

ubuntu@tegra-ubuntu:/usr/local/cuda/samples/0_Simple/matrixMul$ /usr/local/cuda/bin/nvprof ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
==2491== NVPROF is profiling process 2491, command: ./matrixMul
==2491== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
GPU Device 0: "NVIDIA Tegra X1" with compute capability 5.3

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 32.09 GFlop/s, Time= 4.085 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==2491== Profiling application: ./matrixMul
==2491== Profiling result:
Time(%) Time Calls Avg Min Max Name
99.90% 1.24845s 301 4.1477ms 4.0517ms 23.748ms void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
0.09% 1.1605ms 2 580.24us 394.43us 766.05us [CUDA memcpy HtoD]
0.01% 84.283us 1 84.283us 84.283us 84.283us [CUDA memcpy DtoH]

==2491== API calls:
Time(%) Time Calls Avg Min Max Name
78.25% 1.20700s 1 1.20700s 1.20700s 1.20700s cudaEventSynchronize
18.76% 289.31ms 3 96.438ms 462.82us 288.39ms cudaMalloc
1.59% 24.561ms 1 24.561ms 24.561ms 24.561ms cudaDeviceSynchronize
1.02% 15.783ms 301 52.435us 35.521us 964.34us cudaLaunch
0.20% 3.0403ms 3 1.0134ms 574.17us 1.5553ms cudaMemcpy
0.07% 1.1471ms 1505 762ns 468ns 1.7700us cudaSetupArgument
0.04% 688.14us 3 229.38us 191.77us 271.25us cudaFree
0.02% 363.44us 2 181.72us 26.406us 337.04us cudaEventRecord
0.02% 339.12us 301 1.1260us 781ns 4.3750us cudaConfigureCall
0.01% 78.281us 91 860ns 468ns 18.438us cuDeviceGetAttribute
0.00% 25.626us 1 25.626us 25.626us 25.626us cudaGetDeviceProperties
0.00% 22.396us 2 11.198us 5.2080us 17.188us cudaEventCreate
0.00% 12.656us 1 12.656us 12.656us 12.656us cudaGetDevice
0.00% 10.469us 1 10.469us 10.469us 10.469us cudaEventElapsedTime
0.00% 5.1570us 1 5.1570us 5.1570us 5.1570us cuDeviceTotalMem
0.00% 4.7920us 3 1.5970us 833ns 2.8130us cuDeviceGetCount
0.00% 2.6040us 3 868ns 625ns 1.1460us cuDeviceGet
0.00% 1.1460us 1 1.1460us 1.1460us 1.1460us cuDeviceGetName