I get an error when profiling any application on a Jetson TX1:
inf5063-g15@tegra-6:~/NVIDIA_CUDA-8.0_Samples/bin/aarch64/linux/release$ nvprof ./matrixMul
[Matrix Multiply Using CUDA] - Starting…
==18746== NVPROF is profiling process 18746, command: ./matrixMul
==18746== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: Programming Guide :: CUDA Toolkit Documentation
GPU Device 0: “NVIDIA Tegra X1” with compute capability 5.3
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
==18746== Error: Internal profiling error 3755:999.
done
======== Error: CUDA profiling error.
I am using the latest JetPack (R28.1) with nvprof version 8.0.84 (21).
Running the applications without the profiler works.
I tried this on a Jetson TX1 with the latest JetPack 3.1, and everything works normally.
There must be a setup issue on your side.
ubuntu@tegra-ubuntu:/usr/local/cuda/samples/0_Simple/matrixMul$ /usr/local/cuda/bin/nvprof ./matrixMul
[Matrix Multiply Using CUDA] - Starting…
==2491== NVPROF is profiling process 2491, command: ./matrixMul
==2491== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: Programming Guide :: CUDA Toolkit Documentation
GPU Device 0: “NVIDIA Tegra X1” with compute capability 5.3
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
Performance= 32.09 GFlop/s, Time= 4.085 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==2491== Profiling application: ./matrixMul
==2491== Profiling result:
Time(%) Time Calls Avg Min Max Name
99.90% 1.24845s 301 4.1477ms 4.0517ms 23.748ms void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
0.09% 1.1605ms 2 580.24us 394.43us 766.05us [CUDA memcpy HtoD]
0.01% 84.283us 1 84.283us 84.283us 84.283us [CUDA memcpy DtoH]
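If it helps narrow down the setup difference, one possible first step is to compare which nvprof binary and CUDA toolkit each board actually resolves, since the failing run calls `nvprof` from PATH while the working run calls `/usr/local/cuda/bin/nvprof` directly. The sketch below is only a diagnostic aid, not a confirmed fix; `describe_tool` is a hypothetical helper name, and the `/usr/local/cuda` location is assumed from the working run above.

```shell
#!/bin/sh
# Hypothetical helper: report where a tool resolves on PATH, or say it is missing.
describe_tool() {
    tool_path=$(command -v "$1" 2>/dev/null)
    if [ -n "$tool_path" ]; then
        echo "$1: $tool_path"
    else
        echo "$1: not found on PATH"
    fi
}

describe_tool nvprof
describe_tool nvcc

# Print the profiler version if nvprof is available (the report mentions 8.0.84).
if command -v nvprof >/dev/null 2>&1; then
    nvprof --version
fi

# Show what /usr/local/cuda points to, assuming the default install location.
if [ -e /usr/local/cuda ]; then
    ls -ld /usr/local/cuda
fi
```

Running this on both boards and comparing the output would show whether the two setups use the same profiler and toolkit install.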