Visual Profiler on Jetson Nano

Hi all,

I would like to know if I can get a standalone version of the Visual Profiler which I can run on the Jetson Nano.

I don't have a separate host system for development. The Jetson Nano is both my host and my target.

I would like to import and analyze the metrics reported by nvprof.

   --analysis-metrics
                    Collect profiling data that can be imported to Visual Profiler's
                    "analysis" mode. Note: Use "--export-profile" to specify
                    an export file.

Thanks

I ran into another problem:

The following command freezes the system:

sudo /usr/local/cuda/bin/nvprof --export-profile profile.txt --analysis-metrics ./matrix_mul_gen_tiled6

I cannot even collect performance metrics!
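
One thing that might be worth trying, sketched here under assumptions and not confirmed as a fix: nvprof replays kernels to gather metrics, and `--analysis-metrics` requests a large set of them, so limiting collection to a single kernel with `--kernels` reduces the amount of replay work. The kernel name `matrixMulTiled` below is hypothetical (the real kernel name is not given in this thread), and the output name `profile.nvvp` is only an illustrative choice.

```shell
#!/bin/sh
# Sketch only: restrict metric collection to one kernel to reduce replay work.
# KERNEL is a hypothetical name; replace it with a kernel from your application.
NVPROF=/usr/local/cuda/bin/nvprof
APP=./matrix_mul_gen_tiled6
KERNEL=matrixMulTiled
OUT=profile.nvvp

# --kernels has to come before the options it scopes (here --analysis-metrics).
scoped_cmd() {
    echo "sudo $NVPROF --kernels $KERNEL --analysis-metrics --export-profile $OUT $APP"
}

# Print the command so it can be checked before anything is run.
scoped_cmd

# Run it only if nvprof is installed at the assumed path.
if [ -x "$NVPROF" ]; then
    eval "$(scoped_cmd)"
fi
```

If the scoped run completes, the same pattern can be repeated per kernel instead of profiling the whole application in one pass.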

Hi,

You can use nvprof or nsys on the Jetson platform directly.
For example, with the matrixMul sample from our CUDA toolkit:

$ /usr/local/cuda-10.2/bin/cuda-install-samples-10.2.sh .
$ cd NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul
$ make
$ sudo /usr/local/cuda-10.2/bin/nvprof ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
==16817== NVPROF is profiling process 16817, command: ./matrixMul
==16817== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
GPU Device 0: "Maxwell" with compute capability 5.3

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 29.65 GFlop/s, Time= 4.420 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performancemeasurements. Results may vary when GPU Boost is enabled.
==16817== Profiling application: ./matrixMul
==16817== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.98%  1.32856s       301  4.4138ms  4.3783ms  4.4903ms  void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
                    0.01%  128.87us         2  64.433us  45.525us  83.341us  [CUDA memcpy HtoD]
                    0.01%  88.444us         1  88.444us  88.444us  88.444us  [CUDA memcpy DtoH]
      API calls:   72.87%  1.31418s         1  1.31418s  1.31418s  1.31418s  cudaEventSynchronize
                   26.02%  469.30ms         3  156.43ms  979.34us  467.34ms  cudaMalloc
                    0.69%  12.407ms       301  41.219us  31.094us  1.0969ms  cudaLaunchKernel
                    0.26%  4.6541ms         2  2.3271ms  14.636us  4.6395ms  cudaStreamSynchronize
                    0.12%  2.1583ms         3  719.42us  248.55us  1.4109ms  cudaMemcpyAsync
                    0.03%  480.89us         3  160.30us  133.80us  209.22us  cudaFree
                    0.01%  105.53us        97  1.0870us     572ns  24.949us  cuDeviceGetAttribute
                    0.00%  62.137us         2  31.068us  25.105us  37.032us  cudaEventRecord
                    0.00%  34.064us         1  34.064us  34.064us  34.064us  cudaStreamCreateWithFlags
                    0.00%  27.761us         7  3.9650us  1.6670us  15.312us  cudaDeviceGetAttribute
                    0.00%  13.125us         2  6.5620us  4.1150us  9.0100us  cudaEventCreate
                    0.00%  10.156us         2  5.0780us  3.1250us  7.0310us  cudaEventDestroy
                    0.00%  9.8440us         1  9.8440us  9.8440us  9.8440us  cudaEventElapsedTime
                    0.00%  9.0110us         1  9.0110us  9.0110us  9.0110us  cuDeviceTotalMem
                    0.00%  8.0730us         1  8.0730us  8.0730us  8.0730us  cudaSetDevice
                    0.00%  6.7720us         3  2.2570us  1.2500us  3.4380us  cuDeviceGetCount
                    0.00%  3.3850us         2  1.6920us  1.0410us  2.3440us  cuDeviceGet
                    0.00%  2.3960us         1  2.3960us  2.3960us  2.3960us  cuDeviceGetName
                    0.00%  1.5110us         1  1.5110us  1.5110us  1.5110us  cudaGetDeviceCount
                    0.00%     937ns         1     937ns     937ns     937ns  cuDeviceGetUuid

Thanks.

Thanks. I was hoping for the Visual Profiler tool.
In any case, nvprof reports:

Warning: Unified Memory Profiling is not supported

So can I trust the memory-related metrics, such as gld_efficiency or gld_throughput? Sometimes I see strange values…
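
For reference, a minimal sketch of how those two metrics could be collected on their own, so that any odd values can be inspected per kernel. Going by its wording, the warning concerns Unified Memory profiling specifically. Whether it has any bearing on the global-load metrics is not confirmed in this thread. The application path and the log file name `metrics.log` are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: collect just the two global-load metrics mentioned above.
# APP and the log file name are assumptions; adjust them to your setup.
NVPROF=/usr/local/cuda/bin/nvprof
APP=./matrixMul
METRICS=gld_efficiency,gld_throughput

# --metrics takes a comma-separated list; --log-file redirects nvprof's output.
metrics_cmd() {
    echo "sudo $NVPROF --metrics $METRICS --log-file metrics.log $APP"
}

# Print the command so it can be checked before anything is run.
metrics_cmd

# Run it only if nvprof is installed at the assumed path, then show the metric rows.
if [ -x "$NVPROF" ]; then
    eval "$(metrics_cmd)"
    grep -E 'gld_(efficiency|throughput)' metrics.log
fi
```

Running `nvprof --query-metrics` lists the metrics available on the device, which helps confirm that the names above are supported on the Nano.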