Visual Profiler on Jetson Nano

rreddy78 · November 27, 2020, 2:50pm

Hi all,

I would like to know if I can get a standalone version of the Visual Profiler which I can run on the Jetson Nano.

I don't have a separate host system for development. The Jetson Nano is both my host and my target.

I would like to import and analyze the metrics reported by nvprof.

   --analysis-metrics
                    Collect profiling data that can be imported to Visual Profiler's
                    "analysis" mode. Note: Use "--export-profile" to specify
                    an export file.

Thanks

rreddy78 · November 27, 2020, 3:27pm

I have run into another problem:

The command freezes the system:

sudo /usr/local/cuda/bin/nvprof --export-profile profile.txt --analysis-metrics ./matrix_mul_gen_tiled6

So I cannot even collect performance metrics?

AastaLLL · December 2, 2020, 2:47am

Hi,

You can use nvprof or nsys on the Jetson platform directly.
For example, with the matrixMul sample from our CUDA toolkit:

$ /usr/local/cuda-10.2/bin/cuda-install-samples-10.2.sh .
$ cd NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul
$ make

$ sudo /usr/local/cuda-10.2/bin/nvprof ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
==16817== NVPROF is profiling process 16817, command: ./matrixMul
==16817== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
GPU Device 0: "Maxwell" with compute capability 5.3

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 29.65 GFlop/s, Time= 4.420 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performancemeasurements. Results may vary when GPU Boost is enabled.
==16817== Profiling application: ./matrixMul
==16817== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.98%  1.32856s       301  4.4138ms  4.3783ms  4.4903ms  void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
                    0.01%  128.87us         2  64.433us  45.525us  83.341us  [CUDA memcpy HtoD]
                    0.01%  88.444us         1  88.444us  88.444us  88.444us  [CUDA memcpy DtoH]
      API calls:   72.87%  1.31418s         1  1.31418s  1.31418s  1.31418s  cudaEventSynchronize
                   26.02%  469.30ms         3  156.43ms  979.34us  467.34ms  cudaMalloc
                    0.69%  12.407ms       301  41.219us  31.094us  1.0969ms  cudaLaunchKernel
                    0.26%  4.6541ms         2  2.3271ms  14.636us  4.6395ms  cudaStreamSynchronize
                    0.12%  2.1583ms         3  719.42us  248.55us  1.4109ms  cudaMemcpyAsync
                    0.03%  480.89us         3  160.30us  133.80us  209.22us  cudaFree
                    0.01%  105.53us        97  1.0870us     572ns  24.949us  cuDeviceGetAttribute
                    0.00%  62.137us         2  31.068us  25.105us  37.032us  cudaEventRecord
                    0.00%  34.064us         1  34.064us  34.064us  34.064us  cudaStreamCreateWithFlags
                    0.00%  27.761us         7  3.9650us  1.6670us  15.312us  cudaDeviceGetAttribute
                    0.00%  13.125us         2  6.5620us  4.1150us  9.0100us  cudaEventCreate
                    0.00%  10.156us         2  5.0780us  3.1250us  7.0310us  cudaEventDestroy
                    0.00%  9.8440us         1  9.8440us  9.8440us  9.8440us  cudaEventElapsedTime
                    0.00%  9.0110us         1  9.0110us  9.0110us  9.0110us  cuDeviceTotalMem
                    0.00%  8.0730us         1  8.0730us  8.0730us  8.0730us  cudaSetDevice
                    0.00%  6.7720us         3  2.2570us  1.2500us  3.4380us  cuDeviceGetCount
                    0.00%  3.3850us         2  1.6920us  1.0410us  2.3440us  cuDeviceGet
                    0.00%  2.3960us         1  2.3960us  2.3960us  2.3960us  cuDeviceGetName
                    0.00%  1.5110us         1  1.5110us  1.5110us  1.5110us  cudaGetDeviceCount
                    0.00%     937ns         1     937ns     937ns     937ns  cuDeviceGetUuid
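
The summary above is printed to the terminal. As a minimal sketch, the same run could instead be saved to a file with `--export-profile` (the option quoted in the first post) so it can be opened later in a Visual Profiler installed elsewhere. The nvprof path and the matrixMul binary are taken from this thread; the output name `matrixMul.nvvp` is only an illustrative assumption.

```shell
#!/bin/sh
# Sketch: save an nvprof profile to a file for later import.
# NVPROF and APP come from the thread; OUT is a hypothetical file name.
NVPROF=${NVPROF:-/usr/local/cuda-10.2/bin/nvprof}
APP=${APP:-./matrixMul}
OUT=${OUT:-matrixMul.nvvp}

if [ -x "$NVPROF" ] && [ -x "$APP" ]; then
    # --export-profile writes the profile to OUT; -f overwrites an existing file.
    "$NVPROF" --export-profile "$OUT" -f "$APP"
else
    # Profiler or sample not found: only show the command that would be run.
    echo "would run: $NVPROF --export-profile $OUT -f $APP"
fi
```

The commands in this thread were run with sudo, so the script may need the same when run on the Jetson.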

Thanks.

rreddy78 · December 7, 2020, 4:09pm

Thanks. I was hoping for the Visual Profiler tool.
But in any case, the nvprof tool says:

Warning: Unified Memory Profiling is not supported

So can I trust the memory-related metrics, such as gld_efficiency or gld_throughput? Sometimes I see strange values…
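
For reference, a minimal sketch of how just those two metrics could be collected with nvprof's `--metrics` option, rather than the full `--analysis-metrics` set. The metric names are the ones mentioned in the post. `nvprof --query-metrics` lists what the device actually exposes, so the names here are an assumption to check against that list. The sketch only shows the collection step and does not settle whether the reported values are reliable.

```shell
#!/bin/sh
# Sketch: collect only selected global-load metrics with nvprof.
# NVPROF and APP come from the thread; METRICS uses the names from the post.
NVPROF=${NVPROF:-/usr/local/cuda-10.2/bin/nvprof}
APP=${APP:-./matrixMul}
METRICS=${METRICS:-gld_efficiency,gld_throughput}

if [ -x "$NVPROF" ] && [ -x "$APP" ]; then
    # --metrics takes a comma-separated list of metric names.
    "$NVPROF" --metrics "$METRICS" "$APP"
else
    # Profiler or sample not found: only show the command that would be run.
    echo "would run: $NVPROF --metrics $METRICS $APP"
fi
```

As with the other commands in this thread, this may need to be run with sudo on the Jetson.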
