Hi all,
I would like to know if I can get a standalone version of the Visual Profiler which I can run on the Jetson Nano.
I don't have a separate host system for development; the Jetson Nano is both my host and my target.
I would like to import and analyze the metrics reported by nvprof. Its help text describes this option:
--analysis-metrics
Collect profiling data that can be imported to Visual Profiler's
"analysis" mode. Note: Use "--export-profile" to specify
an export file.
Thanks
I ran into another problem.
This command freezes the system:
sudo /usr/local/cuda/bin/nvprof --export-profile profile.txt --analysis-metrics ./matrix_mul_gen_tiled6
So I cannot even collect performance metrics?
Hi,
You can use nvprof or nsys on the Jetson platform directly.
For example, with the matrixMul sample from the CUDA toolkit:
$ /usr/local/cuda-10.2/bin/cuda-install-samples-10.2.sh .
$ cd NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul
$ make
$ sudo /usr/local/cuda-10.2/bin/nvprof ./matrixMul
[Matrix Multiply Using CUDA] - Starting...
==16817== NVPROF is profiling process 16817, command: ./matrixMul
==16817== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
GPU Device 0: "Maxwell" with compute capability 5.3
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 29.65 GFlop/s, Time= 4.420 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performancemeasurements. Results may vary when GPU Boost is enabled.
==16817== Profiling application: ./matrixMul
==16817== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   99.98%  1.32856s    301  4.4138ms  4.3783ms  4.4903ms  void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
                    0.01%  128.87us      2  64.433us  45.525us  83.341us  [CUDA memcpy HtoD]
                    0.01%  88.444us      1  88.444us  88.444us  88.444us  [CUDA memcpy DtoH]
      API calls:   72.87%  1.31418s      1  1.31418s  1.31418s  1.31418s  cudaEventSynchronize
                   26.02%  469.30ms      3  156.43ms  979.34us  467.34ms  cudaMalloc
                    0.69%  12.407ms    301  41.219us  31.094us  1.0969ms  cudaLaunchKernel
                    0.26%  4.6541ms      2  2.3271ms  14.636us  4.6395ms  cudaStreamSynchronize
                    0.12%  2.1583ms      3  719.42us  248.55us  1.4109ms  cudaMemcpyAsync
                    0.03%  480.89us      3  160.30us  133.80us  209.22us  cudaFree
                    0.01%  105.53us     97  1.0870us     572ns  24.949us  cuDeviceGetAttribute
                    0.00%  62.137us      2  31.068us  25.105us  37.032us  cudaEventRecord
                    0.00%  34.064us      1  34.064us  34.064us  34.064us  cudaStreamCreateWithFlags
                    0.00%  27.761us      7  3.9650us  1.6670us  15.312us  cudaDeviceGetAttribute
                    0.00%  13.125us      2  6.5620us  4.1150us  9.0100us  cudaEventCreate
                    0.00%  10.156us      2  5.0780us  3.1250us  7.0310us  cudaEventDestroy
                    0.00%  9.8440us      1  9.8440us  9.8440us  9.8440us  cudaEventElapsedTime
                    0.00%  9.0110us      1  9.0110us  9.0110us  9.0110us  cuDeviceTotalMem
                    0.00%  8.0730us      1  8.0730us  8.0730us  8.0730us  cudaSetDevice
                    0.00%  6.7720us      3  2.2570us  1.2500us  3.4380us  cuDeviceGetCount
                    0.00%  3.3850us      2  1.6920us  1.0410us  2.3440us  cuDeviceGet
                    0.00%  2.3960us      1  2.3960us  2.3960us  2.3960us  cuDeviceGetName
                    0.00%  1.5110us      1  1.5110us  1.5110us  1.5110us  cudaGetDeviceCount
                    0.00%     937ns      1     937ns     937ns     937ns  cuDeviceGetUuid
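
To see how the figures in this output relate to each other, here is a minimal Python sketch that cross-checks them. It is not part of nvprof or the CUDA samples: the helper name to_seconds is made up for illustration, and the operation count assumes the usual 2 operations (multiply and add) per inner-product term for the 320x320 by 640x320 matrices shown above.

```python
# Cross-check figures copied from the matrixMul / nvprof output above.
# to_seconds is a hypothetical helper, not an nvprof or CUDA API.

UNITS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(text):
    """Convert an nvprof-style duration such as '4.4138ms' to seconds."""
    for suffix in ("ns", "us", "ms", "s"):  # try two-letter units before plain 's'
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * UNITS[suffix]
    raise ValueError(f"unrecognized duration: {text!r}")


# Kernel row: total time divided by call count should match the Avg column.
kernel_total = to_seconds("1.32856s")
kernel_calls = 301
kernel_avg = kernel_total / kernel_calls

# The sample's own report: "Size= 131072000 Ops" in "Time= 4.420 msec".
# Assumption: 2 ops (multiply + add) per term, dimensions 320 x 320 x 640.
ops = 2 * 320 * 320 * 640
gflops = ops / to_seconds("4.420ms") / 1e9

print(f"avg kernel time: {kernel_avg * 1e3:.4f} ms")
print(f"ops per launch:  {ops}")
print(f"throughput:      {gflops:.2f} GFlop/s")
```

Under these assumptions the average comes out at about 4.4138 ms per launch, which matches the Avg column and is close to the 4.420 msec the sample measures itself with CUDA events, and the throughput works out to the reported 29.65 GFlop/s.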
Thanks.
Thanks. I was hoping for the Visual Profiler tool.
But in any case, nvprof prints this warning:
Warning: Unified Memory Profiling is not supported
So can I trust the memory-related metrics such as gld_efficiency or gld_throughput? Sometimes I see strange values…