Magic of nvprof --profile-api-trace none

Hi all.

Why do I get different results when I profile with and without the API trace option?

sudo /usr/local/cuda/bin/nvprof ./hello.exe 
==15147== NVPROF is profiling process 15147, command: ./hello.exe
==15147== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
==15147== Profiling application: ./hello.exe
==15147== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   38.19%  2.3895ms         1  2.3895ms  2.3895ms  2.3895ms  void regular_fft<unsigned int=128, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                   27.02%  1.6902ms         1  1.6902ms  1.6902ms  1.6902ms  void regular_fft<unsigned int=256, unsigned int=16, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                   17.06%  1.0671ms         1  1.0671ms  1.0671ms  1.0671ms  __nv_static_43__30_RealComplex_compute_75_cpp1_ii_b2d354f6__Z24postprocessC2C_kernelMemIjdL9fftAxii_t1EEvP7ComplexIT0_EPKS3_T_15coordDivisors_tIS7_E7coord_tIS7_ESB_S7_S2_10callback_tmb
                    7.80%  488.07us         1  488.07us  488.07us  488.07us  [CUDA memcpy DtoH]
                    5.94%  371.49us         1  371.49us  371.49us  371.49us  myfft_kernel1(creal_T*)
                    3.99%  249.92us         1  249.92us  249.92us  249.92us  [CUDA memcpy HtoD]
sudo /usr/local/cuda/bin/nvprof --profile-api-trace none ./hello.exe 
==15163== NVPROF is profiling process 15163, command: ./hello.exe
==15163== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
==15163== Profiling application: ./hello.exe
==15163== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   38.12%  4.7814ms         1  4.7814ms  4.7814ms  4.7814ms  void regular_fft<unsigned int=128, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                   27.19%  3.4109ms         1  3.4109ms  3.4109ms  3.4109ms  void regular_fft<unsigned int=256, unsigned int=16, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                   17.02%  2.1352ms         1  2.1352ms  2.1352ms  2.1352ms  __nv_static_43__30_RealComplex_compute_75_cpp1_ii_b2d354f6__Z24postprocessC2C_kernelMemIjdL9fftAxii_t1EEvP7ComplexIT0_EPKS3_T_15coordDivisors_tIS7_E7coord_tIS7_ESB_S7_S2_10callback_tmb
                    7.78%  975.51us         1  975.51us  975.51us  975.51us  [CUDA memcpy DtoH]
                    5.92%  742.62us         1  742.62us  742.62us  742.62us  myfft_kernel1(creal_T*)
                    3.98%  498.85us         1  498.85us  498.85us  498.85us  [CUDA memcpy HtoD]
No API activities were profiled.

The documentation states that API profiling adds some overhead, but a factor of two looks like very big overhead.
So, how do I get the real GPU activity times?

Thanks.

OK, if I iterate my function 100 times, I get the following result:

sudo /usr/local/cuda/bin/nvprof --profile-api-trace none ./hello.exe 
==8247== NVPROF is profiling process 8247, command: ./hello.exe
==8247== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
==8247== Profiling application: ./hello.exe
==8247== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   38.08%  251.94ms       100  2.5194ms  2.3354ms  4.7867ms  void regular_fft<unsigned int=128, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                   27.00%  178.59ms       100  1.7859ms  1.6604ms  3.3807ms  void regular_fft<unsigned int=256, unsigned int=16, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                   17.13%  113.31ms       100  1.1331ms  1.0682ms  2.1391ms  __nv_static_43__30_RealComplex_compute_75_cpp1_ii_b2d354f6__Z24postprocessC2C_kernelMemIjdL9fftAxii_t1EEvP7ComplexIT0_EPKS3_T_15coordDivisors_tIS7_E7coord_tIS7_ESB_S7_S2_10callback_tmb
                    7.82%  51.738ms       100  517.38us  488.00us  975.64us  [CUDA memcpy DtoH]
                    5.96%  39.410ms       100  394.10us  369.61us  742.08us  myfft_kernel1(creal_T*)
                    4.01%  26.558ms       100  265.58us  249.55us  498.52us  [CUDA memcpy HtoD]
No API activities were profiled.

As I can see, for some runs I get a significant performance drop (about 2x), and the problem is not in the profiling tool options but in the execution itself.
Where should I dig? Why do some runs show a performance drop?

Thanks for any advice.

OK, now it's clear.
The simpler vector-add example from here, profiled with the --print-gpu-trace option, shows the following:

sudo /usr/local/cuda/bin/nvprof --print-gpu-trace ./add_cuda
==8190== NVPROF is profiling process 8190, command: ./add_cuda
==8190== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
Max error: 69
==8190== Profiling application: ./add_cuda
==8190== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*           Device   Context    Stream  Name
522.77ms  345.47ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [108]
870.99ms  232.64ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [110]
1.10629s  231.12ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [112]
1.33975s  223.68ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [114]
1.56576s  229.36ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [116]
1.79744s  235.94ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [118]
2.03568s  231.14ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [120]
2.26914s  232.31ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [122]
2.50415s  229.44ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [124]
2.73638s  231.36ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [126]

The first call includes the memory-management work; all of the later calls do not, which is why the execution times differ.
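For reference, here is a minimal sketch of the kind of managed-memory vector-add program this trace could come from (the linked post has the real source; the array size, loop count, and initialization below are assumptions on my part):

#include <cuda_runtime.h>

// Element-wise add; the launch shape matches the trace above
// (1 block of 256 threads), so each thread strides over the array.
__global__ void add(int n, float *x, float *y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main()
{
    int N = 1 << 20;          // assumed array size
    float *x, *y;

    // Managed (unified) memory: any one-time memory-management cost
    // tends to land inside the first kernel launch in the GPU trace.
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    for (int iter = 0; iter < 10; iter++)   // 10 launches, as in the trace
        add<<<1, 256>>>(N, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}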

Hi,

A few points:

  1. Enabling the CUDA API trace doesn’t affect kernel or memcpy execution time. It seems you have identified an issue in the execution itself.
    The first kernel execution might take extra time due to GPU warmup, caching, etc. Launching a warmup kernel before timing the kernel can help (see the sketch after this list).

  2. CUDA initializes the CUDA context lazily, which typically means the first CUDA API call in a program incurs the context-initialization overhead. Calling cudaFree(0) as the first CUDA call is a canonical way to force context initialization up front.
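To illustrate both points, here is a minimal sketch that forces context creation with cudaFree(0), does a warmup launch, and then times a steady-state launch with CUDA events (myKernel and the sizes are placeholders, not code from this thread):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    // Point 2: force context creation up front instead of paying for it
    // inside the first "real" CUDA call.
    cudaFree(0);

    int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Point 1: warmup launch absorbs one-time costs (module load, clock
    // ramp-up, caches) so they don't pollute the measurement.
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    // Timed launch, measured with CUDA events on the GPU timeline.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("steady-state kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}

cudaEventElapsedTime measures only the interval between the two recorded events on the GPU, so the number it reports should line up with the steady-state kernel times nvprof shows, independent of any API-trace overhead.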
