Magic of nvprof --profile-api-trace none

Hi all.

Why do I get different results when I profile with and without the API trace option?

sudo /usr/local/cuda/bin/nvprof ./hello.exe 
==15147== NVPROF is profiling process 15147, command: ./hello.exe
==15147== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
==15147== Profiling application: ./hello.exe
==15147== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   38.19%  2.3895ms         1  2.3895ms  2.3895ms  2.3895ms  void regular_fft<unsigned int=128, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                   27.02%  1.6902ms         1  1.6902ms  1.6902ms  1.6902ms  void regular_fft<unsigned int=256, unsigned int=16, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                   17.06%  1.0671ms         1  1.0671ms  1.0671ms  1.0671ms  __nv_static_43__30_RealComplex_compute_75_cpp1_ii_b2d354f6__Z24postprocessC2C_kernelMemIjdL9fftAxii_t1EEvP7ComplexIT0_EPKS3_T_15coordDivisors_tIS7_E7coord_tIS7_ESB_S7_S2_10callback_tmb
                    7.80%  488.07us         1  488.07us  488.07us  488.07us  [CUDA memcpy DtoH]
                    5.94%  371.49us         1  371.49us  371.49us  371.49us  myfft_kernel1(creal_T*)
                    3.99%  249.92us         1  249.92us  249.92us  249.92us  [CUDA memcpy HtoD]
sudo /usr/local/cuda/bin/nvprof --profile-api-trace none ./hello.exe 
==15163== NVPROF is profiling process 15163, command: ./hello.exe
==15163== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
==15163== Profiling application: ./hello.exe
==15163== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   38.12%  4.7814ms         1  4.7814ms  4.7814ms  4.7814ms  void regular_fft<unsigned int=128, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                   27.19%  3.4109ms         1  3.4109ms  3.4109ms  3.4109ms  void regular_fft<unsigned int=256, unsigned int=16, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                   17.02%  2.1352ms         1  2.1352ms  2.1352ms  2.1352ms  __nv_static_43__30_RealComplex_compute_75_cpp1_ii_b2d354f6__Z24postprocessC2C_kernelMemIjdL9fftAxii_t1EEvP7ComplexIT0_EPKS3_T_15coordDivisors_tIS7_E7coord_tIS7_ESB_S7_S2_10callback_tmb
                    7.78%  975.51us         1  975.51us  975.51us  975.51us  [CUDA memcpy DtoH]
                    5.92%  742.62us         1  742.62us  742.62us  742.62us  myfft_kernel1(creal_T*)
                    3.98%  498.85us         1  498.85us  498.85us  498.85us  [CUDA memcpy HtoD]
No API activities were profiled.

The documentation states that API profiling adds some overhead, but a factor of two looks like very big overhead.
So, how do I get the real GPU activity times?

Thanks.

OK, if I iterate my function 100 times, I get the following result:

sudo /usr/local/cuda/bin/nvprof --profile-api-trace none ./hello.exe 
==8247== NVPROF is profiling process 8247, command: ./hello.exe
==8247== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
==8247== Profiling application: ./hello.exe
==8247== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   38.08%  251.94ms       100  2.5194ms  2.3354ms  4.7867ms  void regular_fft<unsigned int=128, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                   27.00%  178.59ms       100  1.7859ms  1.6604ms  3.3807ms  void regular_fft<unsigned int=256, unsigned int=16, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
                   17.13%  113.31ms       100  1.1331ms  1.0682ms  2.1391ms  __nv_static_43__30_RealComplex_compute_75_cpp1_ii_b2d354f6__Z24postprocessC2C_kernelMemIjdL9fftAxii_t1EEvP7ComplexIT0_EPKS3_T_15coordDivisors_tIS7_E7coord_tIS7_ESB_S7_S2_10callback_tmb
                    7.82%  51.738ms       100  517.38us  488.00us  975.64us  [CUDA memcpy DtoH]
                    5.96%  39.410ms       100  394.10us  369.61us  742.08us  myfft_kernel1(creal_T*)
                    4.01%  26.558ms       100  265.58us  249.55us  498.52us  [CUDA memcpy HtoD]
No API activities were profiled.

As I can see, for some runs I get a significant performance drop (about 2x), and the problem is not in the profiling tool options but in the execution itself.
Where should I dig? Why do some runs show a performance drop?

Thanks for any advice.

OK, now it's clear.
The simpler vector-add example from here, profiled with the --print-gpu-trace option, shows the following:

sudo /usr/local/cuda/bin/nvprof --print-gpu-trace ./add_cuda
==8190== NVPROF is profiling process 8190, command: ./add_cuda
==8190== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
Max error: 69
==8190== Profiling application: ./add_cuda
==8190== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*           Device   Context    Stream  Name
522.77ms  345.47ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [108]
870.99ms  232.64ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [110]
1.10629s  231.12ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [112]
1.33975s  223.68ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [114]
1.56576s  229.36ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [116]
1.79744s  235.94ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [118]
2.03568s  231.14ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [120]
2.26914s  232.31ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [122]
2.50415s  229.44ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [124]
2.73638s  231.36ms              (1 1 1)       (256 1 1)        13        0B        0B  NVIDIA Tegra X1         1         7  add(int, float*, float*) [126]

The first call includes the memory-management work; all of the later calls do not, which is why the execution times differ.
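For reference, here is a minimal sketch of the kind of managed-memory vector-add program this trace could come from (the linked post has the real source; the array size, loop count, and initialization below are assumptions on my part):

#include <cuda_runtime.h>

// Element-wise add; the launch shape matches the trace above
// (1 block of 256 threads), so each thread strides over the array.
__global__ void add(int n, float *x, float *y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main()
{
    int N = 1 << 20;          // assumed array size
    float *x, *y;

    // Managed (unified) memory: any one-time memory-management cost
    // tends to land inside the first kernel launch in the GPU trace.
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    for (int iter = 0; iter < 10; iter++)   // 10 launches, as in the trace
        add<<<1, 256>>>(N, x, y);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}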

Hi,

A few points:

  1. Enabling the CUDA API trace doesn’t affect kernel or memcpy execution time. It seems you have identified an issue in the execution itself.
    The first kernel execution might take extra time due to GPU warmup, caching, etc. Launching a warmup kernel before timing the kernel can help (see the sketch after this list).

  2. CUDA initializes the CUDA context lazily, which typically means the first CUDA API call in a program incurs the context-initialization overhead. Calling cudaFree(0) as the first CUDA call is a canonical way to force context initialization up front.
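To illustrate both points, here is a minimal sketch that forces context creation with cudaFree(0), does a warmup launch, and then times a steady-state launch with CUDA events (myKernel and the sizes are placeholders, not code from this thread):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    // Point 2: force context creation up front instead of paying for it
    // inside the first "real" CUDA call.
    cudaFree(0);

    int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Point 1: warmup launch absorbs one-time costs (module load, clock
    // ramp-up, caches) so they don't pollute the measurement.
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    // Timed launch, measured with CUDA events on the GPU timeline.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("steady-state kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}

cudaEventElapsedTime measures only the interval between the two recorded events on the GPU, so the number it reports should line up with the steady-state kernel times nvprof shows, independent of any API-trace overhead.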
