Hi all.
Why do I get different results when profiling with and without the API trace option?

Default run:
sudo /usr/local/cuda/bin/nvprof ./hello.exe
==15147== NVPROF is profiling process 15147, command: ./hello.exe
==15147== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
==15147== Profiling application: ./hello.exe
==15147== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 38.19% 2.3895ms 1 2.3895ms 2.3895ms 2.3895ms void regular_fft<unsigned int=128, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
27.02% 1.6902ms 1 1.6902ms 1.6902ms 1.6902ms void regular_fft<unsigned int=256, unsigned int=16, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
17.06% 1.0671ms 1 1.0671ms 1.0671ms 1.0671ms __nv_static_43__30_RealComplex_compute_75_cpp1_ii_b2d354f6__Z24postprocessC2C_kernelMemIjdL9fftAxii_t1EEvP7ComplexIT0_EPKS3_T_15coordDivisors_tIS7_E7coord_tIS7_ESB_S7_S2_10callback_tmb
7.80% 488.07us 1 488.07us 488.07us 488.07us [CUDA memcpy DtoH]
5.94% 371.49us 1 371.49us 371.49us 371.49us myfft_kernel1(creal_T*)
3.99% 249.92us 1 249.92us 249.92us 249.92us [CUDA memcpy HtoD]

Run with API trace disabled:
sudo /usr/local/cuda/bin/nvprof --profile-api-trace none ./hello.exe
==15163== NVPROF is profiling process 15163, command: ./hello.exe
==15163== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
==15163== Profiling application: ./hello.exe
==15163== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 38.12% 4.7814ms 1 4.7814ms 4.7814ms 4.7814ms void regular_fft<unsigned int=128, unsigned int=8, unsigned int=16, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
27.19% 3.4109ms 1 3.4109ms 3.4109ms 3.4109ms void regular_fft<unsigned int=256, unsigned int=16, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>)
17.02% 2.1352ms 1 2.1352ms 2.1352ms 2.1352ms __nv_static_43__30_RealComplex_compute_75_cpp1_ii_b2d354f6__Z24postprocessC2C_kernelMemIjdL9fftAxii_t1EEvP7ComplexIT0_EPKS3_T_15coordDivisors_tIS7_E7coord_tIS7_ESB_S7_S2_10callback_tmb
7.78% 975.51us 1 975.51us 975.51us 975.51us [CUDA memcpy DtoH]
5.92% 742.62us 1 742.62us 742.62us 742.62us myfft_kernel1(creal_T*)
3.98% 498.85us 1 498.85us 498.85us 498.85us [CUDA memcpy HtoD]
No API activities were profiled.
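
The documentation states that API profiling adds some overhead, but here every GPU activity is reported as taking about twice as long in the run with API tracing disabled, which looks like far too large a difference.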
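
To quantify the gap, here is a quick check of the per-row ratios, using the times copied from the two outputs above. It is a throwaway Python sketch for the arithmetic only; the shortened labels and variable names are mine and are not nvprof identifiers.

```python
# GPU activity times copied by hand from the two nvprof outputs above,
# converted to microseconds. Labels are shortened for readability.
default_run = {
    "regular_fft<128,8,16>": 2389.5,
    "regular_fft<256,16,8>": 1690.2,
    "postprocessC2C_kernelMem": 1067.1,
    "memcpy DtoH": 488.07,
    "myfft_kernel1": 371.49,
    "memcpy HtoD": 249.92,
}
no_api_trace = {
    "regular_fft<128,8,16>": 4781.4,
    "regular_fft<256,16,8>": 3410.9,
    "postprocessC2C_kernelMem": 2135.2,
    "memcpy DtoH": 975.51,
    "myfft_kernel1": 742.62,
    "memcpy HtoD": 498.85,
}

# Ratio of the time reported without API tracing to the time reported by default.
ratios = {name: no_api_trace[name] / default_run[name] for name in default_run}

for name, ratio in ratios.items():
    print(f"{name:26s} {ratio:.3f}")
```

Every ratio comes out close to 2, for the kernels and the memcpys alike, so the difference is uniform across all GPU activities rather than concentrated in one of them.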
So, how can I get the real GPU activity times?
Thanks.