Hi,
While profiling an application with Nsight Systems, I see that the top kernel (the one that takes the most GPU time) is vectorized_elementwise_kernel:
Time(%)  Total Time (ns)  Instances  Average   Minimum  Maximum  Name
-------  ---------------  ---------  --------  -------  -------  ----
   15.5    5,088,208,806    320,418  15,879.9      807  393,269  void at::native::vectorized_elementwise_kernel<4, at::native::AddFunctor<float>, at::detail::Array<…
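As a sanity check on how I am reading this row: the Average column appears to be Total Time divided by Instances, so the roughly 5.09 s is accumulated over 320,418 separate launches of about 15.9 µs each. A minimal C++ sketch of that arithmetic follows. The function names are my own and hypothetical, not part of Nsight Systems or ATen.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: mean duration of a single kernel launch, in nanoseconds,
// given the accumulated time and the number of launches from the summary row.
double average_ns(std::int64_t total_ns, std::int64_t instances) {
    assert(instances > 0);  // a row with zero launches has no meaningful average
    return static_cast<double>(total_ns) / static_cast<double>(instances);
}

// Hypothetical helper: convert a nanosecond count to seconds.
double ns_to_seconds(std::int64_t ns) {
    return static_cast<double>(ns) / 1e9;
}
```

With the values from the row above, average_ns(5088208806, 320418) comes out at roughly 15,879.9 ns, which matches the Average column.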
This kernel belongs to the ATen package, and its implementation is:
__global__ void vectorized_elementwise_kernel(int N, func_t f, array_t data) {
  using traits = function_traits<func_t>;
  int remaining = N - block_work_size * blockIdx.x;

  if (remaining < block_work_size) {  // if this block handles the reminder, just do a naive unrolled loop
    auto input_calc = TrivialOffsetCalculator<traits::arity>();
    auto output_calc = TrivialOffsetCalculator<1>();
    auto loader = memory::LoadWithoutCast();
    auto storer = memory::StoreWithoutCast();
    auto policy = memory::policies::unroll<array_t, decltype(input_calc), decltype(output_calc),
                                           memory::LoadWithoutCast, memory::StoreWithoutCast>(
      data, remaining, input_calc, output_calc, loader, storer);
    elementwise_kernel_helper(f, policy);
  } else {  // if this block has a full `block_work_size` data to handle, use vectorized memory access
    elementwise_kernel_helper(f, memory::policies::vectorized<vec_size, array_t>(data));
  }
}
As can be seen, this kernel calls several other functions. My guess is that the time the profiler reports for the kernel includes the time spent in every function called inside it. For example, if LoadWithoutCast() took 1 second and elementwise_kernel_helper() took 2 seconds, the profiler would report 3 seconds for vectorized_elementwise_kernel.
Is that correct?
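To make the question concrete, here is a host-side C++ analogy of what I mean by "inclusive" time. It is not GPU code and it does not show how Nsight Systems measures kernels. Every name in it (Timing, load_all, add_all, run_outer) is hypothetical, and the two helpers merely stand in for a load step and a compute step. The outer timer brackets both helpers, so the outer figure can never be smaller than the helper times added together.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical record: start-to-end time of the outer call, plus the time
// measured separately around the helpers it calls.
struct Timing {
    std::int64_t outer_ns;
    std::int64_t helpers_ns;
};

std::int64_t elapsed_ns(Clock::time_point t0, Clock::time_point t1) {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// Hypothetical helper standing in for a load step: copies the input.
void load_all(const std::vector<float>& src, std::vector<float>& dst) {
    dst = src;
}

// Hypothetical helper standing in for the compute step: elementwise add.
void add_all(const std::vector<float>& a, const std::vector<float>& b,
             std::vector<float>& out) {
    out.resize(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = a[i] + b[i];
    }
}

// Times the whole call and, nested inside it, each helper separately.
Timing run_outer(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& out) {
    assert(a.size() == b.size());
    std::vector<float> staged;

    const auto outer_start = Clock::now();
    const auto t0 = Clock::now();
    load_all(a, staged);
    const auto t1 = Clock::now();
    add_all(staged, b, out);
    const auto t2 = Clock::now();
    const auto outer_end = Clock::now();

    return {elapsed_ns(outer_start, outer_end),
            elapsed_ns(t0, t1) + elapsed_ns(t1, t2)};
}
```

In this analogy the single outer number is what I suspect the profiler reports for the kernel, with no separate entries for the helpers.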