About nsys trace report

Hi,
While profiling an application with Nsight Systems, I see that the top kernel (the one with the largest share of GPU time, 15.5%) is vectorized_elementwise_kernel.

 Time(%)  Total Time (ns)  Instances    Average     Minimum    Maximum                                                   Name 
 -------  ---------------  ---------  -----------  ---------  ---------  ----------------------------------------------------------------------------------------------------
    15.5    5,088,208,806    320,418     15,879.9        807    393,269  void at::native::vectorized_elementwise_kernel<4, at::native::AddFunctor<float>, at::detail::Array<…
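
As a sanity check on how I read this row, here is a small host-side C++ sketch. It rests on an assumption of mine: each of the 320,418 instances is one kernel launch with its own duration, and the row aggregates those durations. The names KernelSummary, summarize and implied_average_ns are made up for illustration and are not part of nsys.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical summary row, mirroring the columns of the report above.
struct KernelSummary {
    std::uint64_t total_ns = 0;   // "Total Time (ns)"
    std::uint64_t instances = 0;  // "Instances"
    double average_ns = 0.0;      // "Average"
    std::uint64_t min_ns = 0;     // "Minimum"
    std::uint64_t max_ns = 0;     // "Maximum"
};

// Aggregate the per-launch durations (in ns) of one kernel name into one row.
// Assumption: one entry in launch_ns corresponds to one kernel launch.
KernelSummary summarize(const std::vector<std::uint64_t>& launch_ns) {
    KernelSummary s;
    if (launch_ns.empty()) return s;
    s.instances = launch_ns.size();
    s.min_ns = launch_ns.front();
    s.max_ns = launch_ns.front();
    for (std::uint64_t d : launch_ns) {
        s.total_ns += d;
        s.min_ns = std::min(s.min_ns, d);
        s.max_ns = std::max(s.max_ns, d);
    }
    s.average_ns = static_cast<double>(s.total_ns) /
                   static_cast<double>(s.instances);
    return s;
}

// Average implied by a reported total and instance count.
double implied_average_ns(std::uint64_t total_ns, std::uint64_t instances) {
    return instances == 0
               ? 0.0
               : static_cast<double>(total_ns) / static_cast<double>(instances);
}
```

Calling implied_average_ns(5088208806ULL, 320418) gives roughly 15,879.9 ns, which matches the Average column, so at least Total Time / Instances = Average holds for this row.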

This kernel belongs to the ATen package, and its implementation is:

    __global__ void vectorized_elementwise_kernel(int N, func_t f, array_t data) {
      using traits = function_traits<func_t>;
      int remaining = N - block_work_size * blockIdx.x;

      if (remaining < block_work_size) {  // if this block handles the reminder, just do a naive unrolled loop
        auto input_calc = TrivialOffsetCalculator<traits::arity>();
        auto output_calc = TrivialOffsetCalculator<1>();
        auto loader = memory::LoadWithoutCast();
        auto storer = memory::StoreWithoutCast();
        auto policy = memory::policies::unroll<array_t, decltype(input_calc), decltype(output_calc),
                                               memory::LoadWithoutCast, memory::StoreWithoutCast>(
          data, remaining, input_calc, output_calc, loader, storer);
        elementwise_kernel_helper(f, policy);
      } else {  // if this block has a full `block_work_size` data to handle, use vectorized memory access
        elementwise_kernel_helper(f, memory::policies::vectorized<vec_size, array_t>(data));
      }
    }

As can be seen, this kernel calls several other functions. My guess is that the profiler sums the time of all the functions called inside the kernel. For example, if LoadWithoutCast() takes 1 second and elementwise_kernel_helper() takes 2 seconds, then the profiler reports 3 seconds for vectorized_elementwise_kernel.
Is that correct?