Can I profile individual subroutines inside a kernel using the CUDA Visual Profiler?

I was wondering whether I can profile the device subroutines that are called inside my kernel. I’m programming in CUDA Fortran, and my kernel calls a number of device subroutines, like this:

attributes(global) subroutine kernel1()
  call sub1()
  call sub2()
  call sub3()
end subroutine kernel1

attributes(device) subroutine sub1()
  ! do stuff
end subroutine sub1

attributes(device) subroutine sub2()
  ! do more stuff
end subroutine sub2

attributes(device) subroutine sub3()
  ! do even more stuff (same pattern as sub1 and sub2)
end subroutine sub3


The problem is that when I profile this code with the CUDA Visual Profiler, it reports only the total time spent in the whole kernel and tells me nothing about how much time is spent in the individual subroutines.

Is it possible to profile these individual subroutines without launching a separate kernel for each one?