I was wondering whether I can profile the device subroutines that are called from inside my kernel. I’m programming in CUDA Fortran, and I have a kernel that calls a number of device subroutines, like this:
attributes(global) subroutine kernel1()
  call sub1()
  call sub2()
  call sub3()
end subroutine kernel1

attributes(device) subroutine sub1()
  ! do stuff
end subroutine sub1

attributes(device) subroutine sub2()
  ! do more stuff
end subroutine sub2

etc....
The problem is that when I profile this code with the CUDA Visual Profiler, it doesn’t tell me how much time is spent in the individual subroutines; it only reports the time spent in the kernel as a whole.
Is it possible to profile these individual subroutines without invoking a new kernel for each one?