Is there a way to profile the device subroutines that are called from inside my kernel? I'm programming in CUDA Fortran, and my kernel calls several device subroutines, like this:
attributes(global) subroutine kernel1()
  call sub1()
  call sub2()
  call sub3()
end subroutine kernel1

attributes(device) subroutine sub1()
  ! do stuff
end subroutine sub1

attributes(device) subroutine sub2()
  ! do more stuff
end subroutine sub2
etc....
The problem is that when I profile this code with the CUDA Visual Profiler, it reports only the time spent in the whole kernel. It tells me nothing about how much time is spent in each individual subroutine.
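To make the goal concrete: the numbers I'm after are roughly what manual instrumentation with the clock64() device intrinsic would give. Below is a rough sketch of that idea, not something I'd call a real solution. The module name timing_sketch, the cycles array, the t0/t1/t2 variables, and the <<<1, 32>>> launch configuration are all made up for illustration, and the sketch assumes clock64() is available in device code with the PGI/NVIDIA compiler. The counts are clock cycles, not seconds, the compiler is free to inline or reorder the calls, and with empty subroutine bodies the counts will be close to zero.

```fortran
module timing_sketch
  use cudafor
  implicit none
contains

  attributes(device) subroutine sub1()
    ! do stuff
  end subroutine sub1

  attributes(device) subroutine sub2()
    ! do more stuff
  end subroutine sub2

  attributes(global) subroutine kernel1(cycles)
    ! Hypothetical output array: elapsed clock cycles per subroutine
    integer(8) :: cycles(2)
    integer(8) :: t0, t1, t2

    ! Read the per-multiprocessor cycle counter around each call
    t0 = clock64()
    call sub1()
    t1 = clock64()
    call sub2()
    t2 = clock64()

    ! Record the counts from a single thread only, to avoid a write race
    if (threadIdx%x == 1 .and. blockIdx%x == 1) then
      cycles(1) = t1 - t0
      cycles(2) = t2 - t1
    end if
  end subroutine kernel1

end module timing_sketch

program main
  use cudafor
  use timing_sketch
  implicit none
  integer(8), device :: cycles_d(2)   ! device copy of the counters
  integer(8) :: cycles(2)             ! host copy of the counters
  integer :: istat

  call kernel1<<<1, 32>>>(cycles_d)
  istat = cudaDeviceSynchronize()

  ! Assignment from a device array to a host array copies the data back
  cycles = cycles_d
  print *, 'sub1 cycles:', cycles(1), ' sub2 cycles:', cycles(2)
end program main
```

This only samples one thread and perturbs the kernel it measures, which is why I'd prefer to get the breakdown from the profiler itself.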
Is it possible to profile these individual subroutines without invoking a new kernel for each one?
Thanks