The CUDA profiler is very helpful, but is there a way to get more detailed information from it? It shows each kernel's name, the GPU time spent in it, the number of loads/stores, and so on. Is there a way to find out how much time was spent in specific parts of a function, or in subroutines called from it? Without that, deciding which parts of the code inside the function actually need optimizing is mostly guesswork.
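To illustrate the kind of per-section breakdown I mean, here is a minimal sketch of doing it by hand with the `clock64()` device function. The kernel `twoPhaseKernel`, its two phases, and the `cycles` buffer are all hypothetical, made up for this example. It assumes a device of compute capability 2.0 or higher, and the counts are raw SM clock cycles seen by one thread per block, so they are only a rough guide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with two phases. Thread 0 of each block records
// how many SM clock cycles it spent in each phase.
__global__ void twoPhaseKernel(float *data, int n, long long *cycles)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    long long t0 = clock64();
    if (i < n) data[i] = data[i] * 2.0f;         // phase A
    __syncthreads();                             // wait for the whole block
    long long t1 = clock64();
    if (i < n) data[i] = sqrtf(data[i]) + 1.0f;  // phase B
    __syncthreads();
    long long t2 = clock64();

    if (threadIdx.x == 0) {
        cycles[2 * blockIdx.x]     = t1 - t0;    // cycles in phase A
        cycles[2 * blockIdx.x + 1] = t2 - t1;    // cycles in phase B
    }
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *d_data = nullptr;
    long long *d_cycles = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_cycles, 2 * blocks * sizeof(long long));
    cudaMemset(d_data, 0, n * sizeof(float));

    twoPhaseKernel<<<blocks, threads>>>(d_data, n, d_cycles);
    cudaDeviceSynchronize();

    // Copy the per-block cycle counts back and average them on the host.
    long long *h_cycles = new long long[2 * blocks];
    cudaMemcpy(h_cycles, d_cycles, 2 * blocks * sizeof(long long),
               cudaMemcpyDeviceToHost);

    long long sumA = 0, sumB = 0;
    for (int b = 0; b < blocks; ++b) {
        sumA += h_cycles[2 * b];
        sumB += h_cycles[2 * b + 1];
    }
    printf("avg cycles per block: phase A = %lld, phase B = %lld\n",
           sumA / blocks, sumB / blocks);

    delete[] h_cycles;
    cudaFree(d_data);
    cudaFree(d_cycles);
    return 0;
}
```

This works, but it means editing every kernel by hand, and the extra `__syncthreads()` calls and timer reads change the timing being measured, which is why I am hoping the profiler can report something similar directly.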
Thanks in advance.