Can Nsight show timelines of CPU execution without instrumenting code with NVTX?

I'm trying to see how CPU and GPU work overlap, and what the CPU is doing while the GPU is idle. Can Nsight show which functions the CPU executes, or when it is idle, without instrumenting the code with NVTX?

I imagine the resolution would be fairly coarse (one sample every 1 ms or so), but that would be good enough.

Here’s what I tried so far:

  1. Enable CPU profiling in Nsight. This doesn’t do anything.

same problem this guy had

  2. Run nvprof --cpu-profiling on, after compiling with -g (good profilers use DWARF to unwind the call stack) and -fno-omit-frame-pointer (a simpler way to unwind the call stack).

This does show some call stacks, but only as numeric addresses.

I conclude this might be impractical. Sample-based profiling at 1 kHz will probably produce a timeline that’s too coarse and has too much variance to be useful. Instrumentation-based profiling like GCC’s gprof could improve on that, but it would introduce massive overhead and skew the results too much.

In theory, you could construct a timeline using low-overhead, sample-based profiling in the special case where the timeline is periodic (your program runs the same loop over and over, and every iteration has the same timing). The wished-for profiler would combine samples from different iterations to reduce variance.
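
To make the periodic idea concrete, here is a rough sketch of the folding step. It assumes you already have samples as (timestamp, function name) pairs and that you know the loop period; both are assumptions, and fold_samples and dominant_timeline are hypothetical helper names, not part of Nsight or nvprof. Timestamps and the period are integers in the same unit (e.g. microseconds).

```python
from collections import Counter

def fold_samples(samples, period, n_bins):
    """Fold samples from many loop iterations onto a single period.

    samples: iterable of (timestamp, function_name), integer timestamps.
    period:  loop iteration length, same unit as the timestamps (assumed known).
    n_bins:  number of phase bins in the folded timeline.
    Returns a list of Counters, one per phase bin.
    """
    bins = [Counter() for _ in range(n_bins)]
    for timestamp, func in samples:
        phase = timestamp % period            # offset within the current iteration
        idx = phase * n_bins // period        # which phase bin that offset falls in
        bins[idx][func] += 1
    return bins

def dominant_timeline(bins):
    """Pick the most frequently sampled function in each phase bin (None if empty)."""
    return [b.most_common(1)[0][0] if b else None for b in bins]
```

With a sampling interval that is not a multiple of the period, samples from successive iterations land at different phases, so after enough iterations every bin fills up even though each single iteration is sampled only sparsely.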