Hello! I added the nvtx Python package to my program and profiled it with Nsight Systems to get a performance breakdown. However, as shown below, the NVTX range markers only appear in the CPU thread rows. I see no NVTX range markers in the CUDA HW rows, although they sometimes do appear there (in the single-thread case). Do you have any suggestions?
ours.nsys-rep (1.3 MB)
origin.nsys-rep (2.8 MB)