When profiling a Python program on an Ubuntu target, I’m seeing entries in the cuda_kern_exec_trace report that look like duplicates, which makes me suspect I’m not fully understanding the report. For instance, this event from the “CUDA API” events view on thread 68487
![]()
seems to correspond to these two entries in the “CUDA Kernel Launch & Exec Time Trace” report
![]()
The JSON output shows that the timings are slightly different, so I suspect these two entries come from different places:
```json
{
    "API Start (ns)": 3191126520,
    "API Dur (ns)": 37220,
    "Queue Start (ns)": "",
    "Queue Dur (ns)": "",
    "Kernel Start (ns)": 3191156902,
    "Kernel Dur (ns)": 3649,
    "Total Dur (ns)": 37220,
    "PID": 68487,
    "TID": 68487,
    "DevId": 0,
    "API Function": "cudaLaunchKernel",
    "GridXYZ": " 2 1 100",
    "BlockXYZ": " 128 1 1",
    "Kernel Name": "ampere_sgemm_32x32_sliced1x4_tn"
},
{
    "API Start (ns)": 3191126568,
    "API Dur (ns)": 36992,
    "Queue Start (ns)": "",
    "Queue Dur (ns)": "",
    "Kernel Start (ns)": 3191156902,
    "Kernel Dur (ns)": 3649,
    "Total Dur (ns)": 36992,
    "PID": 68487,
    "TID": 68487,
    "DevId": 0,
    "API Function": "cudaLaunchKernel",
    "GridXYZ": " 2 1 100",
    "BlockXYZ": " 128 1 1",
    "Kernel Name": "ampere_sgemm_32x32_sliced1x4_tn"
},
```
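To check how widespread this is, I grouped the exported records by their kernel-side fields: rows that land in the same group share one GPU execution but have distinct API-side timings. A minimal sketch (the inline `records` list stands in for the parsed JSON export above; in practice I load the full export with `json.load`):

```python
from collections import defaultdict

# Excerpt of the exported trace, trimmed to the fields used below.
records = [
    {"API Start (ns)": 3191126520, "API Dur (ns)": 37220,
     "Kernel Start (ns)": 3191156902, "Kernel Dur (ns)": 3649,
     "TID": 68487, "Kernel Name": "ampere_sgemm_32x32_sliced1x4_tn"},
    {"API Start (ns)": 3191126568, "API Dur (ns)": 36992,
     "Kernel Start (ns)": 3191156902, "Kernel Dur (ns)": 3649,
     "TID": 68487, "Kernel Name": "ampere_sgemm_32x32_sliced1x4_tn"},
]

# Group by the kernel-side identity (thread, name, GPU start, GPU duration).
groups = defaultdict(list)
for r in records:
    key = (r["TID"], r["Kernel Name"],
           r["Kernel Start (ns)"], r["Kernel Dur (ns)"])
    groups[key].append(r)

# Report any group with more than one API-side row.
for (tid, name, kstart, kdur), rows in groups.items():
    if len(rows) > 1:
        api_starts = sorted(r["API Start (ns)"] for r in rows)
        print(f"{name}: {len(rows)} API rows for one kernel execution, "
              f"API starts {api_starts}")
```

On my full export this flags many such pairs, always with identical kernel timings but API start times a few dozen nanoseconds apart.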
Is there something I’m accidentally enabling in the data collection that’s creating these duplicates, or is there a way to prevent them?
The flags I used were:
```shell
nsys profile --cuda-event-trace=true \
    --cuda-flush-interval=10 \
    --cuda-graph-trace=graph \
    --cuda-memory-usage=true \
    --cudabacktrace=all \
    --gpu-metrics-devices=all \
    --python-backtrace=cuda \
    --python-sampling=true \
    --python-sampling-frequency=2000 \
    --trace=cuda,nvtx,cublas \
    --trace-fork-before-exec=true
```
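For what it’s worth, my current guess is that the report joins API-side rows against kernel-side rows, and two API rows ending up linked to the same kernel execution would produce exactly this kind of near-duplicate. Here is a toy sketch of that idea against an in-memory SQLite database; the table and column names mimic my reading of the .sqlite export (`CUPTI_ACTIVITY_KIND_RUNTIME`, `CUPTI_ACTIVITY_KIND_KERNEL`, linked by `correlationId`), but the schema is simplified and the data is made up:

```python
import sqlite3

# Toy database mimicking (my understanding of) the nsys .sqlite export:
# runtime API calls and kernel executions linked by correlationId.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE CUPTI_ACTIVITY_KIND_RUNTIME (
        correlationId INTEGER, start INTEGER, "end" INTEGER);
    CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (
        correlationId INTEGER, start INTEGER, "end" INTEGER,
        kernelName TEXT);
""")

# Hypothetical case: two runtime rows carrying the same correlationId.
con.executemany(
    "INSERT INTO CUPTI_ACTIVITY_KIND_RUNTIME VALUES (?, ?, ?)",
    [(1, 3191126520, 3191163740),
     (1, 3191126568, 3191163560)])
con.execute(
    "INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (?, ?, ?, ?)",
    (1, 3191156902, 3191160551, "ampere_sgemm_32x32_sliced1x4_tn"))

# The join produces two report rows for a single kernel execution.
rows = con.execute("""
    SELECT r.start AS api_start, k.start AS kernel_start, k.kernelName
    FROM CUPTI_ACTIVITY_KIND_RUNTIME r
    JOIN CUPTI_ACTIVITY_KIND_KERNEL k USING (correlationId)
    ORDER BY r.start
""").fetchall()
print(rows)
```

If that mental model is right, the question becomes why a single `cudaLaunchKernel` would ever be recorded twice on the runtime side.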
