Possible duplicate entries in `cuda_kern_exec_trace` report

When I’m profiling a Python program on an Ubuntu target, I’m getting some entries in the `cuda_kern_exec_trace` report that seem to be duplicates, which I suspect means I’m not fully understanding the report. For instance, this event from the “CUDA API” events view on thread 68487

[screenshot: event in the “CUDA API” events view]

seems to correspond with these two entries on the “CUDA kernel launch & Exec Time Trace” report

[screenshot: two matching entries in the trace report]

The JSON output shows that the timings are slightly different, so I suspect that these might be coming from different places:

{
  "API Start (ns)": 3191126520,
  "API Dur (ns)": 37220,
  "Queue Start (ns)": "",
  "Queue Dur (ns)": "",
  "Kernel Start (ns)": 3191156902,
  "Kernel Dur (ns)": 3649,
  "Total Dur (ns)": 37220,
  "PID": 68487,
  "TID": 68487,
  "DevId": 0,
  "API Function": "cudaLaunchKernel",
  "GridXYZ": "   2    1  100",
  "BlockXYZ": " 128    1    1",
  "Kernel Name": "ampere_sgemm_32x32_sliced1x4_tn"
},
{
  "API Start (ns)": 3191126568,
  "API Dur (ns)": 36992,
  "Queue Start (ns)": "",
  "Queue Dur (ns)": "",
  "Kernel Start (ns)": 3191156902,
  "Kernel Dur (ns)": 3649,
  "Total Dur (ns)": 36992,
  "PID": 68487,
  "TID": 68487,
  "DevId": 0,
  "API Function": "cudaLaunchKernel",
  "GridXYZ": "   2    1  100",
  "BlockXYZ": " 128    1    1",
  "Kernel Name": "ampere_sgemm_32x32_sliced1x4_tn"
},
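Assuming these really are duplicates, it seems like they could be collapsed on the kernel-side fields, which are identical between the two rows (only the API timings differ slightly). A rough sketch using the field names from the JSON above; `dedup_by_kernel` is just an illustrative helper, not part of any nsys tooling:

```python
# Two report entries with identical kernel timings but slightly
# different API timings (values copied from the JSON above).
entries = [
    {"API Start (ns)": 3191126520, "API Dur (ns)": 37220,
     "Kernel Start (ns)": 3191156902, "Kernel Dur (ns)": 3649,
     "Kernel Name": "ampere_sgemm_32x32_sliced1x4_tn"},
    {"API Start (ns)": 3191126568, "API Dur (ns)": 36992,
     "Kernel Start (ns)": 3191156902, "Kernel Dur (ns)": 3649,
     "Kernel Name": "ampere_sgemm_32x32_sliced1x4_tn"},
]

def dedup_by_kernel(entries):
    """Keep one entry per (kernel start, duration, name) triple.

    Two genuinely distinct launches of the same kernel can't share an
    identical start timestamp on one device, so this key should be safe
    for collapsing rows that describe the same kernel execution.
    """
    seen = {}
    for e in entries:
        key = (e["Kernel Start (ns)"], e["Kernel Dur (ns)"],
               e["Kernel Name"])
        seen.setdefault(key, e)  # first entry for each key wins
    return list(seen.values())

print(len(dedup_by_kernel(entries)))  # 1
```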

Is there something I’m accidentally enabling in the data collection that’s creating these duplicates, or is there a way to prevent them?

The flags I used were:

nsys profile --cuda-event-trace=true \
--cuda-flush-interval=10 \
--cuda-graph-trace=graph \
--cuda-memory-usage=true \
--cudabacktrace=all \
--gpu-metrics-devices=all \
--python-backtrace=cuda \
--python-sampling=true \
--python-sampling-frequency=2000 \
--trace=cuda,nvtx,cublas \
--trace-fork-before-exec=true 

@guysz00 can you respond?

This is a significant issue for me, by the way: I’m comparing approaches to GPU profiling, which involves measuring the time a program spends in kernels fairly precisely. If these entries are duplicates, it’s difficult to automatically distinguish them from genuinely distinct launches of a kernel with the same name and parameters. If they’re not duplicates and should be accounted for separately, I need to revisit the results of the other profilers I’m looking at.

@guysz00 and @dofek

Maybe @liuyis can advise?

Is it possible to share the nsys-rep file for us to take a look?

Of course! Here’s the file

Thanks. That does look suspicious. I can only find one kernel launch API call on the timeline at that timestamp, but there are two entries in the stats view. Both the `cuda_api_trace` and `cuda_kern_exec_trace` reports have two entries for this API call. @jkreibich could you help here?

My suspicion (and hope) was that those were two signals coming from slightly different places that could easily be deduplicated by correcting my flags (which I still think is feasible: they’re not actually perfect duplicates if you read the raw JSON).

There are two runtime events, one for the actual kernel launch, and one for the back-trace. Both events are getting matched to the actual kernel (or, rather, the kernel is getting matched to them) and causing the duplicate kernel entries in the report.

$ nsys export --type=sqlite report5.nsys-rep 
Processing 539128 events: 
[==========================================================================100%]
$ sqlite3 report5.sqlite 
SQLite version 3.40.1 2022-12-28 14:03:47
Enter ".help" for usage hints. 
sqlite> .header on
sqlite> .mode column
sqlite> SELECT * 
   ...> FROM CUPTI_ACTIVITY_KIND_RUNTIME AS r
   ...> LEFT JOIN ENUM_NSYS_EVENT_CLASS AS e
   ...>   ON r.eventClass == e.id
   ...> LEFT JOIN StringIds AS n_str
   ...>   ON r.nameId == n_str.id
   ...> WHERE start >= 3191126520
   ...> LIMIT 2
   ...> ;

start       end         eventClass  globalTid        correlationId  nameId  returnValue  callchainId  id  name                               label           id    value                 
----------  ----------  ----------  ---------------  -------------  ------  -----------  -----------  --  ---------------------------------  --------------  ----  ----------------------
3191126520  3191163740  67          282623997971335  2097           4147    0            18           67  TRACE_PROCESS_EVENT_CUDABACKTRACE  CUDA backtrace  4147  cudaLaunchKernel      
3191126568  3191163560  0           282623997971335  2097           9300    0                         0   TRACE_PROCESS_EVENT_CUDA_RUNTIME   CUDA runtime    9300  cudaLaunchKernel_v7000

It looks like the cuda_kern_exec_trace report is old enough that it assumes any _RUNTIME record that matches a kernel is a launch. Unfortunately, as the report is currently written, there is no way to differentiate between the launch event and the backtrace event other than the matching kernel times.

I will file a bug for this, but as a workaround, find the file <nsys_install_dir>/reports/cuda_kern_exec_trace.py and locate the block that looks like this, near line 100:

LEFT JOIN
    StringIds AS rname
    ON r.nameId == rname.id
ORDER BY 1

Edit the file so it looks like this (add the WHERE… line):

LEFT JOIN
    StringIds AS rname
    ON r.nameId == rname.id
WHERE r.eventClass == 0
ORDER BY 1

I’m not 100% sure there won’t be other side effects, but eventClass zero is “kernel launch”, so it should get rid of the duplicates.

EDIT: Alternatively, if you don’t need the backtraces, turning those off should eliminate the duplicate matches.
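For anyone who wants to sanity-check the filter outside the report script, here’s a minimal sketch against a mock table holding the two runtime rows from the query earlier in the thread (columns reduced to the relevant ones; the real exported schema has many more):

```python
import sqlite3

# Mock of the two CUPTI_ACTIVITY_KIND_RUNTIME rows for correlationId 2097:
# a backtrace record (eventClass 67) and the real launch (eventClass 0).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE CUPTI_ACTIVITY_KIND_RUNTIME (
    start INTEGER, eventClass INTEGER,
    correlationId INTEGER, callchainId INTEGER)""")
con.executemany(
    "INSERT INTO CUPTI_ACTIVITY_KIND_RUNTIME VALUES (?, ?, ?, ?)",
    [(3191126520, 67, 2097, 18),    # TRACE_PROCESS_EVENT_CUDABACKTRACE
     (3191126568, 0, 2097, None)])  # TRACE_PROCESS_EVENT_CUDA_RUNTIME

# With the suggested WHERE clause, only the real launch survives.
rows = con.execute(
    "SELECT start FROM CUPTI_ACTIVITY_KIND_RUNTIME AS r "
    "WHERE r.eventClass == 0").fetchall()
print(rows)  # [(3191126568,)]
```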

This works perfectly for my use case! Super fun issue, I think I should consider doing things in the raw SQL in the future. Thanks so much for the solution!

If this is doing what you want, that’s great, but the eventual fix we ended up going with was not to limit things to eventClass 0, but to exclude call-chain events with the condition `WHERE r.callchainId IS NULL`. That’s the safer fix.
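For reference, here’s the same idea sketched against a mock of the exported table (only the relevant columns): filtering on `callchainId` keeps the real launch without hard-coding an eventClass value, since a call chain is only attached to backtrace records.

```python
import sqlite3

# Two mocked CUPTI_ACTIVITY_KIND_RUNTIME rows for one kernel launch:
# the backtrace record carries a callchainId; the runtime record does not.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE CUPTI_ACTIVITY_KIND_RUNTIME (
    start INTEGER, eventClass INTEGER,
    correlationId INTEGER, callchainId INTEGER)""")
con.executemany(
    "INSERT INTO CUPTI_ACTIVITY_KIND_RUNTIME VALUES (?, ?, ?, ?)",
    [(3191126520, 67, 2097, 18),    # backtrace event, has a call chain
     (3191126568, 0, 2097, None)])  # real launch, no call chain

# The safer filter: drop anything with an attached call chain.
rows = con.execute(
    "SELECT start FROM CUPTI_ACTIVITY_KIND_RUNTIME AS r "
    "WHERE r.callchainId IS NULL").fetchall()
print(rows)  # [(3191126568,)]
```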

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.