We have a custom script that analyzes output of
nvprof --print-api-trace --print-gpu-trace <our application>
We use NVTX ranges to associate GPU work with API calls. Right now all the push/pop range calls and CUDA API calls happen on a single thread, and this has been working great: once a range starts, any API call that happens before the range ends correctly gets associated with the corresponding GPU work.
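For reference, the single-thread pattern we rely on looks roughly like this (a sketch with illustrative names, not our actual code):

```cpp
#include <nvToolsExt.h>    // NVTX v2 API; link with -lnvToolsExt
#include <cuda_runtime.h>

void process_batch(float* d_buf, const float* h_in, size_t n, cudaStream_t s)
{
    // Our analysis script attributes every API call (and the GPU work it
    // triggers) between this push and the matching pop to "process_batch".
    nvtxRangePushA("process_batch");

    cudaMemcpyAsync(d_buf, h_in, n * sizeof(float),
                    cudaMemcpyHostToDevice, s);
    // ... kernel launches and further API calls on the same thread ...

    nvtxRangePop();
}
```

Because everything happens on one thread, the push/pop events and the API-trace records appear in a single, strictly ordered stream in the nvprof output.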
But now we have some work on another thread that we want to profile. Simply adding push/pop ranges on that thread doesn’t work, because there is no longer a strict ordering between start/end events: an end event from one thread may land immediately after a start event from the other thread.
So I was thinking I could use different profiling domains. I added a new domain for the separate thread, but then ran into a new problem. While I can see the domain on the start/end events, the API, GPU, and data-transfer events carry no indication whatsoever of which thread they were issued on. This means we have no way of correctly assigning GPU/API time to the appropriate profiling range.
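The per-thread-domain attempt looks roughly like this (again a sketch; the thread names and helper functions are illustrative):

```cpp
#include <nvToolsExt.h>    // NVTX v2 domain API; link with -lnvToolsExt

// One NVTX domain per worker thread, created at thread startup (sketch).
static thread_local nvtxDomainHandle_t g_domain = nullptr;

void worker_init(const char* thread_name)
{
    g_domain = nvtxDomainCreateA(thread_name);   // e.g. "worker-1"
}

void worker_step()
{
    // Domain ranges require the extended-attributes form of push.
    nvtxEventAttributes_t attr = {};
    attr.version       = NVTX_VERSION;
    attr.size          = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
    attr.messageType   = NVTX_MESSAGE_TYPE_ASCII;
    attr.message.ascii = "worker_step";

    nvtxDomainRangePushEx(g_domain, &attr);
    // ... CUDA API calls / kernel launches on this thread ...
    nvtxDomainRangePop(g_domain);
}
```

The start/end events in the trace do show the domain, but the interleaved API/GPU/memcpy records in between don’t, which is exactly the problem described above.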
I looked through the nvprof command-line options, and the only one that seemed related was --cpu-thread-tracing, but that just traces calls to threading APIs; it doesn’t add any thread information to the rest of the trace data.
Is there any way we can get around this problem?