Nsight Systems shows CUDA events only in the CUDA API row

Running a simple PyTorch program:

with torch.cuda.stream(user_stream_1):
    start_event.record(user_stream_1)
    r = dist.all_to_all_single(dst, src, group=my_pg_1, async_op=True)
    r.wait()
    end_event.record(user_stream_1)

And recording an Nsight Systems profile with:

nsys profile --trace=cuda,nvtx,mpi,ucx --stats=false -w true -o profile-%p.log python program.py

I am able to see the calls to cudaEventRecord in the CUDA API line of the Nsight report. However, I don’t see the actual events being recorded on the stream itself, in the CUDA HW section.

If you look at that screenshot, see how there is a teal bar on the line with stream22? Open up that stream and zoom in and see if you can find the work there. The teal bar indicates that there is activity on that line related to the current correlation, which may or may not currently be visible.

Hey,

When I select an NCCL kernel call in the CUDA API line, like ncclDevKernel_SendRecv here, it does indeed highlight the associated kernel in the GPU stream.

However, this doesn’t reproduce with CUDA events: I can only see the CPU call to cudaEventRecord.

Note: this is true for every cudaEventRecord in the list; this one is just an example.

@liuyis thoughts?

Hi @eshukrun, to trace the device-side CUDA Event completions, you’ll need to add the --cuda-event-trace=true option.

Note that this feature is known to increase the chance of false dependencies among CUDA streams and might cause deadlocks in NCCL applications, so if the app behaves strangely you’ll have to disable it.
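
For reference, here is a minimal sketch of the command from the original post with the suggested option added. The CUDA_EVENT_TRACE and NSYS_OPTS variables are only illustrative names, and the check for nsys on PATH is an added convenience. The program name and all other options are assumed to be unchanged from the original command.

```shell
# Set to "false" if the application hangs or behaves strangely with event tracing enabled.
CUDA_EVENT_TRACE=true

# Same options as the original command, plus --cuda-event-trace (illustrative variable name).
NSYS_OPTS="--trace=cuda,nvtx,mpi,ucx --cuda-event-trace=${CUDA_EVENT_TRACE} --stats=false -w true -o profile-%p.log"

if command -v nsys >/dev/null 2>&1; then
    # NSYS_OPTS is intentionally unquoted so that it splits into separate arguments.
    nsys profile $NSYS_OPTS python program.py
else
    echo "nsys not found on PATH"
fi
```

If the NCCL collectives stall with the option enabled, setting CUDA_EVENT_TRACE=false brings back the original behaviour, where only the host-side cudaEventRecord calls appear.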