Device-side CUDA Event completion with multiple CUDA streams

Hi, I’m trying to trace an application with multiple CUDA streams. Running with the command line

mpirun -np xx ... nsys profile -t cuda,mpi ./app

gives me the following warning in the output file:

WARNING: Device-side CUDA Event completion trace is currently enabled.
         This may increase runtime overhead and the likelihood of false
         dependencies across CUDA Streams. If you wish to avoid this, please
         disable the feature with --cuda-event-trace=false.

What does device-side event completion mean and why would this cause false dependencies across streams?

I already notice a very large difference in runtime when traced with this feature enabled.

I’m going to have @liuyis chime in with specifics.

This warning is due to a corner case in the event completion trace, and is almost certainly not relevant to your run.

Hi @souza, the “device-side event completion trace” relates to cudaEventRecord() calls. When cudaEventRecord() is called, the event captures the contents of the CUDA stream at the time of the call, i.e. all work submitted to that stream so far. This trace feature captures the moment when all of that recorded work finishes running on the stream (a.k.a. CUDA event completion) and shows a marker on the corresponding stream row.

The underlying trace mechanism is similar to the one used for CUDA events’ own timing functionality, i.e. what is active when the cudaEventDisableTiming flag is NOT set at event creation (cudaEventCreate(), or cudaEventCreateWithFlags() without that flag). This mechanism can sometimes cause false dependencies among CUDA streams.

If you notice a significant runtime change with it enabled, please disable the feature with --cuda-event-trace=false.
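
To make the cudaEventRecord() description above concrete, here is a minimal sketch of one event being recorded on each of two streams. The helper name run_two_streams, the scale kernel, and the buffer size are made up for illustration and are not from the original application. The event_flags parameter only shows where cudaEventDisableTiming would be passed; the sketch says nothing about how Nsight Systems implements its trace internally.

```cuda
#include <cuda_runtime.h>

// Return early on the first CUDA error (sketch only: no cleanup on failure).
#define TRY(call)                              \
    do {                                       \
        cudaError_t err_ = (call);             \
        if (err_ != cudaSuccess) return err_;  \
    } while (0)

// Trivial kernel so each stream has some work queued before its event.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical helper: queues work on two streams and records one event per
// stream. Pass cudaEventDefault (timing enabled) or cudaEventDisableTiming.
cudaError_t run_two_streams(unsigned int event_flags) {
    const int kStreams = 2;
    const int n = 1 << 20;
    float *buf[kStreams] = {nullptr, nullptr};
    cudaStream_t stream[kStreams];
    cudaEvent_t done[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        TRY(cudaStreamCreate(&stream[s]));
        TRY(cudaEventCreateWithFlags(&done[s], event_flags));
        TRY(cudaMalloc(&buf[s], n * sizeof(float)));
        TRY(cudaMemsetAsync(buf[s], 0, n * sizeof(float), stream[s]));
        scale<<<(n + 255) / 256, 256, 0, stream[s]>>>(buf[s], n);
        TRY(cudaGetLastError());
        // The event captures everything queued on stream[s] up to this call.
        TRY(cudaEventRecord(done[s], stream[s]));
    }

    for (int s = 0; s < kStreams; ++s) {
        // Returns once all work captured by the event has finished,
        // i.e. the "event completion" that the trace marks on the stream row.
        TRY(cudaEventSynchronize(done[s]));
        TRY(cudaEventDestroy(done[s]));
        TRY(cudaStreamDestroy(stream[s]));
        TRY(cudaFree(buf[s]));
    }
    return cudaSuccess;
}
```

Calling run_two_streams(cudaEventDefault) and run_two_streams(cudaEventDisableTiming) from a small main() and profiling each variant, with and without --cuda-event-trace=false, is one way to compare the overhead on your own system.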
