Hi, I’m trying to trace an application with multiple CUDA streams. Running with the command line
mpirun -np xx ... nsys profile -t cuda,mpi ./app
gives me the following warning in the output file:
WARNING: Device-side CUDA Event completion trace is currently enabled.
This may increase runtime overhead and the likelihood of false
dependencies across CUDA Streams. If you wish to avoid this, please
disable the feature with --cuda-event-trace=false.
What does device-side event completion mean and why would this cause false dependencies across streams?
I already notice a very large difference in runtime when traced with this feature enabled.
Hi @souza, the “device-side event completion trace” is related to cudaEventRecord() calls. When cudaEventRecord() is called on a stream, the event captures the work that has been submitted to that stream so far. This trace feature captures the moment when all of that captured work finishes running on the device (a.k.a. CUDA event completion) and shows a marker on the corresponding stream row.
The underlying trace mechanism is similar to the one used for CUDA events' own timing functionality, i.e. when the cudaEventDisableTiming flag is NOT set at event creation (cudaEventCreate() always leaves timing enabled; cudaEventCreateWithFlags() lets you turn it off). This mechanism can sometimes cause false dependencies among CUDA streams.
If you notice a significant runtime change with it enabled, please disable the feature by adding --cuda-event-trace=false to your nsys profile command line.
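To make the terms above concrete, here is a minimal sketch of what "recording an event on a stream" and "event completion" look like in code. It is only an illustration under assumptions: the kernel busy_kernel, the buffer size, and the helper run_demo are hypothetical and not taken from the original application. It does not attempt to reproduce the false dependency itself; it only shows the two kinds of events (timing enabled vs. created with cudaEventDisableTiming) and the point at which each one completes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for real work on a stream.
__global__ void busy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Bail out of run_demo() with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

int run_demo() {
    const int n = 1 << 20;  // arbitrary size, chosen for illustration
    float *d_a = nullptr, *d_b = nullptr;
    CHECK(cudaMalloc(&d_a, n * sizeof(float)));
    CHECK(cudaMalloc(&d_b, n * sizeof(float)));
    CHECK(cudaMemset(d_a, 0, n * sizeof(float)));
    CHECK(cudaMemset(d_b, 0, n * sizeof(float)));

    cudaStream_t s0, s1;
    CHECK(cudaStreamCreate(&s0));
    CHECK(cudaStreamCreate(&s1));

    // Default flags: timing is enabled for this event.
    cudaEvent_t timed;
    CHECK(cudaEventCreate(&timed));

    // Timing explicitly disabled: usable for synchronization only.
    cudaEvent_t untimed;
    CHECK(cudaEventCreateWithFlags(&untimed, cudaEventDisableTiming));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Each event captures the work submitted to its stream so far.
    busy_kernel<<<blocks, threads, 0, s0>>>(d_a, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(timed, s0));

    busy_kernel<<<blocks, threads, 0, s1>>>(d_b, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(untimed, s1));

    // An event "completes" on the device once the captured work is done;
    // cudaEventSynchronize blocks the host until that point.
    CHECK(cudaEventSynchronize(timed));
    CHECK(cudaEventSynchronize(untimed));

    CHECK(cudaEventDestroy(timed));
    CHECK(cudaEventDestroy(untimed));
    CHECK(cudaStreamDestroy(s0));
    CHECK(cudaStreamDestroy(s1));
    CHECK(cudaFree(d_a));
    CHECK(cudaFree(d_b));
    return 0;
}

int main() { return run_demo(); }
```

Assuming the file is saved as event_demo.cu (a hypothetical name), it can be built with nvcc event_demo.cu -o event_demo and profiled with and without --cuda-event-trace=false to compare the stream rows in the timeline.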