We did not find information in the NVIDIA documentation about how to proceed with what is explained in this blog post: http://cedric-augonnet.com/declaring-dependencies-with-cudastreamwaitevent/
We have an iterative application where, on each iteration, we execute the same kernels and memory transfers on different data.
In some parts of the code, before writing results back to the CPU, we execute some independent kernels that can run in parallel (if there are enough resources). Some of them are very small, and by running them in parallel we avoid the wait time between kernels.
To do so, we use a fork-join strategy in which we have pre-allocated N streams, besides the main one, and N*2 events.
The first N events are used to fork: we synchronize each of the N additional streams with the main stream, before anything enqueued in the additional streams executes.
The second set of N events is used to join: we make sure that anything enqueued in the main stream will not execute until everything in the N additional streams is executed.
We follow the strategy explained in the link above, except that we pre-allocate all the events and streams and reuse them on each iteration.
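To make the question concrete, this is a minimal sketch of the fork-join pattern we described, with pre-allocated streams and events reused every iteration. All names (N, forkEvents, joinEvents, sideStreams) are illustrative, not our actual code:

```cuda
#include <cuda_runtime.h>

#define N 13  // number of additional (side) streams

cudaStream_t mainStream, sideStreams[N];
cudaEvent_t  forkEvents[N], joinEvents[N];

void setupOnce() {
    cudaStreamCreate(&mainStream);
    for (int i = 0; i < N; ++i) {
        cudaStreamCreate(&sideStreams[i]);
        // cudaEventDisableTiming makes events cheaper when used only for ordering
        cudaEventCreateWithFlags(&forkEvents[i], cudaEventDisableTiming);
        cudaEventCreateWithFlags(&joinEvents[i], cudaEventDisableTiming);
    }
}

void iteration() {
    // ... work enqueued on mainStream ...

    // Fork: each side stream waits until mainStream reaches this point.
    for (int i = 0; i < N; ++i) {
        cudaEventRecord(forkEvents[i], mainStream);
        cudaStreamWaitEvent(sideStreams[i], forkEvents[i], 0);
        // independentKernel<<<grid, block, 0, sideStreams[i]>>>(...);
    }

    // Join: mainStream waits until everything enqueued on each side stream
    // so far has executed.
    for (int i = 0; i < N; ++i) {
        cudaEventRecord(joinEvents[i], sideStreams[i]);
        cudaStreamWaitEvent(mainStream, joinEvents[i], 0);
    }

    // ... subsequent work on mainStream (e.g. the copy back to the CPU) ...
}
```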
Is this approach conceptually correct?
We had some CUDA errors shown in Nsight if we did the following:
1. Record an event “A” on the main stream.
2. Enqueue a cudaStreamWaitEvent on each of the N streams, always using the same event “A”.
Is this approach incorrect? Some other parts of the code do the same thing and don’t trigger any errors in Nsight. The main difference is that there the event is used with cudaStreamWaitEvent on only two or three streams (max N = 3), whereas in the case described above N = 13.
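For reference, the single-event variant we mean is the following (eventA and sideStreams are placeholder names). As far as we understand the CUDA programming model, having many streams wait on one event is allowed, provided the event is not re-recorded before all the waits have been enqueued:

```cuda
// Step 1: capture the main stream's progress in a single event.
cudaEventRecord(eventA, mainStream);

// Step 2: make all N side streams wait on that same event.
for (int i = 0; i < N; ++i)
    cudaStreamWaitEvent(sideStreams[i], eventA, 0);
```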