I have been trying to write double-buffering code (to overlap communication with computation) using CUDA streams on NVIDIA GPUs.
At a high level, the desired pseudo-code is as follows:
1. Divide the GPU device memory into two equal buffers so as to alternate between compute and communication.
2. Associate each device buffer with a buffer in host memory (pinned memory).
3. Set up two CUDA streams for doing async copies between the host buffers and the device buffers.
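To make the setup concrete, here is a minimal sketch of what I mean (the buffer names, `CHUNK_BYTES`, and the `CHECK` macro are my own placeholders, not part of any fixed API):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

enum { NBUF = 2, CHUNK_BYTES = 1 << 20 };

float *h_buf[NBUF];        // pinned host staging buffers
float *d_buf[NBUF];        // device buffers (the two halves of device memory)
cudaStream_t stream[NBUF]; // one stream per host/device buffer pair

void setup(void) {
    for (int i = 0; i < NBUF; ++i) {
        // cudaMallocHost gives pinned memory, required for truly async copies
        CHECK(cudaMallocHost((void **)&h_buf[i], CHUNK_BYTES));
        CHECK(cudaMalloc((void **)&d_buf[i], CHUNK_BYTES));
        CHECK(cudaStreamCreate(&stream[i]));
    }
}
```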
As and when chunks become available from the network in the host's main memory, I would like to hand each one to whichever stream is available, so it can be copied into that stream's buffer in GPU device memory.
I have created two streams, each associated with one of the device memory buffers, and whenever I get a chunk from the network I launch the kernel on one of the available streams.
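Per chunk, the work I issue looks roughly like this (`my_kernel`, `grid`, `block`, and the per-stream `stop_event` are placeholders for my actual code):

```cuda
// Issue the copy and the kernel on the chosen stream i so that both are
// asynchronous with respect to the host thread.
cudaMemcpyAsync(d_buf[i], h_buf[i], CHUNK_BYTES,
                cudaMemcpyHostToDevice, stream[i]);
my_kernel<<<grid, block, 0, stream[i]>>>(d_buf[i]);
// Record an event so I can later tell when this stream's work has finished.
cudaEventRecord(stop_event[i], stream[i]);
```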
Now, I have the following question about making the code completely asynchronous.
In order to asynchronously assign a chunk from the network to one of the streams, how do I find out which streams are available without using cudaEventSynchronize(stop_event)? (That call forces me to block and synchronize.)
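For reference, this is the blocking pattern I want to avoid: before reusing a buffer for the next chunk, I currently have to wait on its event (`next_chunk` and `fill_from_network` are placeholders for my receive loop):

```cuda
// Blocking version: pick a buffer round-robin and wait until its
// previously issued work has drained before reusing it.
int i = next_chunk % NBUF;
cudaEventSynchronize(stop_event[i]); // blocks the host thread -- the call I want to eliminate
fill_from_network(h_buf[i]);         // placeholder for my network receive
cudaMemcpyAsync(d_buf[i], h_buf[i], CHUNK_BYTES,
                cudaMemcpyHostToDevice, stream[i]);
```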