Multi-buffering using streams in CUDA programming


I have been trying to write double-buffering code (to overlap communication with computation) using CUDA streams on NVIDIA GPUs.

At a high level, the desired pseudo-code is as follows:


   Divide the GPU device memory into two equal buffers, so as to alternate between compute and communication;
   Associate each device buffer with a buffer in host memory (pinned memory);
   Set up 2 CUDA streams for doing async copies between the host buffers and the device buffers;
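The setup steps above might be sketched like this (the buffer count, chunk size, and names are placeholders; pinned host memory is what makes cudaMemcpyAsync truly asynchronous):

```cuda
#include <cuda_runtime.h>

#define NSTREAMS 2
// Hypothetical chunk size; set it to your network chunk size.
#define CHUNK_BYTES (1 << 20)

float        *h_buf[NSTREAMS];   // pinned host staging buffers
float        *d_buf[NSTREAMS];   // device buffers (two halves of device memory)
cudaStream_t  stream[NSTREAMS];  // one stream per buffer pair

void setup(void)
{
    for (int i = 0; i < NSTREAMS; ++i) {
        // Page-locked (pinned) host memory is required for async copies
        cudaMallocHost((void **)&h_buf[i], CHUNK_BYTES);
        cudaMalloc((void **)&d_buf[i], CHUNK_BYTES);
        cudaStreamCreate(&stream[i]);
    }
}
```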



As and when chunks become available from the network in the host's main memory, I would like to hand each one to an available stream so it can be copied to GPU device memory.


I have created 2 streams, each associated with one of the device memory buffers, and whenever I get a chunk from the network I launch the kernel on one of the available streams.
Now, I have the following question about making the code completely asynchronous.
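The per-chunk dispatch described above might look like the following sketch (the `process` kernel, launch configuration, and buffer names are assumptions, not from the original post); issuing the copy and the kernel on the same stream orders them correctly without blocking the host:

```cuda
#include <cuda_runtime.h>
#include <string.h>

// Hypothetical kernel operating on one chunk of device data.
__global__ void process(float *data);

// h_buf, d_buf, stream, and CHUNK_BYTES as set up earlier.
void dispatch_chunk(const void *chunk, int s)
{
    dim3 grid(256), block(256);               // placeholder launch config

    // Stage the network chunk into pinned memory, then enqueue the
    // async host-to-device copy and the kernel on the same stream.
    memcpy(h_buf[s], chunk, CHUNK_BYTES);
    cudaMemcpyAsync(d_buf[s], h_buf[s], CHUNK_BYTES,
                    cudaMemcpyHostToDevice, stream[s]);
    process<<<grid, block, 0, stream[s]>>>(d_buf[s]);
    // Control returns to the host immediately; the copy and kernel
    // run in the background on stream s.
}
```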

In order to asynchronously assign a chunk from the network to one of the streams, how would I find out which streams are available without using cudaEventSynchronize(stop_event)? (That call forces me to synchronize.)


I’m not sure if I understand your question correctly, but maybe cudaStreamQuery(cudaStream_t stream) is what you are looking for. From the reference manual:

Returns cudaSuccess if all operations in stream have completed, or cudaErrorNotReady if not.

Parameters: stream - Stream identifier

Returns: cudaSuccess, cudaErrorNotReady, cudaErrorInvalidResourceHandle
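A non-blocking selection loop built on cudaStreamQuery might look like this sketch (the function name and the -1 convention are my own; only cudaStreamQuery itself is from the manual):

```cuda
#include <cuda_runtime.h>

// Return the index of a stream whose previously enqueued work has all
// completed, without blocking the host. Returns -1 if every stream is
// still busy, so the caller can buffer the chunk and retry later.
int find_free_stream(cudaStream_t streams[], int n)
{
    for (int i = 0; i < n; ++i) {
        cudaError_t err = cudaStreamQuery(streams[i]);
        if (err == cudaSuccess)
            return i;   // stream i is idle and can take the next chunk
        // cudaErrorNotReady just means the stream is still working;
        // any other return value is a real error the caller should handle.
    }
    return -1;
}
```

When a chunk arrives you would call find_free_stream, and only if it returns a valid index enqueue the cudaMemcpyAsync and kernel launch on that stream; otherwise the chunk waits on the host without any blocking synchronization call.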