I have been trying to write double-buffering code (to overlap communication with computation) using CUDA streams on NVIDIA GPUs.
At a high level, the desired pseudo-code is as follows:
1. Divide the GPU device memory into two equal buffers so as to alternate between compute and communication.
2. Associate each device buffer with a buffer in host memory (pinned memory).
3. Set up two CUDA streams for doing async copies between the host buffers and the device buffers.
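To make the setup concrete, here is a minimal sketch of what I mean (the buffer names, `CHUNK_BYTES`, and the `CHECK` macro are my own placeholders, not part of any fixed API):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

enum { NBUF = 2, CHUNK_BYTES = 1 << 20 };

float *h_buf[NBUF];        // pinned host staging buffers
float *d_buf[NBUF];        // device buffers (the two halves of device memory)
cudaStream_t stream[NBUF]; // one stream per host/device buffer pair

void setup(void) {
    for (int i = 0; i < NBUF; ++i) {
        // cudaMallocHost gives pinned memory, required for truly async copies
        CHECK(cudaMallocHost((void **)&h_buf[i], CHUNK_BYTES));
        CHECK(cudaMalloc((void **)&d_buf[i], CHUNK_BYTES));
        CHECK(cudaStreamCreate(&stream[i]));
    }
}
```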
As and when chunks become available from the network in the host's main memory, I would like to hand each one to whichever stream is available, so it can be copied into that stream's buffer in GPU device memory.
I have created two streams, each associated with one of the device memory buffers, and whenever I get a chunk from the network I launch the kernel on one of the available streams.
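Per chunk, the work I issue looks roughly like this (`my_kernel`, `grid`, `block`, and the per-stream `stop_event` are placeholders for my actual code):

```cuda
// Issue the copy and the kernel on the chosen stream i so that both are
// asynchronous with respect to the host thread.
cudaMemcpyAsync(d_buf[i], h_buf[i], CHUNK_BYTES,
                cudaMemcpyHostToDevice, stream[i]);
my_kernel<<<grid, block, 0, stream[i]>>>(d_buf[i]);
// Record an event so I can later tell when this stream's work has finished.
cudaEventRecord(stop_event[i], stream[i]);
```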
Now, I have the following question about making the code completely asynchronous.
In order to asynchronously assign a chunk from the network to one of the streams, how do I find out which streams are available without using cudaEventSynchronize(stop_event)? (That call forces me to block and synchronize.)
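For reference, this is the blocking pattern I want to avoid: before reusing a buffer for the next chunk, I currently have to wait on its event (`next_chunk` and `fill_from_network` are placeholders for my receive loop):

```cuda
// Blocking version: pick a buffer round-robin and wait until its
// previously issued work has drained before reusing it.
int i = next_chunk % NBUF;
cudaEventSynchronize(stop_event[i]); // blocks the host thread -- the call I want to eliminate
fill_from_network(h_buf[i]);         // placeholder for my network receive
cudaMemcpyAsync(d_buf[i], h_buf[i], CHUNK_BYTES,
                cudaMemcpyHostToDevice, stream[i]);
```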