Hi everybody,
As the number of processing units keeps increasing, and (some) people have realized that wasting CPUs is not the way to go, it is getting more and more important to correctly handle the interactions between different GPUs, or between CPUs and GPUs. It is quite clear that the best option to hide such overhead is to overlap communication with computation, and more generally to use asynchronous mechanisms whenever possible.
Unfortunately, the CUDA driver only offers very limited asynchronous capabilities: the number of async transfers the driver can handle at the same time is limited, and that limit does not seem to be predictable in any reliable way. When the application submits an asynchronous request while there are no more slots, the non-blocking call magically becomes blocking :)
Now here is my problem: say I have a lot of data transfers, and I need to query whether a specific transfer has completed or not. The problem with streams is that they are a very coarse-grained synchronization mechanism: you can only check whether all operations previously submitted to a stream have completed. So I have two solutions at the moment; however, I don't know which one is best, or which should remain best in the future.
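To make the granularity issue concrete, here is a minimal sketch (the buffer names are just placeholders, and the host buffers are assumed to be page-locked so the copies are really asynchronous):

#include <cuda_runtime.h>

/* Two transfers in one stream: cudaStreamQuery can only tell us about
 * ALL work in the stream, not about one specific transfer. */
void coarse_grain_example(void *dst_a, const void *src_a, size_t size_a,
                          void *dst_b, const void *src_b, size_t size_b)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* src_a/src_b should come from cudaMallocHost, otherwise the
     * "async" copies silently become synchronous */
    cudaMemcpyAsync(dst_a, src_a, size_a, cudaMemcpyHostToDevice, stream);
    cudaMemcpyAsync(dst_b, src_b, size_b, cudaMemcpyHostToDevice, stream);

    if (cudaStreamQuery(stream) == cudaSuccess) {
        /* both transfers have completed */
    } else {
        /* at least one is still pending, but there is no way to ask
         * whether the first one alone has completed */
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}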
** Solution 1 **
I can create one stream per data transfer and synchronize directly with that stream, which I can destroy once the transfer has completed. According to some quick'n'dirty microbenchmarks this looks really costly, and since every transfer must be checked independently, such an algorithm may not scale very well with the number of transfers (a sketch follows the timings below).
cudaStreamCreate 4.51 us
cudaStreamQuery 0.10 us
cudaStreamSynchronize 0.29 us
cudaThreadSynchronize 0.26 us
cudaStreamDestroy 0.19 us
And for the record, with the driver API we have
cuStreamCreate 0.31 us << note the HUGE difference with the runtime API.
cuStreamQuery 0.25 us
cuStreamSynchronize 0.29 us
cuCtxSynchronize 0.25 us
cuStreamDestroy 0.30 us
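Here is roughly what Solution 1 looks like with the runtime API. This is only a sketch under the usual assumptions (page-locked host buffers, error checking omitted); transfer_t, submit_transfer and transfer_done are names I made up for the example:

#include <cuda_runtime.h>

/* Solution 1 sketch: one stream per transfer, so each transfer can be
 * queried (and its stream destroyed) individually. */
typedef struct {
    cudaStream_t stream;
} transfer_t;

void submit_transfer(transfer_t *t, void *dst, const void *src, size_t size)
{
    cudaStreamCreate(&t->stream);   /* ~4.5 us with the runtime API... */
    /* src must be page-locked (cudaMallocHost) for the copy to be async */
    cudaMemcpyAsync(dst, src, size, cudaMemcpyHostToDevice, t->stream);
}

/* Returns 1 if this specific transfer has completed, 0 otherwise. */
int transfer_done(transfer_t *t)
{
    if (cudaStreamQuery(t->stream) == cudaSuccess) {
        cudaStreamDestroy(t->stream);
        return 1;
    }
    return 0;                       /* cudaErrorNotReady: still pending */
}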
** Solution 2 **
Another synchronization mechanism is the API based on cudaEvent_t. One solution would be: every time we submit an asynchronous data transfer to a (single) stream, we record an event in the same stream, and we query/synchronize on that event to check whether the associated data transfer has completed. This in-stream serialization is implicit in the documentation, since the documented event-based timing mechanism would not work otherwise. This approach is also more scalable, since we can stop checking the different data transfers as soon as we hit an event that has not completed.
On paper, events look much lighter to create than streams, but according to microbenchmarks on the same machine... I get very similar overheads:
cudaEventCreate 4.54 us
cudaEventDestroy 0.39 us
Also, by recording an event every time I submit an asynchronous request, I have twice as many items in the stream, so the "non-blocking" functions are even more likely to become blocking. In short, this approach looks more scalable and should be much lighter, but the current CUDA driver implementation seems to have some design issues.
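For completeness, a sketch of Solution 2 under the same assumptions (pinned host memory, no error checking; N_TRANSFERS, submit_all and count_done are made up for the example). Since work in a stream completes in order, polling can stop at the first pending event:

#include <cuda_runtime.h>

#define N_TRANSFERS 16  /* arbitrary value for the sketch */

/* Solution 2 sketch: a single stream, one event recorded right after
 * each asynchronous transfer to "tag" it. */
cudaStream_t stream;
cudaEvent_t  events[N_TRANSFERS];

void submit_all(void *dst[], void *src[], size_t size[])
{
    int i;
    cudaStreamCreate(&stream);
    for (i = 0; i < N_TRANSFERS; i++) {
        cudaEventCreate(&events[i]);
        cudaMemcpyAsync(dst[i], src[i], size[i],
                        cudaMemcpyHostToDevice, stream);
        cudaEventRecord(events[i], stream);
    }
}

/* Returns the number of completed transfers. Events complete in
 * submission order, so we stop at the first one still pending. */
int count_done(void)
{
    int i;
    for (i = 0; i < N_TRANSFERS; i++)
        if (cudaEventQuery(events[i]) != cudaSuccess)
            break;          /* everything after this one is pending too */
    return i;
}

The nice property is that count_done() performs one query per completed transfer plus one, instead of querying every transfer every time.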
** At last, my questions :) **
- I guess I'm not the only person dealing with such asynchronous data transfers, so I suppose other people have run into similar issues. What are your approaches?
- CUDA 3.0 is expected soon and should bring support for Fermi. Programming Fermi in a totally synchronous way looks like a terrible idea, and its design should naturally force people to think in a more asynchronous way (at last!!). Could anyone with some insight into the driver design tell me whether one specific approach is likely to win out in the future? I can imagine that the problem of functions silently becoming blocking will be fixed at some point, and that events will become lighter than streams... What do you see as the best synchronization mechanism?
In the longer term, having a single stream of serialized requests (tagged with events) will be limiting, since it won't let the driver reorder data transfers (and report such optimizations to the user). Such a design will fall short once we have a good DMA controller. So what is the best way to handle many asynchronous requests in CUDA today, in a way that should still make sense in the future?