What's the best way to handle many async transfers? Streams vs. events / driver design issues

Hi everybody,

As the number of processing units keeps increasing, and as (some) people have realized that wasting CPUs is not the way to go, it is becoming more and more important to handle the interactions between GPUs, or between CPUs and GPUs, correctly. The best way to hide that overhead is clearly to overlap communication with computation, and more generally to use asynchronous mechanisms whenever possible.
Unfortunately, the CUDA driver only offers very limited asynchronous capabilities: the number of async transfers that the driver can handle at the same time is limited, and that limit does not seem to be predictable in any reliable way. When the application submits an asynchronous request and no slot is left, the non-blocking call magically becomes blocking :)

Now here is my problem: say I have a lot of data transfers, and I need to query whether a specific transfer has completed. The problem with streams is that they are a very coarse-grained synchronization medium: you can only check whether all the transfers in a stream have completed. I see two solutions at the moment, but I don’t know which one is best now, or which will remain the best in the future.

** Solution 1 **

I can create one stream per data transfer, synchronize directly with that stream, and destroy it once the transfer is done. According to some quick’n’dirty microbenchmarks this looks really costly, and since every transfer must be checked independently, such an algorithm may not scale very well with the number of transfers.

cudaStreamCreate 4.51 us
cudaStreamQuery 0.10 us
cudaStreamSynchronize 0.29 us
cudaThreadSynchronize 0.26 us
cudaStreamDestroy 0.19 us

And for the record, with the driver API we have
cuStreamCreate 0.31 us << note the HUGE difference with runtime API.
cuStreamQuery 0.25 us
cuStreamSynchronize 0.29 us
cuCtxSynchronize 0.25 us
cuStreamDestroy 0.30 us
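
Solution 1 can be sketched roughly as follows with the runtime API. This is only a minimal illustration: `Transfer`, `transfer_start` and `transfer_test` are hypothetical names made up for this example, only host-to-device copies are shown, and the host buffer is assumed to be pinned (allocated with `cudaMallocHost`), since otherwise the copy is not truly asynchronous.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical bookkeeping for one in-flight transfer that owns its own stream.
struct Transfer {
    cudaStream_t stream;
    bool done;
};

// Create a dedicated stream and enqueue one host-to-device copy on it.
// `src` is assumed to point to pinned host memory.
cudaError_t transfer_start(Transfer *t, void *dst, const void *src, size_t bytes)
{
    t->done = true;  // nothing to poll unless the submission succeeds
    cudaError_t err = cudaStreamCreate(&t->stream);
    if (err != cudaSuccess)
        return err;
    err = cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, t->stream);
    if (err != cudaSuccess) {
        cudaStreamDestroy(t->stream);
        return err;
    }
    t->done = false;
    return cudaSuccess;
}

// Non-blocking test of a single transfer. Returns true once the stream has
// drained; the stream is destroyed at that point. A real implementation
// should also report errors other than cudaErrorNotReady to the caller.
bool transfer_test(Transfer *t)
{
    if (t->done)
        return true;
    if (cudaStreamQuery(t->stream) == cudaErrorNotReady)
        return false;
    cudaStreamDestroy(t->stream);
    t->done = true;
    return true;
}
```

Each pending transfer has to be polled with its own `transfer_test` call, which is exactly the per-transfer cost that makes this approach scale poorly.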

** Solution 2 **

Another synchronization mechanism is the API based on cudaEvent_t. When we submit an asynchronous data transfer to a (unique) stream, we then record an event in the same stream, and we query/synchronize on the event to check whether the associated data transfer has completed. This ordering of events within the stream is implicit in the documentation, since the documented event-based timing mechanisms would not work otherwise. This approach is also more scalable: we can stop checking transfers as soon as we hit an event that has not completed yet, because nothing submitted after it in the same stream can have completed either.
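
A minimal sketch of this scheme, assuming the runtime API and pinned host buffers, could look like the code below. `TransferQueue`, the `queue_*` functions and the "ticket" (the submission index of a transfer) are hypothetical names invented for this illustration; only host-to-device copies are shown.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <deque>

// Hypothetical bookkeeping: one stream, plus one event per pending transfer,
// kept in submission order (oldest first).
struct TransferQueue {
    cudaStream_t stream;
    std::deque<cudaEvent_t> pending;
    size_t completed;  // number of transfers known to have completed
};

cudaError_t queue_init(TransferQueue *q)
{
    q->completed = 0;
    return cudaStreamCreate(&q->stream);
}

// Enqueue a copy, then record an event right behind it in the same stream.
// On success, *ticket receives the submission index of this transfer.
cudaError_t queue_submit(TransferQueue *q, void *dst, const void *src,
                         size_t bytes, size_t *ticket)
{
    cudaEvent_t ev;
    cudaError_t err = cudaEventCreate(&ev);
    if (err != cudaSuccess)
        return err;
    err = cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, q->stream);
    if (err == cudaSuccess)
        err = cudaEventRecord(ev, q->stream);
    if (err != cudaSuccess) {
        cudaEventDestroy(ev);
        return err;
    }
    *ticket = q->completed + q->pending.size();
    q->pending.push_back(ev);
    return cudaSuccess;
}

// Retire completed events starting from the oldest one, and stop at the
// first event that has not completed. Returns the number of completed transfers.
size_t queue_progress(TransferQueue *q)
{
    while (!q->pending.empty() &&
           cudaEventQuery(q->pending.front()) == cudaSuccess) {
        cudaEventDestroy(q->pending.front());
        q->pending.pop_front();
        q->completed++;
    }
    return q->completed;
}

// Non-blocking test of one specific transfer.
bool queue_is_done(TransferQueue *q, size_t ticket)
{
    return ticket < queue_progress(q);
}

// Wait for everything still in flight, then release events and stream.
void queue_destroy(TransferQueue *q)
{
    cudaStreamSynchronize(q->stream);
    queue_progress(q);
    cudaStreamDestroy(q->stream);
}
```

Because the events complete in submission order, `queue_progress` never has to look past the first unfinished event, whatever the number of pending transfers.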

On paper, events look much lighter to initialize than streams, but according to microbenchmarks on the same machine … I get very similar overheads:
cudaEventCreate 4.54 us
cudaEventDestroy 0.39 us
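
For anyone who wants to compare on their own machine, per-call overheads of this kind can be estimated with a simple host-side timing loop. The sketch below is only an illustration and not necessarily the harness behind the numbers above; `CallCost` and `time_event_calls` are hypothetical names, and it assumes a compiler with C++11 `<chrono>` support.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <vector>

// Hypothetical result type: average host-side cost per call, in microseconds.
struct CallCost {
    double create_us;
    double destroy_us;
};

// Time n cudaEventCreate calls, then n cudaEventDestroy calls, and average.
CallCost time_event_calls(int n)
{
    typedef std::chrono::steady_clock timer_clock;
    std::vector<cudaEvent_t> events(n);

    // Force context creation first so it is not billed to the first call.
    cudaFree(0);

    timer_clock::time_point t0 = timer_clock::now();
    for (int i = 0; i < n; i++)
        cudaEventCreate(&events[i]);
    timer_clock::time_point t1 = timer_clock::now();
    for (int i = 0; i < n; i++)
        cudaEventDestroy(events[i]);
    timer_clock::time_point t2 = timer_clock::now();

    std::chrono::duration<double, std::micro> create = t1 - t0;
    std::chrono::duration<double, std::micro> destroy = t2 - t1;
    CallCost cost = { create.count() / n, destroy.count() / n };
    return cost;
}
```

The same loop structure applies to `cudaStreamCreate` / `cudaStreamDestroy`; results will depend on the driver version and the machine.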

Also, by submitting an event every time I submit an asynchronous request, I have twice as many items in the stream, so the “non-blocking” functions are even more likely to become blocking. So this approach looks more scalable, and should be much lighter, but the current CUDA driver implementation seems to have some design issues.

** Finally, my questions :) **

  • I guess I’m not the only person dealing with such asynchronous data transfers, so I suppose other people have run into similar issues. What are your approaches?
  • CUDA 3.0 is expected soon, and should bring us support for Fermi. Programming Fermi in a totally synchronous way looks like a terrible idea, and its design should naturally force people to think in a more asynchronous way (at last!!). Could anyone with some insight into the driver design tell me whether there is one specific approach that should be successful in the future? I can imagine that the problem of functions silently becoming blocking will be fixed at some point, and I can imagine that events will become lighter than streams… What do you people see as the best synchronization mechanism?

In the longer term, having a single stream with serialized requests (events) will be limiting, since it won’t let the driver reorder data transfers (and report such optimizations to the user). Such a design will fall short once we have a good DMA controller. So what is the best way to handle many asynchronous requests in CUDA now, in a way that should still make sense in the future?