In an application consisting of several concurrent kernels running in separate streams (e.g., A and B), is it possible to wait for the first stream to complete (i.e., block the calling host thread until either A or B completes)?
cudaStreamSynchronize() does this only for a single stream; is there anything like a UNIX select() on a set of streams in CUDA?
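To make the question concrete, here’s a sketch of the primitive I’m wishing for; cudaStreamWaitAny() is a made-up name, it does not exist in CUDA 2.3:

```c
/* Hypothetical, NOT a real CUDA call: block the calling host thread
 * until any of the given streams has completed all its queued work,
 * and return the index of that stream (select()-style semantics). */
int cudaStreamWaitAny(cudaStream_t *streams, int numStreams);

/* Desired usage with two streams A and B: */
cudaStream_t streams[2] = { streamA, streamB };
int first = cudaStreamWaitAny(streams, 2);  /* 0 => A finished first */
```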
I don’t know if it’s the most efficient way, but you could just have a while(true) loop that calls cuStreamQuery() on each stream and breaks out as soon as one of them reports the stream has finished (i.e., returns CUDA_SUCCESS). You’ll want to add a call to sleep() in there as well (so you don’t peg the processor with your loop), but you’ll have to try different values to see what works best, depending on how long your kernels run.
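For what it’s worth, here’s a minimal sketch of that polling loop, written against the runtime API (cudaStreamQuery() instead of cuStreamQuery()) and POSIX usleep(); the sleep interval is the value you’d have to tune:

```c
#include <cuda_runtime.h>
#include <unistd.h>

/* Poll a set of streams until one of them has finished all queued
 * work; returns the index of the first stream found to be done. */
int waitForAnyStream(cudaStream_t *streams, int numStreams,
                     unsigned int sleepUs)
{
    for (;;) {
        for (int i = 0; i < numStreams; ++i) {
            /* cudaStreamQuery() returns cudaSuccess once the stream
             * is idle, cudaErrorNotReady while work is still running. */
            if (cudaStreamQuery(streams[i]) == cudaSuccess)
                return i;
        }
        usleep(sleepUs);  /* back off so we don't peg a CPU core */
    }
}
```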
Yes, I already considered this, but it’s not really elegant and, as you noted, probably also inefficient in many cases, so I was hoping for a cleaner solution. However, after some more research I think I need a totally different approach anyway. I tried to invoke several kernels in parallel within the same CUDA context and thought streams could do that: the CUDA 2.3 programming guide says in section 3.2.6.1, “Different streams … may execute their commands out of order with respect to one another or concurrently”. On the other hand, “sdk/C/src/simpleStreams/readme.txt” in the CUDA SDK says “Kernels are serialized”. After some experimenting I found out that the latter is true (I wish this had been clarified in the manual as well).
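In case anyone wants to reproduce the experiment, this is roughly what I ran (simplified; busyKernel is just a stand-in for any kernel with a measurable runtime). If the kernels overlapped, the two-stream time would be close to that of a single launch; what I measured instead was consistent with the kernels running one after the other:

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Dummy kernel that just burns cycles for a measurable amount of time. */
__global__ void busyKernel(int spin)
{
    volatile int x = 0;
    for (int i = 0; i < spin; ++i) x += i;
}

int main(void)
{
    cudaStream_t s[2];
    cudaEvent_t start, stop;
    float ms;

    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* Launch the same kernel into two different streams and time both.
     * Events recorded in stream 0 are ordered after work in all other
     * streams, so "stop" fires once both kernels are done. */
    cudaEventRecord(start, 0);
    busyKernel<<<1, 1, 0, s[0]>>>(1 << 24);
    busyKernel<<<1, 1, 0, s[1]>>>(1 << 24);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);

    /* If this is ~2x the time of a single launch, the kernels did
     * not overlap, i.e. they were serialized. */
    printf("two streams: %.1f ms\n", ms);
    return 0;
}
```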
It looks like command queues in OpenCL could solve the problem (i.e., make the “wait for any” primitive obsolete), so let me reformulate my question: is there anything like OpenCL command queues in CUDA, or any way to achieve similar behaviour (or is something like this maybe planned for a future CUDA release)?
I tried to build a task dependency graph (similar to the command queue in OpenCL). My first attempt was to do all the synchronization on the host, using separate host threads for concurrent paths in the graph. This failed because device memory pointers can’t be shared across host threads (each host thread gets its own CUDA context). When I tried to accomplish the same thing in a single host thread, I needed a way to find out when any kernel has completed, because at that point a new task becomes ready for execution (i.e., a new kernel should be launched). This failed as well, since kernel calls are serialized even if they occur in different streams :-(
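For reference, here’s a sketch of the single-host-thread scheduler I was attempting; Task, launch, succ etc. are made-up names, and only cudaStreamQuery()/usleep() are real API. The “wait for any” polling part works, but on CUDA 2.3 the launched kernels still execute one after another, which is exactly the problem:

```c
#include <cuda_runtime.h>
#include <unistd.h>

/* Hypothetical node of the dependency graph. */
typedef struct Task {
    void (*launch)(cudaStream_t);   /* enqueues this task's kernel   */
    int unmetDeps;                  /* unfinished predecessor count  */
    struct Task **succ;             /* tasks waiting on this one     */
    int numSucc;
} Task;

/* ready: stack of runnable tasks (must have room for the whole graph);
 * slots[i] is the task currently occupying streams[i], or NULL. */
void runGraph(Task **ready, int numReady,
              cudaStream_t *streams, Task **slots, int numStreams)
{
    int inFlight = 0;
    while (numReady > 0 || inFlight > 0) {
        /* Launch ready tasks into free streams. */
        for (int i = 0; i < numStreams && numReady > 0; ++i) {
            if (slots[i] == NULL) {
                slots[i] = ready[--numReady];
                slots[i]->launch(streams[i]);
                ++inFlight;
            }
        }
        /* "Wait for any" by hand: poll for a finished stream. */
        for (int i = 0; i < numStreams; ++i) {
            if (slots[i] != NULL &&
                cudaStreamQuery(streams[i]) == cudaSuccess) {
                Task *done = slots[i];
                slots[i] = NULL;
                --inFlight;
                /* A completed task may unblock its successors. */
                for (int s = 0; s < done->numSucc; ++s)
                    if (--done->succ[s]->unmetDeps == 0)
                        ready[numReady++] = done->succ[s];
            }
        }
        usleep(100);  /* avoid a hot spin loop */
    }
}
```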
So my conclusion for now is to investigate OpenCL, where a dependency graph should work out of the box by constructing an appropriate command queue. Nevertheless, I’m curious whether such a feature will be included in future CUDA versions as well (or whether it is already in CUDA 2.3 and I overlooked it, or can at least be emulated).