Async question: kernels appear to stall host thread

I have been scouring manuals, examples and the postings for an answer to this, so far without success. If anyone knows where I could find the answer that would be great.

I have been trying to execute a set of calculations asynchronously. I have a function that does this:

void Go(int stream)
{
    cudaMemcpyAsync(d_in + (stream * dataPerStream), h_in + (stream * dataPerStream), dataPerStream, cudaMemcpyHostToDevice, streams[stream]);

    // this call appears to block the calling thread until the async memcpy is done
    calc<<<blocks, threads, 0, streams[stream]>>>(d_in + (stream * dataPerStream), d_out + (stream * dataPerStream), stream * dataPerStream);

    cudaMemcpyAsync(h_out + (stream * dataPerStream), d_out + (stream * dataPerStream), dataPerStream, cudaMemcpyDeviceToHost, streams[stream]);
}

The h_ prefix denotes host memory, d_ device memory.

The idea was that the host generates the data and puts it in h_in. I spawn work onto streams to calculate parts of it; the streams run on the GPU asynchronously, so my main host thread can continue its own tasks while the memcpys and calcs go on in the GPU. Then I would simply call cudaStreamSynchronize from the host when I want to wait for a particular batch of calculations to finish, or cudaStreamQuery if I want an update on progress.
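For reference, the launch-and-poll pattern I have in mind would look roughly like this (just a sketch; numStreams is a placeholder, and Go() and streams[] are from the snippet above):

```cuda
// Queue work on several streams, keep the host busy, then poll or wait.
for (int s = 0; s < numStreams; ++s)
    Go(s);  // queues copy -> kernel -> copy on streams[s]

// ... host continues generating data or doing other work here ...

for (int s = 0; s < numStreams; ++s) {
    if (cudaStreamQuery(streams[s]) == cudaSuccess) {
        // batch s has finished; its results are already in h_out
    }
}

// Or block until a particular batch is definitely done:
cudaStreamSynchronize(streams[0]);
```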

The problem I’ve found is that the calc<<< , >>> calls appear to block the calling thread until the async memory transfer from the host to the device is complete. Is this correct? Currently I am working in emulation mode, so maybe it’s just an artifact of that (the card arrives next week). I assumed the calc would be queued until the memory transfer finished, because the manual says kernel launches always happen asynchronously.

If what I am observing is the real behaviour, what is the recommendation for this use case? Should I spawn a separate host thread for each GPU stream and manage the blocking in host code (I hope not)? If I do have to resort to that, is it still advisable to use the cudaMemcpyAsync functions? The manual implies that the async functions gain efficiency because they use non-pageable (pinned) memory, which the GPU can address more effectively. Given that I can afford to use pinned memory, do the async functions also remove load from the host?
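For what it’s worth, page-locked (pinned) host memory is allocated through the runtime rather than malloc; a minimal sketch of what I mean (totalSize and the element type are placeholders):

```cuda
// cudaMemcpyAsync can only overlap with host execution when the host
// buffer is page-locked; allocate it with cudaMallocHost, not malloc.
float *h_in;
cudaMallocHost((void **)&h_in, totalSize * sizeof(float));  // pinned allocation

// ... use h_in with cudaMemcpyAsync as in Go() above ...

cudaFreeHost(h_in);  // pinned memory must be freed with cudaFreeHost
```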

Finally, if I do have to do the multi-threaded rework, do I have to allocate device and host memory per thread? The manual says that CUDA objects can’t be shared between host threads, but I’m not sure if that is limited to timer and stream objects or if it includes memory allocated via the CUDA API.

Many thanks in advance.

Which card are you using?

As far as I know, everything is synchronous in emulation mode, so I think what you are seeing is the calc<<< >>> call blocking the thread (and actually spawning a lot of host threads) until the calculation is finished.
I would say emulation mode is not a good way to test things like this, as the behaviour can be totally different from device mode.

Also be advised that only G92 devices support the asynchronous memory copy.

Thanks for your quick responses. I’ll try it on the actual card; it’s a Tesla. I’ll let you know whether or not it works.

Also, I’ve just coded up the multi-host-thread version just in case. It looks like I can share the memory.

Thanks again!