Async questions: kernels appear to stall host threads

I have been scouring the manuals, examples, and forum postings for an answer to this, so far without success. If anyone knows where I could find the answer, that would be great.

I have been trying to execute a set of calculations asynchronously. I have a function that does this:

void Go(int stream)
{
    // queue the input transfer for this stream's slice of the data
    cudaMemcpyAsync(d_in + (stream * dataPerStream), h_in + (stream * dataPerStream),
                    dataPerStream, cudaMemcpyHostToDevice, streams[stream]);

    // this call appears to block the calling thread until the async memcpy is done.
    calc<<<blocks, threads, 0, streams[stream]>>>(d_in + (stream * dataPerStream),
                                                  d_out + (stream * dataPerStream),
                                                  stream * dataPerStream);

    // queue the result transfer back to the host on the same stream
    cudaMemcpyAsync(h_out + (stream * dataPerStream), d_out + (stream * dataPerStream),
                    dataPerStream, cudaMemcpyDeviceToHost, streams[stream]);
}

The h_ pointers are host memory and the d_ pointers are device memory.

The idea was that the host generates the data and puts it in h_in. I spawn off streams to calculate parts of it. The streams run on the GPU asynchronously, so my main host thread can continue its own tasks while the memcpys and calcs proceed on the GPU. Then I would simply call cudaStreamSynchronize from the host when I want to wait for a particular batch of calculations to finish, or cudaStreamQuery if I want an update on progress.
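
To be concrete, this is roughly how I intend to drive it; the stream count here is just a placeholder for the sketch, and Go() is the function above:

// Illustrative driver; numStreams is a placeholder, Go() is defined above.
const int numStreams = 4;
cudaStream_t streams[numStreams];
for (int i = 0; i < numStreams; ++i)
    cudaStreamCreate(&streams[i]);

// Queue copy + kernel + copy on every stream; these calls should return immediately.
for (int i = 0; i < numStreams; ++i)
    Go(i);

// ... host keeps generating the next batch of data here ...

// Wait for one stream's batch, or poll another without blocking.
cudaStreamSynchronize(streams[0]);
if (cudaStreamQuery(streams[1]) == cudaSuccess) {
    // stream 1 has finished its batch
}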

The problem I’ve found is that the calc<<< , >>> calls appear to block the calling thread until the async memory transfer from host to device is complete. Is this correct? Currently I am working in emulation mode, so maybe it’s just an artifact of that (the card arrives next week). I assumed the calc would simply be queued until the memory transfer finished, because the manual says kernel launches always happen asynchronously.

If what I am observing is the real situation, what is the recommendation for this use case? Should I spawn a separate host thread for each GPU stream and manage the blocking in host code (I hope not)? If I do have to resort to that, is it still advisable to use the cudaMemcpyAsync functions? The manual implies that the async functions gain efficiency because they use non-pageable (pinned) memory, which the GPU can access more effectively. Given that I can afford to use pinned memory, do the async functions also take load off the host?
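
For reference, this is roughly how I’m allocating the buffers so that cudaMemcpyAsync has page-locked host memory to work with (the float element type and totalElems are placeholders for this sketch):

// Rough sketch of the allocations; float and totalElems are placeholders.
float *h_in, *h_out, *d_in, *d_out;
size_t bytes = totalElems * sizeof(float);

cudaMallocHost((void**)&h_in,  bytes);   // page-locked (pinned) host memory
cudaMallocHost((void**)&h_out, bytes);
cudaMalloc((void**)&d_in,  bytes);       // device buffers
cudaMalloc((void**)&d_out, bytes);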

Finally, if I do have to do the multi-threaded rework, do I have to allocate device and host memory per thread? The manual says that CUDA objects can’t be shared between host threads, but I’m not sure whether that is limited to timer and stream objects or whether it includes memory allocated via the CUDA API.
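
If it helps, here is roughly what I imagine the multi-threaded rework would look like, assuming the allocations and streams can be shared across host threads, which is exactly the part I’m unsure about:

// Sketch of the rework (pthreads); assumes streams[], numStreams and Go()
// from above are visible, and that the d_/h_ buffers may be shared across
// host threads -- that sharing is the open question.
#include <pthread.h>

void *worker(void *arg)
{
    int stream = *(int *)arg;
    Go(stream);                             // queue copy + kernel + copy on this stream
    cudaStreamSynchronize(streams[stream]); // block only this host thread
    return NULL;
}

// one host thread per stream
pthread_t tids[numStreams];
int ids[numStreams];
for (int i = 0; i < numStreams; ++i) {
    ids[i] = i;
    pthread_create(&tids[i], NULL, worker, &ids[i]);
}
for (int i = 0; i < numStreams; ++i)
    pthread_join(tids[i], NULL);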

Many thanks in advance.

Which card are you using?

As far as I know, everything in emulation mode is synchronous. So I think what you see is the calc<<< >>> call blocking the thread (and actually spawning a lot of host threads) until the calculation is finished.
I would say emulation mode is not a good way to test things like this, as the behavior can be totally different from running on the actual device.

Also be advised that only G92 devices support the asynchronous memory copy.
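
Once you have the card, you can check this at runtime with cudaGetDeviceProperties; something like this (untested sketch):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    // Query whether device 0 can overlap async memcpy with kernel execution.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (prop.deviceOverlap)
        printf("%s: async memcpy can overlap kernel execution\n", prop.name);
    else
        printf("%s: no copy/kernel overlap on this device\n", prop.name);
    return 0;
}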

Thanks for your quick responses. I’ll try it on the actual card. It’s a Tesla card. I’ll let you know whether or not it works.

Also, I’ve just coded up the multi-host-thread version just in case. It looks like I can share the memory.

Thanks again!