asyncAPI sample question

Here is code from sample “asyncAPI”:


// asynchronously issue work to the GPU (all to stream 0)
CUT_SAFE_CALL( cutStartTimer(timer) );
cudaEventRecord(start, 0);
cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0); // step 1
increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);    // step 2
cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0); // step 3
cudaEventRecord(stop, 0);
CUT_SAFE_CALL( cutStopTimer(timer) );

// have CPU do some work while waiting for stage 1 to finish
unsigned long int counter = 0;
while (cudaEventQuery(stop) == cudaErrorNotReady)
{
    counter++;
}

I wonder: if I have the CPU do something like "a = NULL", or just change some values in the array "a" (or "d_a"), is that possible? And if it is possible, does the change take effect at a random point in GPU time (step 1 / step 2 / step 3)?
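
For concreteness, here is a sketch of the hazard I am asking about (assuming "a" is the pinned host buffer from the snippet above):

cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0); // step 1 returns immediately
a[0] = 42; // this CPU write races with the in-flight copy:
           // d_a may receive the old value or the new one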

Thank you.

Yes, you have to explicitly synchronize CPU/GPU in this code.

If you stick a cudaEventRecord() call after step 1, you can later cudaEventSynchronize() on that event to make sure the CPU can safely operate on the buffer.

Try to avoid spinning on *Query calls, use *Synchronize instead.
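
In code, the suggestion looks something like this sketch (reusing the variables from the snippet above; the event name h2dDone is made up):

cudaEvent_t h2dDone;
cudaEventCreate(&h2dDone);

cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0); // step 1
cudaEventRecord(h2dDone, 0);                                // marks completion of step 1
increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);    // step 2
cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0); // step 3

cudaEventSynchronize(h2dDone); // blocks until the H2D copy has consumed a;
                               // the CPU can now modify a without corrupting step 1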

Really? Aren’t operations assigned to a single stream (the default zero stream in this case) still executed sequentially, not out-of-order?

(Of course one then has to synchronize before reading the contents of the pinned buffer on CPU.)

By the way, one thing I wish the stream abstraction had is the ability to join two streams together, i.e. to make a "main" stream wait for a "branch" one and then destroy the branch stream without blocking CPU execution. The (runtime) API could look like this:

cudaError_t cudaStreamJoin(cudaStream_t main, cudaStream_t branch);
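
Hypothetical usage (kernelA, kernelB, and the launch configurations are placeholders; mainStream and branch would be ordinary streams created with cudaStreamCreate):

kernelA<<<grid, block, 0, branch>>>(d_data);     // work issued to the branch stream
cudaStreamJoin(mainStream, branch);              // proposed: mainStream waits for branch
kernelB<<<grid, block, 0, mainStream>>>(d_data); // would not start until kernelA is done
cudaStreamDestroy(branch);                       // retire the branch without blocking the CPU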

/Pyry

A stream is, by definition, sequential. Multiple streams can execute concurrently, and there is no ordering between multiple streams.

For example: the River (stream) Nile can flow from north to south, while the River (stream) Amazon can flow from south to north.

That’s what I was saying. I was arguing that nwilt gave somewhat misleading advice.

hamqy’s question was about CPU/GPU concurrency. All of the *Async host<->device memcpy calls return immediately, so the app must synchronize CPU access to the host memory participating in such memcpys. The *Synchronize calls are the preferred way to do that synchronization.
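
In other words, something like this (a sketch of the rule, not code from the sample):

cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0); // returns immediately
// a must NOT be read here -- the copy may still be in flight
cudaStreamSynchronize(0); // blocks until all work queued to stream 0 has finished
// a is now safe to read on the CPU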

Streams are more about GPU/GPU concurrency (concurrent memcpy/kernel processing today, possibly other operations in the future). If synchronization between streams becomes important, we will certainly consider adding something like cudaStreamJoin in a future release. Note, though, that cudaStreamJoin would be asynchronous :-)

Naturally!
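
(For reference, later CUDA releases added cudaStreamWaitEvent(), which expresses essentially this kind of asynchronous join via events. A sketch, with kernelA/kernelB, d_data, and the streams as placeholders:

cudaEvent_t branchDone;
cudaEventCreate(&branchDone);
kernelA<<<grid, block, 0, branch>>>(d_data);
cudaEventRecord(branchDone, branch);            // mark the end of the branch's work
cudaStreamWaitEvent(mainStream, branchDone, 0); // mainStream waits on the GPU, not the CPU
kernelB<<<grid, block, 0, mainStream>>>(d_data);)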

I have a related question:

When using multiple GPUs, I guess it is a good idea (or required?) to serve each GPU from a different CPU thread. That much is obvious.
But what about streams, then? Does each serving CPU thread have its own collection of streams, or are streams "global"?

To clarify: without the stream API and with 2 GPUs, I would do something like this:

Thread0:
cudaSetDevice(0);
dostuff();

Thread1:
cudaSetDevice(1);
dootherstuff();

How is the stream API used in this particular case?

Streams are scoped per context, so you do not have to explicitly use the stream API when using multiple GPUs.

Referencing stream 0 in each context facilitates CPU/GPU concurrency (which is especially useful when driving multiple GPUs, for sure).
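
A sketch of what that looks like per thread (worker is a made-up helper; d_buf must have been allocated on device dev, increment_kernel is the one from the sample, and blocks/threads are as before):

void worker(int dev, int *d_buf, int value)
{
    cudaSetDevice(dev);       // bind this host thread to one GPU/context
    cudaStream_t s;
    cudaStreamCreate(&s);     // this stream belongs to dev's context only
    increment_kernel<<<blocks, threads, 0, s>>>(d_buf, value);
    cudaStreamSynchronize(s); // waits only for this device's work
    cudaStreamDestroy(s);
}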

Thanks a lot, nwilt's explanation is very clear.
To seb: I remember a sample called "cudaOpenMP" that deals with multiple CPU threads and devices; it may be helpful.