asyncAPI sample question

Here is code from sample “asyncAPI”:

// asynchronously issue work to the GPU (all to stream 0)
CUT_SAFE_CALL( cutStartTimer(timer) );
cudaEventRecord(start, 0);
cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);   // step 1
increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);      // step 2
cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);   // step 3
cudaEventRecord(stop, 0);
CUT_SAFE_CALL( cutStopTimer(timer) );

// have CPU do some work while waiting for stage 1 to finish
unsigned long int counter = 0;
while( cudaEventQuery(stop) == cudaErrorNotReady )
{
    counter++;
}

I wonder: if I have the CPU do something like “a = NULL”, or just change some value in the array “a” or “d_a”, is that possible? And if it is possible, does it happen at a random point in GPU time (step 1 / step 2 / step 3)?

Thank you.

Yes, you have to explicitly synchronize CPU/GPU in this code.

If you stick a cudaEventRecord() call after step 1), you can later cudaEventSynchronize() on that event to make sure the CPU can operate on the buffer.
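A minimal sketch of that suggestion, reusing the variables from the sample above; the event name `h2d_done` is introduced here purely for illustration:

```cuda
// Record an event right after step 1, then synchronize on it before
// the CPU touches the pinned buffer `a`.
cudaEvent_t h2d_done;
cudaEventCreate(&h2d_done);

cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);   // step 1
cudaEventRecord(h2d_done, 0);                                 // marks end of step 1
increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);      // step 2
cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);   // step 3

// Blocks the CPU only until the host->device copy has finished; as far
// as step 1 is concerned, it is now safe to modify `a`. (Full reuse of
// the buffer still requires waiting for step 3, e.g. on `stop`.)
cudaEventSynchronize(h2d_done);

cudaEventDestroy(h2d_done);
```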

Try to avoid spinning on *Query calls, use *Synchronize instead.

Really? Aren’t operations assigned to a single stream (the default zero stream in this case) still executed sequentially, not out-of-order?

(Of course one then has to synchronize before reading the contents of the pinned buffer on CPU.)

By the way, one thing I wish the stream abstraction had is the ability to join two streams together, i.e. to make a “main” stream wait for a “branch” one and then destroy the branch stream without blocking CPU execution. The (runtime) API could look like this:

cudaError_t cudaStreamJoin(cudaStream_t main, cudaStream_t branch);
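(The declaration above is hypothetical. For what it is worth, something close to its semantics can be sketched with events, assuming a CUDA release that provides cudaStreamWaitEvent; `myStreamJoin` is an invented name:)

```cuda
cudaError_t myStreamJoin(cudaStream_t main, cudaStream_t branch)
{
    cudaEvent_t done;
    cudaError_t err = cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
    if (err != cudaSuccess) return err;

    // Record the branch stream's current tail...
    err = cudaEventRecord(done, branch);
    if (err == cudaSuccess)
        // ...and make `main` wait for it, without blocking the CPU.
        err = cudaStreamWaitEvent(main, done, 0);

    // Safe even while pending: release is deferred until the event completes.
    cudaEventDestroy(done);
    return err;
}
```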


A STREAM by definition is “sequential”. Multiple streams can execute concurrently, and there is no ordering between multiple streams.

For example: the River (Stream) Nile can flow from North to South while the River (Stream) Amazon flows from South to North.

That’s what I was saying. I was arguing that nwilt gave somewhat misleading advice.

hamqy’s question was about CPU/GPU concurrency. All of the *Async host<->device memcpy calls return immediately, so the app must synchronize CPU access to the host memory participating in such memcpy’s. The *Synchronize calls are the preferred way to do that synchronization.
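The pattern described here can be sketched as follows, reusing the sample’s variables; `do_unrelated_cpu_work()` is a placeholder for any host work that does not touch `a`:

```cuda
// Issue the whole pipeline asynchronously, then overlap CPU work with it.
cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);
cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
cudaEventRecord(stop, 0);

do_unrelated_cpu_work();     // must not read or write `a` in here

// Preferred over spinning on cudaEventQuery(stop):
cudaEventSynchronize(stop);
// Only now is `a` guaranteed to hold the incremented results.
```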

Streams are more about GPU/GPU concurrency (concurrent memcpy/kernel processing today, possibly other operations in future). If synchronization between streams becomes important, we will certainly consider adding something like cudaStreamJoin in a future release. Note though that cudaStreamJoin would be asynchronous :-)


I have a related question:

When using multiple GPUs, I guess it is a good idea (or required?) to serve each GPU from a different CPU thread. That much is obvious.
But how about streams, then? Does each serving CPU thread have its own collection of streams, or are streams “global”?

To clarify: without the streaming API and 2 GPUs I would do something like this:



How is the streaming API used in this particular case?

Streams are scoped per context, so you do not have to explicitly use the streaming API when using multiple GPUs.

Referencing stream 0 in each context facilitates CPU/GPU concurrency (which is especially useful when driving multiple GPUs, for sure).
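A sketch of that model, assuming one host thread per GPU; `h_buf`, `nbytes`, `my_kernel`, `blocks`, and `threads` are placeholders, not names from this thread:

```cuda
// Each host thread binds one device, which gives it its own context;
// stream 0 inside that thread refers to that context only.
void *worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);                 // per-thread context for this GPU

    float *d_buf;
    cudaMalloc((void **)&d_buf, nbytes);
    cudaMemcpyAsync(d_buf, h_buf[dev], nbytes,
                    cudaMemcpyHostToDevice, 0);   // stream 0 of *this* context
    my_kernel<<<blocks, threads>>>(d_buf);
    cudaMemcpyAsync(h_buf[dev], d_buf, nbytes,
                    cudaMemcpyDeviceToHost, 0);

    cudaThreadSynchronize();            // waits on this GPU only
    cudaFree(d_buf);
    return NULL;
}
```

Because the contexts are independent, the stream-0 work on each GPU proceeds concurrently with the others.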

Thanks a lot. nwilt’s explanation is very clear.
To seb: I remember a sample called “cudaOpenMP” that deals with multiple CPU threads and devices; it may be helpful.