…
// asynchronously issue work to the GPU (all to stream 0)
CUT_SAFE_CALL( cutStartTimer(timer) );
cudaEventRecord(start, 0);
cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0); // step 1
increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);    // step 2
cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0); // step 3
cudaEventRecord(stop, 0);
CUT_SAFE_CALL( cutStopTimer(timer) );
// have CPU do some work while waiting for stage 1 to finish
unsigned long int counter = 0;
while( cudaEventQuery(stop) == cudaErrorNotReady )
{
    counter++;
}
…
I wonder, if I have the CPU do something like “a = NULL”, or just change some value in the array “a” or “d_a”, is that possible? And if it is possible, does it happen at a random point in GPU time (step 1 / step 2 / step 3)?
Yes, you have to explicitly synchronize CPU/GPU in this code.
If you stick a cudaEventRecord() call after step 1, you can later call cudaEventSynchronize() on that event to make sure the CPU can safely operate on the buffer.
Try to avoid spinning on *Query calls, use *Synchronize instead.
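A minimal sketch of that pattern, reusing the variables from the snippet above (the event name step1done is made up here):

cudaEvent_t step1done;
cudaEventCreate(&step1done);

cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0); // step 1
cudaEventRecord(step1done, 0);  // marks the end of step 1 in stream 0
increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);    // step 2
cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0); // step 3

// ... do unrelated CPU work here ...

// blocks only until the host-to-device copy (step 1) has completed;
// after this the CPU may modify 'a' without corrupting what the GPU
// sees in d_a (note that step 3 will later overwrite 'a' again)
cudaEventSynchronize(step1done);

cudaEventDestroy(step1done);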
Really? Aren’t operations assigned to a single stream (the default zero stream in this case) still executed sequentially, not out-of-order?
(Of course one then has to synchronize before reading the contents of the pinned buffer on CPU.)
By the way, one thing I wish the stream abstraction had is the ability to join two streams together, i.e. to make a “main” stream wait for a “branch” one and then destroy the branch stream without blocking CPU execution. The (runtime) API could look something like this (the signature below is purely hypothetical):
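// hypothetical, not an existing CUDA call: make all work subsequently
// queued on 'mainStream' wait until everything already queued on
// 'branch' has completed, then destroy 'branch'; the call itself
// returns immediately, without blocking the CPU
cudaError_t cudaStreamJoin(cudaStream_t mainStream, cudaStream_t branch);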
hamqy’s question was about CPU/GPU concurrency. All of the *Async host<->device memcpy calls return immediately, so the app must synchronize CPU access to the host memory participating in such memcpy’s. The *Synchronize calls are the preferred way to do that synchronization.
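For the code at the top of the thread, that might look like this (a sketch, assuming the same variables; cudaThreadSynchronize() would work equally well here):

cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0); // step 3

// ... overlap independent CPU work here ...

// block until all work queued in stream 0 (including step 3) has
// completed; only after this may the CPU safely read results from 'a'
cudaStreamSynchronize(0);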
Streams are more about GPU/GPU concurrency (concurrent memcpy/kernel processing today, possibly other operations in future). If synchronization between streams becomes important, we will certainly consider adding something like cudaStreamJoin in a future release. Note though that cudaStreamJoin would be asynchronous :-)
When using multiple GPUs, I guess it is a good idea (or required?) to serve each GPU from a different CPU thread. That much is obvious.
But how about streams, then? Does each serving CPU thread have its own collection of streams, or are streams “global”?
To clarify: without the stream API and with 2 GPUs, I would do something like this:
Thread0:
cudaSetDevice(0);
dostuff();
Thread1:
cudaSetDevice(1);
dootherstuff();
How is the stream API used in this particular case?
Thanks a lot. nwilt’s explanation is very clear.
To seb: I remember a sample called “cudaOpenMP” that deals with multiple CPU threads and devices; it may be helpful.
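A rough sketch of the per-thread pattern that sample uses (the names worker and dev are made up; each thread creates any streams it needs after binding to its device, since a stream belongs to the context it was created in and cannot be shared across threads):

// one CPU thread per GPU; each thread owns its device and its streams
void worker(int dev)
{
    cudaSetDevice(dev);            // bind this CPU thread to one GPU

    cudaStream_t stream;
    cudaStreamCreate(&stream);     // stream lives in this device's context

    // ... queue async memcpys / kernel launches into 'stream' ...

    cudaStreamSynchronize(stream); // wait for this GPU's work to finish
    cudaStreamDestroy(stream);
}

// run worker(0) and worker(1) from two different CPU threads, e.g.
// with OpenMP as in the cudaOpenMP sample:
//   #pragma omp parallel num_threads(2)
//   { worker(omp_get_thread_num()); }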