When I use __syncthreads() in a kernel, only the threads within a block are synchronized; there is no synchronization between one block and another.
But with stream control, we can use cudaThreadSynchronize() to synchronize all blocks. Is that right?
And what if I do not use any stream to launch the kernel (I leave the stream argument blank or pass 0)?
Can cudaThreadSynchronize() still be used to synchronize all of the blocks after the kernel launch? Or do I not need it, because synchronization happens automatically after a kernel launch?
You can’t synchronize all blocks inside a kernel, because a block runs to completion once it starts and is not swapped in and out at partial compute stages. So at any given point, you may have, say, 200 blocks which are finished, 60 blocks which are active and running, and 200 more blocks which have not even begun.
Therefore the proper way to synchronize blocks is to use kernel launches: end the kernel where you need the barrier and launch the next stage as a new kernel. Don’t be scared of this. Launching a kernel has very little overhead, only about 10us or so, so it’s pretty reasonable to launch even a dozen kernels in a row to pipeline the various stages of your algorithm.
cudaThreadSynchronize() is NOT needed for this kind of barrier. It is used to make the CPU wait until all previously launched kernels have finished (later CUDA releases deprecate it in favor of cudaDeviceSynchronize()). The kernels themselves are queued by the device, and you can use the stream API to have finer control over these dependencies.
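To make the "kernel launch as a barrier" idea concrete, here is a minimal sketch of a two-stage sum. The kernel names (blockSums, finalSum), the helper sumOnes and the block size of 256 are made up for illustration, and error checking is left out to keep it short. Both kernels go into the default stream, so the second one only starts after every block of the first has finished.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Stage 1: each block sums its slice of `in` and writes one partial sum.
// __syncthreads() only synchronizes the threads inside this block.
__global__ void blockSums(const int *in, int *partial, int n) {
    __shared__ int cache[256];                 // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Stage 2: runs only after ALL blocks of stage 1 are done, because it is
// a separate launch in the same (default) stream.
__global__ void finalSum(const int *partial, int *out, int numBlocks) {
    int total = 0;
    for (int b = 0; b < numBlocks; ++b) total += partial[b];
    *out = total;
}

// Hypothetical host helper: sums n ones on the GPU and returns the result.
int sumOnes(int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    std::vector<int> host(n, 1);

    int *dIn = nullptr, *dPartial = nullptr, *dOut = nullptr;
    cudaMalloc((void **)&dIn, n * sizeof(int));
    cudaMalloc((void **)&dPartial, blocks * sizeof(int));
    cudaMalloc((void **)&dOut, sizeof(int));
    cudaMemcpy(dIn, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSums<<<blocks, threads>>>(dIn, dPartial, n);  // stage 1
    finalSum<<<1, 1>>>(dPartial, dOut, blocks);        // stage 2, no CPU sync in between

    int result = 0;
    // Blocking copy: returns only after both kernels and the copy are done.
    cudaMemcpy(&result, dOut, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dPartial);
    cudaFree(dOut);
    return result;
}
```

On a machine with a working CUDA device, sumOnes(100000) should come back as 100000. Note that there is no cudaThreadSynchronize() anywhere: the launch boundary orders the two kernels, and the blocking cudaMemcpy makes the CPU wait for the result.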
To your first question: yes, all of kernel 1 will finish before kernel 2 starts, so you can be confident that all of kernel 1’s work has been completed. (This assumes you’re launching both into the same stream. If not, the order they run in is undefined.)
Your second question, “can I use D right away”, is vague. If you want to use the results in D from the CPU, then you will need to copy them to the CPU. cudaMemcpy() will block until kernel 1 and the memory copy are both done, and therefore it’ll work fine.
Alternatively you can queue an async memory copy and then call cudaThreadSynchronize() on the CPU, but for most simple cases that’s likely not necessary.
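For the async variant, a rough sketch might look like the following. The kernel addOne, the helper asyncRoundTrip and the launch sizes are made up for illustration, error checking is omitted, and I use cudaDeviceSynchronize(), which is the name newer toolkits use for what cudaThreadSynchronize() did. The host buffer is pinned with cudaMallocHost(), since a copy from ordinary pageable memory would not actually overlap with CPU work.

```cuda
#include <cuda_runtime.h>

// Trivial kernel for the example: add 1 to every element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical host helper: sends n zeros to the GPU, increments them,
// copies them back asynchronously, and returns how many elements equal 1.
int asyncRoundTrip(int n) {
    const size_t bytes = n * sizeof(int);
    int *host = nullptr, *dev = nullptr;
    cudaMallocHost((void **)&host, bytes);   // pinned host memory
    cudaMalloc((void **)&dev, bytes);
    for (int i = 0; i < n; ++i) host[i] = 0;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All three operations are queued in the same stream, so they run in
    // order on the device while these calls return to the CPU immediately.
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    addOne<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);

    // ... the CPU could do unrelated work here ...

    // Wait until everything queued on the device has finished. Only after
    // this call is it safe to read `host`.
    cudaDeviceSynchronize();

    int ok = 0;
    for (int i = 0; i < n; ++i) ok += (host[i] == 1);

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```

With a working device, asyncRoundTrip(n) should return n. If you only have one stream to wait for, cudaStreamSynchronize(stream) would do the same job without waiting on other streams.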