About Synchronize

Hi, I’m confused about the sync mechanism.

When I use __syncthreads() in a kernel, only the threads within a block are synchronized; between block and block there is no synchronization.

But at the stream level, we can use cudaThreadSynchronize() to wait for all blocks, is that right?

And if I do not use any stream when launching the kernel (leave the stream argument blank or use 0),
can cudaThreadSynchronize() still be used to make sure all of the blocks have finished after the kernel launch? Or do I not need it because everything will be automatically synchronized after the kernel launch?

You can’t make all blocks sync with each other, because blocks run until they complete and are not swapped in and out at partial compute stages. So at any given point you may have, say, 200 blocks which are finished, 60 blocks which are active and running, and 200 more blocks which have not even begun.

Therefore the proper way to synchronize blocks is to use kernel launches. Don’t be scared of this: a kernel launch has very little overhead, only about 10 µs or so, so it’s pretty reasonable to launch even a dozen kernels in a row to pipeline the various stages of your algorithm.

cudaThreadSynchronize() is NOT needed for this kind of barrier; that call is used to synchronize all queued kernel launches with the CPU. The kernel launches themselves are queued and ordered by the device, and you can use the stream API for finer control over these dependencies.
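
For example, here is a minimal sketch of that pattern (the kernel names, sizes, and buffer are made up for illustration, not taken from this thread): the second launch cannot start until every block of the first has finished, so the launch boundary acts as your global barrier.

#include <cuda_runtime.h>

__global__ void stage1(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;          // first pass over the data
}

__global__ void stage2(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;          // second pass, needs all of stage1's results
}

int main() {
    const int n = 1 << 20;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;

    stage1<<<blocks, threads>>>(d_data, n);   // all blocks of stage1 run to completion...
    stage2<<<blocks, threads>>>(d_data, n);   // ...before any block of stage2 starts (same default stream)
    // No explicit synchronization is needed between the two launches;
    // the CPU only has to wait when it actually needs the results.

    cudaFree(d_data);
    return 0;
}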

Thank you.

So,

when the code is like:

kernel1<<<grid, block>>>(S, D);   // S, D are float* device pointers

kernel2<<<grid, block>>>(S, D);

then will all of the blocks of kernel1 be finished before kernel2 launches?

But when the code is like:

kernel1<<<grid, block>>>(S, D);

and then I use the data in D right away, will all of the data be correct at that time?

What I mean is: without kernel2 to synchronize all of kernel1’s blocks.

In your first question, yes, all of kernel 1 will finish before kernel 2 starts, so you can be confident that all of kernel 1’s work has been completed. (This is assuming you’re using the same stream for both. If not, what order they run in is undetermined.)
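
(As an aside, if the two kernels do end up in different streams and kernel2 still needs kernel1’s results, one way to enforce the order is an event. This is only a sketch, reusing grid, block, S, and D from the question above and making up the stream and event names:)

cudaStream_t s1, s2;
cudaEvent_t  kernel1_done;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
cudaEventCreate(&kernel1_done);

kernel1<<<grid, block, 0, s1>>>(S, D);
cudaEventRecord(kernel1_done, s1);          // mark the point in s1 where kernel1 finishes
cudaStreamWaitEvent(s2, kernel1_done, 0);   // s2 will not start kernel2 until that point
kernel2<<<grid, block, 0, s2>>>(S, D);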

Your second question, “can I use D right away”, is vague. If you want to use the results in D from the CPU, then you will need to copy them to the CPU. cudaMemcpy will block until kernel1 and the copy are done, so it will work fine.
Alternatively, you can queue an async memcpy and then call cudaThreadSynchronize() on the CPU, but for most simple cases that’s likely not necessary.
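
A sketch of both patterns, assuming h_D is a host buffer of n floats (those names are made up for the example):

// Pattern 1: blocking copy. cudaMemcpy waits for kernel1 (same default stream)
// and for the copy itself, so h_D is safe to read afterwards.
kernel1<<<grid, block>>>(S, D);
cudaMemcpy(h_D, D, n * sizeof(float), cudaMemcpyDeviceToHost);

// Pattern 2: async copy, then an explicit barrier before the CPU touches h_D.
// (For the copy to really be asynchronous, h_D should be page-locked host memory.)
kernel1<<<grid, block>>>(S, D);
cudaMemcpyAsync(h_D, D, n * sizeof(float), cudaMemcpyDeviceToHost, 0);
cudaThreadSynchronize();   // block the CPU until the kernel and the copy are done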

Thank you so much.

For the 2nd question: if S and D are both device pointers that point to mapped host memory,

then I don’t need to do a memory copy from device to host.

How do I make sure all of the data in the mapped host memory has been calculated and can be used in the following operations?