Global Sync

I’m wondering if there’s a way to synchronize threads not only inside a block but over the entire grid. So I’m looking for a __syncthreads() that syncs all threads currently running.
The obvious solution would be to split the kernel into multiple kernels and launch them sequentially (with cudaThreadSynchronize() in between).
However I would like to accomplish a global sync without returning to the host. Is there a way to do it?

This has been discussed before, and many people have posted code. I will tell you now that trying it is fruitless, and there is a simple reason why. Let's say you have a grid of 500 blocks to run. The device can only run M blocks concurrently, depending on register usage and such, and M << 500. The first M blocks start up, run, and then perform one of the many global memory syncs posted on the forums: they are now waiting on blocks M+1 through 500. But blocks M+1 through 500 are waiting on blocks 1 through M to exit so that there are multiprocessors available to run on. Deadlock.
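To make the failure concrete, here is a minimal sketch (my own names, not code from any particular post) of the kind of atomic-counter barrier that usually gets proposed; it assumes compute capability 1.1 for the global atomics:

    // One-shot global barrier sketch: deadlocks whenever gridDim.x exceeds
    // the number of blocks that can be resident on the device at once.
    __device__ unsigned int g_arrived = 0;

    __device__ void naive_grid_barrier(unsigned int num_blocks)
    {
        __syncthreads();                      // everyone in this block is here
        if (threadIdx.x == 0) {
            atomicAdd(&g_arrived, 1);         // announce this block's arrival
            // Spin until all blocks have arrived. Blocks that were never
            // scheduled onto a multiprocessor can never increment g_arrived,
            // so the resident blocks spin forever: deadlock.
            while (atomicAdd(&g_arrived, 0) < num_blocks)
                ;
        }
        __syncthreads();                      // release the rest of the block
    }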

Of course, you can just run fewer than M blocks, but then you run into the problem that M is really small and you will not be using the device very efficiently.

One note: you do not need a cudaThreadSynchronize() between kernels. You can keep calling kernel after kernel, and the driver will queue them up, always making sure not to start the second until the first has finished.
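For example (step1/step2/step3 and d_data are hypothetical, just to illustrate the queueing):

    float *d_data;
    cudaMalloc((void**)&d_data, N * sizeof(float));

    dim3 grid(500), threads(256);
    step1<<<grid, threads>>>(d_data);   // the driver queues these launches;
    step2<<<grid, threads>>>(d_data);   // step2 will not start until step1
    step3<<<grid, threads>>>(d_data);   // has finished, and so on

    cudaThreadSynchronize();            // only needed when the host must wait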

So that’s what this stuff is all about :)

Well, sorry for asking. And thanks for clarifying it for me. So basically there’s no preemptive block scheduler; it’s a first-block-in, first-block-out thing, or something like that.

Thanks for pointing that out. That actually answers my question because the intention behind it was to avoid cudaThreadSynchronize() busy waiting.

Are cudaMemcpy and cudaMemset calls (and so forth) also queued by the driver?

I haven’t seen the other code, but I think that with enough atomics (and critical sections made out of atomics), you could do it. However, the bigger problem is that I imagine most algorithms that want a grid-wide sync also want all 500 blocks to finish their respective parts of the task first. If it’s a different matter of resource contention or something, though, I think atomics could get it done.
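For what it's worth, here is a rough sketch (my own names, again assuming compute capability 1.1 global atomics) of a critical section built from atomics:

    // A spinlock built from atomicCAS/atomicExch. Have only one thread per
    // block contend for the lock; letting a whole warp spin on it is asking
    // for trouble.
    __device__ int g_lock = 0;                // 0 = free, 1 = held

    __device__ void acquire_lock()
    {
        while (atomicCAS(&g_lock, 0, 1) != 0)
            ;                                 // spin until we swap 0 -> 1
    }

    __device__ void release_lock()
    {
        atomicExch(&g_lock, 0);               // atomic write releases the lock
    }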

MisterAnderson, are you 100% sure there’s never any overlap in execution? I don’t have evidence that there is, but overlapping would be the more efficient design choice.

Unfortunately, the queue will eventually fill up, and there will still be a busy wait in there somewhere. But removing the synchronize after every kernel still improves performance significantly. Somewhere in the programming guide (I think in the asynchronous launch section) it documents in more detail what is asynchronous and what is not. IIRC, a cudaMemcpy to or from the host adds an implicit cudaThreadSynchronize(). I'm not sure about device-to-device copies.
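So host code like this (kernel, d_out, and h_out are hypothetical) needs no explicit synchronize before reading the result:

    // The blocking cudaMemcpy cannot return until the preceding launch has
    // finished, so it acts as an implicit cudaThreadSynchronize().
    kernel<<<grid, threads>>>(d_out);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    // h_out is now valid on the host here.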

It has been said in another thread that CUDA 1.1 will improve its ability to overlap even more.

Am I 100% sure that there is no overlap? No, I didn’t design the device, so I can’t be certain. Here are the things that lead me to the no-overlap conclusion, though:

1) The guide says: “A block is processed by only one multiprocessor, so that the shared memory space resides in the on-chip shared memory leading to very fast memory accesses.”

This sentence implies to me that a block exists on that multiprocessor for its entire lifetime, contiguously. And all the later sentences talk about concurrent blocks within the limits of shared memory and register usage.

2) If a block were to be “swapped out”, its shared memory and registers would need to be dumped to global memory somewhere, then read back in when the block is “swapped in”. This would take a big chunk out of the available memory bandwidth. A memory-bound kernel with 10,000 blocks can easily approach the memory bandwidth limit of the device, so there is no “missing” bandwidth being spent on swapping.

3) In one of the earlier threads on block synchronization, the OP’s tests worked when the number of blocks launched was <= M and deadlocked when it was > M (I can dig up the post if you really want me to). That test is pretty conclusive to me.

Aha! But just because blocks have to run in one piece doesn’t mean that blocks from different kernel launches can’t mix (across multiprocessors, or on the same one).

Oh, that is the overlap you were referring to. Sorry, I was confused.

The guide doesn’t come right out and define how launches are asynchronous, but I did find this:
“Note that the texture cache is not kept coherent with respect to global memory accesses, so the global memory ranges that a kernel is operating on must not overlap with memory accessed by the texturing hardware. This restriction only applies within a given kernel launch, however; separate kernel launches may freely intermix writing to device memory and reading from the same device memory via texture, provided the device memory ranges do not overlap during a launch.”

That seems to imply that one entire kernel launch completes before the next one begins.

Correct, the driver waits until kernel completion and full propagation of global store requests.