What is the best way to communicate/synchronize a small amount of data (<1KB or even smaller) across CTAs? Are there any docs explaining those cross-CTA synchronization instructions?
The only cross CTA synchronization is the cooperative groups grid sync. It’s documented in the programming guide, there are sample codes, a blog, and various questions on various forums.
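For reference, a minimal sketch of what the grid sync looks like in device code; the kernel name is illustrative, and the host side must launch it cooperatively (e.g. via cudaLaunchCooperativeKernel) on a device that supports cooperative launch:

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void two_phase_kernel(float *data)
{
    cg::grid_group grid = cg::this_grid();
    // phase 1: each CTA writes its portion of data
    // ...
    grid.sync();   // every CTA in the grid must arrive here before any proceeds
    // phase 2: any CTA may now safely read what other CTAs wrote in phase 1
    // ...
}
```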
I have no doubt that there is potentially some “performance cost” to it. If you need to synchronize every block in the grid, then it’s the right tool, I believe (and therefore mentioning “performance cost” might be comparing apples and oranges).
If you only need to communicate data from one threadblock to one other, it might be seen as “overkill”. However, there isn’t anything more granular that I know of; you are probably in a roll-your-own situation, perhaps using atomics, volatile, __threadfence(), etc. And for anything to be reliable for threadblock-to-threadblock communication, you need a means to ensure that both threadblocks are resident/instruction-issuable on SM(s), either simultaneously, or else with some kind of defined/controlled ordering.
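Purely as an illustration (not a recommendation), a roll-your-own hand-off between two CTAs might look something like the sketch below. It assumes both CTAs are co-resident, and the names payload and flag are placeholders:

```cpp
#include <cstdio>

__device__ int payload;    // the data being communicated
__device__ int flag = 0;   // 0 = not ready, 1 = ready

__global__ void handoff()
{
    if (blockIdx.x == 0) {                        // producer CTA
        if (threadIdx.x == 0) {
            payload = 42;                         // write the data
            __threadfence();                      // make it visible device-wide
            atomicExch(&flag, 1);                 // only then publish the flag
        }
    } else if (blockIdx.x == 1) {                 // consumer CTA
        if (threadIdx.x == 0) {
            while (atomicAdd(&flag, 0) == 0) { }  // poll; deadlocks if blocks aren't co-resident
            __threadfence();                      // order the flag read before the data read
            printf("consumer read %d\n", payload);
        }
    }
}
```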
If you only need to communicate data from one threadblock to one other, I guess another suggestion might be to see if you can avoid it via algorithm/work refactoring. A __syncthreads()
is going to be lightweight compared to any of the previous discussion.
I guess, for completeness, I could also mention that the kernel launch boundary is a grid synchronization point. I assume that is not what is being asked here (i.e. break your code into two kernels and launch the first, then the second; the launch boundary acts as a “grid sync point” of sorts. CTAs in the first launch can send data to CTAs in the second launch just by writing that data to global memory.)
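Something like this sketch (kernel and buffer names are placeholders) illustrates that idea:

```cpp
__global__ void kernel_a(float *buf)
{
    buf[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;  // any CTA writes its data to global memory
}

__global__ void kernel_b(const float *buf)
{
    float v = buf[0];   // any CTA of the second launch can read anything kernel_a wrote
    (void)v;
}

void run(float *d_buf, int nblocks, int nthreads)
{
    kernel_a<<<nblocks, nthreads>>>(d_buf);
    kernel_b<<<nblocks, nthreads>>>(d_buf);  // same stream: b only starts after a completes
}
```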
Thank you for your quick reply, Robert. cg.sync seems to be interesting, though it might not be what I need, as it syncs among all CTAs in a grid, as you mentioned.
I have seen instructions like release/acquire in the PTX ISA :: CUDA Toolkit Documentation
But I have not found documentation explaining how to use them. Are you aware of any?
I can also use atomics to synchronize, which might be easier, though I was assuming that to be expensive, as it needs to keep polling global memory to achieve the synchronization I want. But maybe that is the best way in the GPU world, as that is the best the scheduler can do?
To add to the atomic solution, will there be deadlocks?
If there are many warps and CTAs and not all of them can be scheduled at the same time, will the scheduler keep processing the already-scheduled warps until they are finished (but they will never finish, as they can be waiting for warps/CTAs that are not currently scheduled)?
No, I’m not. I mostly focus on C++ myself. This may be of interest.
Yes, there could be, if you make incorrect assumptions. I intended to indicate the possibility/necessity here:
Thanks for your suggestions. I think controlled ordering is achievable. Are there any instructions on how to control the order in which threadblocks are scheduled? Are they scheduled based on threadblock IDs (from low to high)?
Thanks again
No, there is no specification for CUDA threadblock scheduling. A general rule for CUDA is that CUDA provides no guarantees of thread execution order at any granularity (warp, CTA, grid), which covers the threadblock case. Therefore it is not guaranteed that they are scheduled from “low to high”. Furthermore, CUDA exposes no controls over the block scheduler (other than the stream priority mechanism and MPS, which are not relevant here, at a minimum because they involve blocks from separate kernels, and in addition they give no fine-grained ordering).
You can take control of threadblock scheduling yourself (and in so doing impose order) via a mechanism built on top of atomics (and perhaps, optionally, the PTX smid register). This is not an exact recipe, but it contains all the building blocks.
To take control of threadblock ordering yourself, the starting point is probably to dispense with use of the blockIdx variables and establish a global integer location initialized to zero. Each block starts by doing an atomicAdd of 1 to the global location, using the function return value as its global block number. In this way it is guaranteed that, according to this global block number, the block scheduling order will be 0, 1, 2, 3, ... etc. If block number 3 observes that it is scheduled, it can be certain that block 2 has already been scheduled.
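A minimal sketch of that ticket-taking idea might look like this (the name block_counter is illustrative, and it must be zeroed from the host before launch, e.g. via cudaMemcpyToSymbol):

```cpp
__device__ unsigned int block_counter;   // zero this from the host before each launch

__global__ void ordered_blocks(float *data)
{
    __shared__ unsigned int my_block;             // this block's "scheduling order" number
    if (threadIdx.x == 0)
        my_block = atomicAdd(&block_counter, 1);  // blocks receive 0,1,2,... in the order they start
    __syncthreads();
    // use my_block instead of blockIdx.x from here on;
    // if my_block == 3, blocks 0, 1 and 2 have certainly been scheduled already
}
```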
That does not guarantee co-residency, however. The usual method to guarantee co-residency is covered in the cooperative groups documentation (already linked above). Basically, query the device you are running on to discover the maximum number of blocks that can be co-resident, then limit your kernel launch to that. There are important details here, as an optimal job requires occupancy consideration. Since your kernel will have an “arbitrary” number of blocks in this strategy, a usual kernel design method for this could be the grid-stride loop design, to decouple the dimension of the work from the dimension of the grid.
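A hedged sketch of that sizing step, using the occupancy API together with a grid-stride loop (the kernel and names are illustrative):

```cpp
__global__ void my_kernel(float *data, int n)
{
    // grid-stride loop: the work size n is decoupled from the grid size
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        data[i] *= 2.0f;
}

void launch(float *d_data, int n)
{
    int device = 0, num_sm = 0, blocks_per_sm = 0, threads = 256;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&num_sm, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, my_kernel, threads, 0);
    int max_resident_blocks = num_sm * blocks_per_sm;   // all of these can be co-resident
    my_kernel<<<max_resident_blocks, threads>>>(d_data, n);
}
```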
Thank you, Robert. It is a clever idea to use atomics to achieve the order.
Will a scheduled block be de-scheduled for any reason? If that can happen, then this atomic solution might still fail. I assume not?
Threadblocks can be pre-empted. It’s an unusual case as far as I know, but not precluded. The usual cases for threadblock preemption possibility that I know of include debugging, cuda/graphics interoperation on the same GPU, and CUDA dynamic parallelism. People who are interested in these kinds of ideas would usually avoid those cases. You cannot have everything that is possible, all at the same time.
If you’re concerned about any of this, my suggestion would be to avoid these “roll your own” ideas, and instead go with the provided mechanisms: e.g. cooperative groups grid sync, refactoring within the CTA so as to be able to use __syncthreads() (i.e. avoid inter-CTA communication), or the grid launch itself, as already discussed.
Thanks again. This is very useful!