I am almost certain there was a discussion relating to this topic at some point, but I am unable to find it anymore.
My question regards the scheduling of blocks/CTAs. We all know that the CUDA Programming Guide explicitly says no block execution order is guaranteed. However, let's say I want my CTAs to interact at the global memory level. Normally such dependencies are a bad idea, since we don't know whether the interacting CTAs will be scheduled at the same time.

But suppose I have 30 multiprocessors (SMs), as on a GeForce GTX 280, and I launch exactly 30 CTAs. In practice (in my experience) this works just fine, and all 30 CTAs interact correctly through global memory. Is there any citation out there stating that the 30 CTAs will be dispersed across the 30 SMs, rather than, say, all 30 being scheduled on a single SM?
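For what it's worth, here is a minimal sketch (my own illustration, not anything from the Programming Guide) of how one could check this empirically: each block reads the %smid PTX special register and records it, so after the launch you can see which SM each CTA actually landed on. The kernel name record_smid and the array sm_of_block are just placeholders I made up.

#include <cstdio>
#include <cuda_runtime.h>

// Read the SM index this thread is currently executing on.
__device__ unsigned int get_smid(void)
{
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// One thread per block writes out the SM its CTA was placed on.
__global__ void record_smid(unsigned int *sm_of_block)
{
    if (threadIdx.x == 0)
        sm_of_block[blockIdx.x] = get_smid();
}

int main(void)
{
    const int num_blocks = 30;  // one CTA per SM on a 30-SM part like the GTX 280
    unsigned int *d_sm;
    unsigned int h_sm[num_blocks];

    cudaMalloc(&d_sm, num_blocks * sizeof(unsigned int));
    record_smid<<<num_blocks, 128>>>(d_sm);
    cudaMemcpy(h_sm, d_sm, num_blocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    for (int b = 0; b < num_blocks; ++b)
        printf("block %2d ran on SM %u\n", b, h_sm[b]);

    cudaFree(d_sm);
    return 0;
}

Of course, this only tells you what happened on a particular launch, on a particular card and driver; it is not a guarantee of the placement policy, which is really what I am asking about.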