Block/CTA Scheduling

Hi All,

I am almost certain there was a discussion at least tangentially related to this topic once upon a time, but I can no longer find it.

My question regards the scheduling of Blocks/CTAs. We all know the CUDA Programming Guide explicitly says no block order is guaranteed. However, let's say I want my CTAs to interact at the global memory level. Normally such dependencies are a bad idea, since we don't know whether the interacting CTAs will be scheduled at the same time. But suppose I have 30 multiprocessors (SMs) (like a GeForce GTX 280) and I launch 30 CTAs. In practice (in my experience) this works just fine, and all 30 CTAs interact correctly at the global memory level. Is there any citation out there stating that 30 CTAs will be dispersed across 30 SMs, rather than all 30 CTAs being scheduled on a single SM?

Thanks!

There's no citation, but we experimented too and found that the CTAs do get distributed across the 30 SMs.

You can force this behaviour by increasing your kernel's (dynamic) shared memory request so that an SM can hold only one block at a time.

You can also check this thread: http://forums.nvidia.com/index.php?showtopic=59537&mode=threaded&pid=325912
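A minimal sketch of that trick (kernel and variable names are illustrative, not from the thread). A GT200-class SM has 16 KB of shared memory, so requesting more than half of it at launch time means a second block cannot co-reside on the same SM:

```cuda
#include <cstdio>

// Illustrative kernel: the dynamic shared-memory allocation is what
// limits occupancy, not anything the kernel body does with it.
__global__ void oneBlockPerSM(int *out)
{
    extern __shared__ int buf[];   // sized by the launch configuration
    buf[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];
}

int main()
{
    const int numBlocks = 30;      // one per SM on a GTX 280
    int *d_out;
    cudaMalloc(&d_out, numBlocks * sizeof(int));

    // Request > 8 KB of the SM's 16 KB of shared memory, so two
    // blocks cannot fit on one SM at the same time.
    size_t smemBytes = 9 * 1024;
    oneBlockPerSM<<<numBlocks, 128, smemBytes>>>(d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```

The same idea works with any per-SM resource (registers, threads): over-request it until only one block fits.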

Are you asking if it’s safe to rely on this behavior? It is not.

Well, since all blocks are identical, the scheduler should distribute them across the SMs one-to-one. Moreover, one SM can run up to 8 blocks concurrently if the sum of their resource requirements fits within the per-SM limits. On my Fermi card I tested global synchronization: I have 7 SMs, and I could synchronize 7 * 8 = 56 blocks without deadlock. But with 57 blocks or more, a deadlock occurred, even when the blocks used minimal resources.

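For reference, a global barrier along those lines can be sketched as below (names are illustrative; this is a single-use barrier). It only works if every block is resident on the GPU at once: blocks that have arrived spin on their SMs, so if the grid exceeds what the device can co-schedule, the remaining blocks are never launched and the spin loop never exits, which is exactly the deadlock described above.

```cuda
// Single-use global barrier built on an atomic counter.
// Assumes all blocks in the grid are simultaneously resident.
__device__ unsigned int arrived = 0;

__device__ void globalBarrier(unsigned int numBlocks)
{
    __syncthreads();                   // whole block reaches the barrier
    if (threadIdx.x == 0) {
        atomicAdd(&arrived, 1);        // signal this block's arrival
        // Busy-wait until every block has checked in. If some blocks
        // were never scheduled, this loop spins forever (deadlock).
        while (atomicAdd(&arrived, 0) < numBlocks) { }
    }
    __syncthreads();                   // release the rest of the block
}
```

The `atomicAdd(&arrived, 0)` in the loop is a device-wide atomic read; a plain load could be cached and never observe the other blocks' increments.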