Block/CTA Scheduling

Hi All,

I am almost certain there was a discussion at least tangentially related to this topic once upon a time, but I can no longer find it.

My question regards the scheduling of Blocks/CTAs. We all know the CUDA Programming Guide explicitly says no block order is guaranteed. However, let's say I want my CTAs to interact at the global memory level. Normally such dependencies are a bad idea, since we don't know whether the interacting CTAs will be scheduled at the same time. But suppose I have 30 multiprocessors (SMs) (like a GeForce GTX 280) and I launch 30 CTAs. In practice (in my experience) this works just fine, and all 30 CTAs interact correctly at the global memory level. Is there any citation out there stating that 30 CTAs will be dispersed across 30 SMs, rather than all 30 CTAs being scheduled on a single SM?

Thanks!

There's no citation, but we experimented too and found that the CTAs do get distributed across the 30 SMs.

You can force this behaviour by increasing your kernel's (dynamic) shared memory request so that an SM can hold only one block at a time.

You can also check this thread: http://forums.nvidia.com/index.php?showtopic=59537&mode=threaded&pid=325912
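A minimal sketch of that trick (kernel and variable names are illustrative, not from the thread). A GT200-class SM has 16 KB of shared memory, so requesting more than half of it at launch time means a second block cannot co-reside on the same SM:

```cuda
#include <cstdio>

// Illustrative kernel: the dynamic shared-memory allocation is what
// limits occupancy, not anything the kernel body does with it.
__global__ void oneBlockPerSM(int *out)
{
    extern __shared__ int buf[];   // sized by the launch configuration
    buf[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];
}

int main()
{
    const int numBlocks = 30;      // one per SM on a GTX 280
    int *d_out;
    cudaMalloc(&d_out, numBlocks * sizeof(int));

    // Request > 8 KB of the SM's 16 KB of shared memory, so two
    // blocks cannot fit on one SM at the same time.
    size_t smemBytes = 9 * 1024;
    oneBlockPerSM<<<numBlocks, 128, smemBytes>>>(d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}
```

The same idea works with any per-SM resource (registers, threads): over-request it until only one block fits.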

Are you asking if it’s safe to rely on this behavior? It is not.

Well, since all blocks are identical, the scheduler should distribute them across the SMs one-to-one. Moreover, one SM can run up to 8 blocks concurrently if the sum of their resource requirements fits within the per-SM limits. On my Fermi card I tested global synchronization: I have 7 SMs, and I could synchronize 7 * 8 = 56 blocks without deadlock. But with 57 blocks or more, a deadlock occurred, even when the blocks used minimal resources.

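For reference, a global barrier along those lines can be sketched as below (names are illustrative; this is a single-use barrier). It only works if every block is resident on the GPU at once: blocks that have arrived spin on their SMs, so if the grid exceeds what the device can co-schedule, the remaining blocks are never launched and the spin loop never exits, which is exactly the deadlock described above.

```cuda
// Single-use global barrier built on an atomic counter.
// Assumes all blocks in the grid are simultaneously resident.
__device__ unsigned int arrived = 0;

__device__ void globalBarrier(unsigned int numBlocks)
{
    __syncthreads();                   // whole block reaches the barrier
    if (threadIdx.x == 0) {
        atomicAdd(&arrived, 1);        // signal this block's arrival
        // Busy-wait until every block has checked in. If some blocks
        // were never scheduled, this loop spins forever (deadlock).
        while (atomicAdd(&arrived, 0) < numBlocks) { }
    }
    __syncthreads();                   // release the rest of the block
}
```

The `atomicAdd(&arrived, 0)` in the loop is a device-wide atomic read; a plain load could be cached and never observe the other blocks' increments.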