So the total number of threads should be 16384 (32 × 512). By launching this kernel, would all of the kernel's threads be resident or active on the SMs? I think resident and active are different things; is that correct? So maybe all are resident but only some are active? I would appreciate it if you could make this clear to me.
Do I need 1 SM or more? Can I put all of them on 1 SM and save the other resources?
What happens if I launch 32 instances of this kernel simultaneously?
In a part of the CUDA documents I am reading:
the GV100 SM provides 64 FP32 cores and 32 FP64 cores. The GV100 SM additionally includes 64 INT32 cores and 8 mixed-precision Tensor Cores. GV100 provides up to 84 SMs.
So here, for performing float operations, how many SMs do I have? Are SM and core the same thing? (84 SMs vs. 64 FP32 cores)
| | GV100 | GA100 |
| --- | --- | --- |
| Streaming Multiprocessors (SMs) | 80/84 | 108/128 |
| Max Thread Blocks Per SM | 32 | 32 |
| Max Warps Per SM | 64 | 64 |
| Max Threads per Thread Block | 1024 | 1024 |
The language provides no guarantees that work will be co-resident. Resident and active are interchangeable as they relate to thread blocks (CTAs), warps, and threads. The term active means that the scheduled entity has all of its resources allocated and has been assigned to a warp scheduler on a Streaming Multiprocessor (SM).
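If it helps, here is a minimal sketch (not from the original post) that queries the per-SM residency limits from the table above through the CUDA runtime API; the maxBlocksPerMultiProcessor field assumes CUDA 11 or newer:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max blocks per SM:     %d\n", prop.maxBlocksPerMultiProcessor);  // CUDA 11+
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:             %d\n", prop.warpSize);
    return 0;
}
```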
The language does not specify. In practice the compute work distributor will distribute thread blocks (CTAs) round-robin to the streaming multiprocessors (SMs). Given a {32,1,1} grid, only 32 of the 80 (GV100) or 108 (GA100) SMs will be assigned a thread block. For the SMs that are assigned a thread block, the {512,1,1} threads will be rasterized into 16 warps and assigned round-robin to the SM sub-partitions (warp schedulers).
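As a concrete sketch of that launch shape (the kernel body is a placeholder, since the original post does not show one):

```
#include <cuda_runtime.h>

// Placeholder kernel; the real kernel body was not posted.
__global__ void dummyKernel() {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id, 0..16383
    (void)idx;
}

int main() {
    // {32,1,1} grid of {512,1,1} blocks: 32 * 512 = 16384 threads.
    // Each block is rasterized into 512 / 32 = 16 warps, distributed
    // round-robin across the SM's warp schedulers.
    dummyKernel<<<32, 512>>>();
    cudaDeviceSynchronize();
    return 0;
}
```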
Launching 32 grids of the same kernel in the same stream, the grids will execute in order; there is no concurrency between grids. Only 32 of the 80 or 108 SMs will be active at any time. The other SMs will be idle.
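For example, the same-stream case looks like this (reusing the placeholder kernel from the sketch above):

```
#include <cuda_runtime.h>

__global__ void dummyKernel() {}  // placeholder, as above

int main() {
    // All 32 launches go into the default stream, so they execute back
    // to back: grid i+1 cannot start until grid i has drained.
    for (int i = 0; i < 32; ++i) {
        dummyKernel<<<32, 512>>>();
    }
    cudaDeviceSynchronize();
    return 0;
}
```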
Launching 32 grids of the same kernel in different streams will result in concurrent execution. You have not provided enough details to determine the warp occupancy. For a simple kernel (no shared memory, a small number of registers per thread), thread blocks will be distributed from the first grid, then the second grid, … until the GPU is saturated. As thread blocks complete, the compute work distributor will distribute new work to the SMs. There is no guarantee on the order in which streams will be scheduled on the GPU.
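A sketch of the concurrent variant, again with a placeholder kernel, using one non-default stream per grid:

```
#include <cuda_runtime.h>

__global__ void dummyKernel() {}  // placeholder, as above

int main() {
    cudaStream_t streams[32];
    for (int i = 0; i < 32; ++i)
        cudaStreamCreate(&streams[i]);

    // One grid per stream; the compute work distributor may overlap them.
    for (int i = 0; i < 32; ++i)
        dummyKernel<<<32, 512, 0, streams[i]>>>();  // 0 bytes dynamic shared memory

    cudaDeviceSynchronize();
    for (int i = 0; i < 32; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```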
The term “core” in the CUDA manual refers to an instruction execution pipeline, not to a “CPU core”. At a first pass you are interested in making sure each SM has sufficient warps (threads) to hide latency. This often requires more than 1024 threads per SM if the kernel is latency bound. If the kernel is compute bound, only 256–512 threads per SM may be required.
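To check where a given kernel lands, you can ask the runtime occupancy API how many of its blocks can be resident per SM. A sketch, again using a placeholder kernel:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() {}  // placeholder, as in the sketches above

int main() {
    // Ask the runtime how many blocks of this kernel fit per SM at
    // 512 threads per block with no dynamic shared memory.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, dummyKernel, /*blockSize=*/512, /*dynamicSMemSize=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int warpsPerSM    = blocksPerSM * (512 / prop.warpSize);      // resident warps
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Resident blocks/SM: %d (%d of %d warps)\n",
           blocksPerSM, warpsPerSM, maxWarpsPerSM);
    return 0;
}
```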