Max threads/blocks

Hi,

So I’ve just started taking the Getting Started with Accelerated Computing in CUDA C/C++ course and have completed the first section--

But I had a question regarding the max threads / blocks that doesn’t seem to be mentioned.

I mean I can understand if convention says the max threads you can have per block is 1024-- But what about the max number of blocks ? There seems to be no mention of this.

Or, what I’m getting at: some cards have way more CUDA cores than others, so this must figure in, somehow, when determining the number of blocks available, no ?

If not, then okay, but if so, how do you query the card to know how many blocks you have to work with ?

The limitations per GPU architecture are listed in Table 21 of the programming guide: CUDA C++ Programming Guide

The maximum number of thread blocks for a kernel is (2^31 - 1) * 65535 * 65535.


Table 21 in section 16 of the CUDA Programming Guide lists the maxima for each GPU generation. The maximum number of blocks in a grid is independent of the number of SMs and thus the number of “CUDA cores”.
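For reference, a minimal sketch (not from the course) of how to query these limits at runtime with cudaGetDeviceProperties; it also prints the SM count, which, as noted, does not affect the grid limits:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size        : %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Number of SMs        : %d\n", prop.multiProcessorCount);
    return 0;
}
```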

@njuffa thanks. Hmm, that does sound a little weird though-- So are the blocks and threads more of a ‘programming abstraction’ than a hardware one ?

The grid limits listed in the Programming Guide (generally, (2^31 - 1) · (2^16 - 1) · (2^16 - 1), as noted by @striker159) are hard limits imposed by the hardware. But the hardware itself is built to support some abstraction: a grid of thread blocks is scheduled onto however many SMs are available.
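To make that concrete, here is a small illustrative example (my own, not from the guide): the launch below creates tens of thousands of blocks regardless of whether the GPU has 10 SMs or 100; the hardware simply works through the grid with whatever SMs it has.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) x[i] *= 2.0f;                        // each thread handles one element
}

int main() {
    const int n = 1 << 24;                    // 16M elements (arbitrary size)
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 65536 blocks, far more than any SM count
    scale<<<blocks, threadsPerBlock>>>(x, n);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);              // expect 2.0
    cudaFree(x);
    return 0;
}
```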

Okay, thank you both for your help. I believe SMs are covered in the second section, so I am not there yet.

I’ll reflect back if I am still a little confused (I already have a couple of other questions in my mind, but let’s see if they are covered there first).

Since threads within a block are running (or at least resident on the SM) at the same time, whereas blocks can be executed serially without any ordering guarantees, threads/block is more of an actual hardware limitation and blocks/grid is more of an abstraction.
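A small sketch of the practical difference (my own illustration): threads of one block are co-resident, so they can cooperate through shared memory and __syncthreads(); there is no comparable barrier across the blocks of a grid in a plain kernel launch.

```
#include <cstdio>
#include <cuda_runtime.h>

// Each block sums its own 256 input elements using shared memory.
// __syncthreads() only synchronizes the threads of this one block.
__global__ void blockSum(const float *in, float *blockResults) {
    __shared__ float partial[256];                     // assumes blockDim.x == 256
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                                   // barrier for this block only

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockResults[blockIdx.x] = partial[0];  // one partial sum per block
}

int main() {
    const int threads = 256, blocks = 1024, n = threads * blocks;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    blockSum<<<blocks, threads>>>(in, out);
    cudaDeviceSynchronize();

    printf("block 0 sum = %f (expected %d)\n", out[0], threads);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```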


Hey guys (@njuffa, @striker159, @Curefab),

So I’ve been able to make it through the second section on SMs now, and their role in the block/thread picture is much clearer.

Still, one related question-- Given the SM limit, obviously only so many blocks (and their threads) can run at once. Those not running, I am presuming, are stored in some sort of ‘queue’.

Is this an online or offline queue ? Or I guess what I mean is are the relevant operations (and related data) resident in memory somewhere (either on GPU/CPU) before they are run? Or is all this fetched as the blocks come out of the queue and head to the SM(s)?

I am kind of wondering with regards to memory constraint concerns for particularly large operations.

There are four states of the blocks and kernels (not sure about the exact terminology):

  1. If you use CUDA streams (and in some cases with the default stream), then kernels and other asynchronous API calls themselves are queued until the kernel is launched (see the sketch at the end of this post).

  2. As soon as a kernel launches, some blocks are loaded onto the SMs; the other blocks are kept in a queue. Those waiting blocks are distinguished only by their block index (x, y, z position within the grid). No additional memory is reserved for them (unless you reserve it manually).

  3. The loaded blocks are called resident; their constituent warps are loaded into SM partitions (typically 4 partitions per SM). The scheduler switches quickly between all resident warps. Only a fraction (e.g. 1/8) is selected each cycle, but all can be seen as running, similar in concept to a multitasking operating system running several processes.

  4. Some of those resident warps are selected each cycle (one per scheduler) and their instructions are dispatched to execution units. Those execution units typically have a pipeline design: each cycle a new workload can be inserted, but it takes several cycles until a result is ready. During that time the scheduler switches to different warps. That means computations are done for several warps at the same time (in different stages of the computation pipelines).

The resident blocks and warps (3. and 4.) need resources, e.g. memory, but also internal HW resources.
The blocks waiting to start (1. and 2.) only need such resources, e.g. memory, if you reserve them manually. You could reserve memory for buffer space (though buffers only need to be reserved for resident blocks) or for kernel results, e.g. a filter output.
If the output of the kernel is too large, then you have to rethink your algorithm to process the output in parts instead.
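Here is the sketch mentioned above (names and sizes are mine, just for illustration): the host queues work into a stream (state 1), and once the kernel launches, blocks that are not yet resident exist only as indices in the grid (state 2), with no per-block memory reserved for them.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    // Which chunk of work this block owns is derived purely from blockIdx;
    // blocks that have not started yet are nothing more than this index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

int main() {
    const int n = 1 << 26;                       // ~262144 blocks of 256 threads
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // State 1: these calls return immediately; the operations wait in the stream's queue.
    cudaMemsetAsync(d_data, 0, n * sizeof(float), stream);
    process<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);  // state 2 begins once this launches

    cudaStreamSynchronize(stream);               // host waits here for the queued work

    float first;
    cudaMemcpy(&first, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("data[0] = %f (expected 1.0)\n", first);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```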


Thanks everyone for all the help.

So I’ve been able to successfully complete and pass this course.

I guess there is just one remaining related question that is lingering-- When asynchronous streams are covered, it is mentioned that kernel calls run in order within at least any one given stream-- By this, do they mean each call to a particular kernel as a whole (which is what I am guessing), or the work within a single kernel call (i.e. its blocks and threads)-- My guess then is that this is why it is really important to take care to note the current threadIdx and blockIdx to know ‘where you are’.

Similarly, though usefully the default stream is always blocking, in custom streams, or even within the call of a single kernel instance with multiple blocks, it seems like you could run into all sorts of crazy race conditions(?)

A kernel call within a stream is always one stream operation. All of its blocks and threads finish before the stream continues.
It does not matter if the same or a different kernel is called as the next operation in the stream.

Within a kernel call the blocks and their threads run asynchronously, and you can possibly get all kinds of crazy race conditions, yes.

That is why you try to make blocks as independent as possible in your algorithm.

The same applies to a lesser degree to warps, and to the least degree to the threads within a warp.

And where you have to share work or data, or reconfigure which thread is responsible for which data packet (which can make sense even within a kernel), you use synchronization primitives.
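A rough sketch of both points (my own example, with made-up kernel names): two kernels issued to the same stream run strictly in order, so consume() only starts after every block of produce() has finished; within consume(), the cross-block accumulation uses atomicAdd to avoid a race.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void produce(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 1;
}

__global__ void consume(const int *data, int *total, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, data[i]);   // safe accumulation across all blocks
}

int main() {
    const int n = 1 << 20;
    int *d_data, *d_total, h_total = 0;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemset(d_total, 0, sizeof(int));

    cudaStream_t s;
    cudaStreamCreate(&s);
    produce<<<(n + 255) / 256, 256, 0, s>>>(d_data, n);           // op 1 in stream s
    consume<<<(n + 255) / 256, 256, 0, s>>>(d_data, d_total, n);  // op 2: starts only after op 1 finishes
    cudaMemcpyAsync(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);

    printf("total = %d (expected %d)\n", h_total, n);
    cudaStreamDestroy(s);
    cudaFree(d_data);
    cudaFree(d_total);
    return 0;
}
```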
