Max threads/blocks

Hi,

So I’ve just started taking the Getting Started with Accelerated Computing in CUDA C/C++ course and have completed the first section--

But I had a question regarding the max threads / blocks that doesn’t seem to be mentioned.

I mean I can understand if convention says the max threads you can have per block is 1024-- But what about the max number of blocks ? There seems to be no mention of this.

Or, what I’m getting at: some cards have way more CUDA cores than others, so this must figure in, somehow, when determining the number of blocks available, no ?

If not, then okay, but if so, how do you query the card to know how many blocks you have to work with ?

The limitations per GPU architecture are listed in Table 21 of the programming guide: CUDA C++ Programming Guide

The maximum number of thread blocks for a kernel is (2^31 - 1) * 65535 * 65535.


Table 21 in section 16 of the CUDA Programming Guide lists the maxima for each GPU generation. The maximum number of blocks in a grid is independent of the number of SMs and thus the number of “CUDA cores”.
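For reference, a minimal sketch (not from the course) of how to query these limits at runtime with cudaGetDeviceProperties; it also prints the SM count, which, as noted, does not affect the grid limits:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size        : %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Number of SMs        : %d\n", prop.multiProcessorCount);
    return 0;
}
```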

@njuffa thanks. Hmm, that does sound a little weird though-- So are the blocks and threads more of a ‘programming abstraction’ than a hardware one ?

The grid limits listed in the Programming Guide (generally, (2^31 - 1) · (2^16 - 1) · (2^16 - 1), as noted by @striker159) are hard limits imposed by the hardware. But the hardware itself is built to support some abstraction: a grid of thread blocks is scheduled onto however many SMs are available.
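To make that concrete, here is a small illustrative example (my own, not from the guide): the launch below creates tens of thousands of blocks regardless of whether the GPU has 10 SMs or 100; the hardware simply works through the grid with whatever SMs it has.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) x[i] *= 2.0f;                        // each thread handles one element
}

int main() {
    const int n = 1 << 24;                    // 16M elements (arbitrary size)
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 65536 blocks, far more than any SM count
    scale<<<blocks, threadsPerBlock>>>(x, n);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);              // expect 2.0
    cudaFree(x);
    return 0;
}
```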

Okay, thank you both for your help. I believe SMs are covered in the second section, so I am not there yet.

I’ll reflect back if I am still a little confused (I already have a couple of other questions in my mind, but let’s see if they are covered there first).

Since threads within a block are running (or at least resident on the SM) at the same time, whereas blocks can be executed serially without any ordering guarantees, threads/block is more of an actual hardware limitation and blocks/grid is more of an abstraction.
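A small sketch of the practical difference (my own illustration): threads of one block are co-resident, so they can cooperate through shared memory and __syncthreads(); there is no comparable barrier across the blocks of a grid in a plain kernel launch.

```
#include <cstdio>
#include <cuda_runtime.h>

// Each block sums its own 256 input elements using shared memory.
// __syncthreads() only synchronizes the threads of this one block.
__global__ void blockSum(const float *in, float *blockResults) {
    __shared__ float partial[256];                     // assumes blockDim.x == 256
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                                   // barrier for this block only

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockResults[blockIdx.x] = partial[0];  // one partial sum per block
}

int main() {
    const int threads = 256, blocks = 1024, n = threads * blocks;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    blockSum<<<blocks, threads>>>(in, out);
    cudaDeviceSynchronize();

    printf("block 0 sum = %f (expected %d)\n", out[0], threads);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```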


Hey guys (@njuffa, @striker159, @Curefab),

So I’ve been able to make it through the second section on SMs now, and their role in the block/thread picture is much clearer.

Still, one related question-- Given the SM limit, obviously only so many blocks (and their threads) can run at once. Those not running, I am presuming, are stored in some sort of ‘queue’.

Is this an online or offline queue ? Or I guess what I mean is are the relevant operations (and related data) resident in memory somewhere (either on GPU/CPU) before they are run? Or is all this fetched as the blocks come out of the queue and head to the SM(s)?

I am kind of wondering with regards to memory constraint concerns for particularly large operations.

There are four states of the blocks and kernels (not sure about the exact terminology):

  1. If you use CUDA streams (and in some cases with the default stream), then kernels and other asynchronous API calls themselves are queued until the kernel is launched (see the sketch at the end of this post).

  2. As soon as a kernel launches, some blocks are loaded onto the SMs; the other blocks are kept in a queue. Those waiting blocks are distinguished only by their block index (x, y, z position within the grid). No additional memory is reserved for them (unless you reserve it manually).

  3. The loaded blocks are called resident; their constituent warps are loaded into SM partitions (typically 4 partitions per SM). The scheduler switches quickly between all resident warps. Only a fraction (e.g. 1/8) is selected each cycle, but all can be seen as running, similar in concept to a multitasking operating system running several processes.

  4. Some of those resident warps are selected each cycle (one per scheduler) and their instructions are dispatched to execution units. Those execution units typically have a pipeline design: each cycle a new workload can be inserted, but it takes several cycles until a result is ready. During that time the scheduler switches to different warps. That means computations are done for several warps at the same time (in different stages of the computation pipelines).

The resident blocks and warps (3. and 4.) need resources, e.g. memory, but also internal HW resources.
The blocks waiting to start (1. and 2.) only need such resources, e.g. memory, if you reserve them manually. You could reserve memory for buffer space (though buffers only need to be reserved for resident blocks) or for kernel results, e.g. a filter output.
If the output of the kernel is too large, then you have to rethink your algorithm to process the output in parts instead.
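Here is the sketch mentioned above (names and sizes are mine, just for illustration): the host queues work into a stream (state 1), and once the kernel launches, blocks that are not yet resident exist only as indices in the grid (state 2), with no per-block memory reserved for them.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    // Which chunk of work this block owns is derived purely from blockIdx;
    // blocks that have not started yet are nothing more than this index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;
}

int main() {
    const int n = 1 << 26;                       // ~262144 blocks of 256 threads
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // State 1: these calls return immediately; the operations wait in the stream's queue.
    cudaMemsetAsync(d_data, 0, n * sizeof(float), stream);
    process<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n);  // state 2 begins once this launches

    cudaStreamSynchronize(stream);               // host waits here for the queued work

    float first;
    cudaMemcpy(&first, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("data[0] = %f (expected 1.0)\n", first);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```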


Thanks everyone for all the help.

So I’ve been able to successfully complete and pass this course.

I guess there is just one remaining related question that is lingering-- When asynchronous streams are covered, it is mentioned that kernel calls run in order within at least any one given stream-- By this, do they mean each call to a particular kernel as a whole (which is what I am guessing), or the work within a single kernel call (i.e. its blocks and threads)-- My guess then is that this is why it is really important to take care to note the current threadIdx and blockIdx to know ‘where you are’.

Similarly, though usefully the default stream is always blocking, in custom streams, or even within the call of a single kernel instance with multiple blocks, it seems like you could run into all sorts of crazy race conditions(?)

A kernel call within a stream is always one stream operation. All of its blocks and threads finish before the stream continues.
It does not matter if the same or a different kernel is called as the next operation in the stream.

Within a kernel call the blocks and their threads run asynchronously, and you can possibly get all kinds of crazy race conditions, yes.

That is why you try to make blocks as independent as possible in your algorithm.

The same applies to a lesser degree to warps, and to the least degree to the threads within a warp.

And where you have to share work or data, or reconfigure which thread is responsible for which data packet (which can make sense even within a kernel), you use synchronization primitives.
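A rough sketch of both points (my own example, with made-up kernel names): two kernels issued to the same stream run strictly in order, so consume() only starts after every block of produce() has finished; within consume(), the cross-block accumulation uses atomicAdd to avoid a race.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void produce(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 1;
}

__global__ void consume(const int *data, int *total, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, data[i]);   // safe accumulation across all blocks
}

int main() {
    const int n = 1 << 20;
    int *d_data, *d_total, h_total = 0;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemset(d_total, 0, sizeof(int));

    cudaStream_t s;
    cudaStreamCreate(&s);
    produce<<<(n + 255) / 256, 256, 0, s>>>(d_data, n);           // op 1 in stream s
    consume<<<(n + 255) / 256, 256, 0, s>>>(d_data, d_total, n);  // op 2: starts only after op 1 finishes
    cudaMemcpyAsync(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);

    printf("total = %d (expected %d)\n", h_total, n);
    cudaStreamDestroy(s);
    cudaFree(d_data);
    cudaFree(d_total);
    return 0;
}
```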
