A simple question about resident blocks per multiprocessor

The CUDA Programming Guide shows that the maximum number of resident blocks per multiprocessor is 16 for Compute Capability 3.0.

Here, my GTX 660 has 5 multiprocessors, so I simply thought I could launch 16 * 5 = 80 blocks at one time, and each multiprocessor would automatically be assigned 16 resident blocks. Is that right?

However, in fact, I can launch more than 80 blocks (e.g. 1000 blocks), and the kernel still launches successfully. Why?

Thread blocks may be executed out of order or even serially. So your GPU won’t have 1000 blocks running concurrently, but it will eventually execute all of them.
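To make that concrete, here is a minimal sketch (hypothetical kernel and data, not code from this thread) of exactly this situation: the grid has 1000 blocks, but on a GTX 660 at most 16 blocks/SM * 5 SMs = 80 of them are resident at any moment; the rest are fed to the SMs as resident blocks retire.

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: each thread doubles one element.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1000 * 256;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 1000 blocks launch fine even though only ~80 can be resident at once:
    // the scheduler hands waiting blocks to SMs as others finish.
    scale<<<1000, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```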

Thanks for your reply, it helps me a lot. I always wonder how many blocks per grid and how many threads per block I should assign to a kernel, since the hardware resources are limited.

Some programming guides indicate that an overflowing or improper grid configuration (number of blocks and threads) will cause the kernel launch to fail. Now that I know the number of blocks won’t result in a launch failure, I wonder what kinds of cases will lead to a launch failure, and what rules I should follow to avoid one?

Basically, don’t exceed the kernel launch bounds for your particular card. Launching more threads per block than supported will definitely cause a launch failure. See the tech specs table here:
https://en.wikipedia.org/wiki/CUDA
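You can also query those limits at runtime instead of reading them off a table. A minimal sketch (hypothetical dummy kernel): exceeding maxThreadsPerBlock (1024 on Compute Capability 3.0) makes the launch itself fail, which cudaGetLastError() reports.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() {}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid size x:       %d\n", prop.maxGridSize[0]);

    dummy<<<1, 2048>>>();               // 2048 > 1024: invalid configuration
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```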

Look at the profiling tools (Nsight or the Visual Profiler) if you want to see what is happening when your kernel runs.

Adding to @vacaloca’s suggestion, at GTC14 I learned that cuda-memcheck has a “--report-api-errors” option.

It’s the ultimate solution for lazy CUDA coders!
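For reference, the invocation looks like this (my_app is a placeholder for your own executable):

```
cuda-memcheck --report-api-errors all ./my_app
```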

@Gregory Diamos & all: Thanks for the reply. I had the same belief, but you clarified it. Also, we know that the GigaThread engine schedules the blocks serially or out of order. But until they are scheduled, where do the blocks reside? The active ones are held on the SM (which has registers, shared memory, etc.). What about the other blocks, where do they reside (in any memory location)?

My confusion is this: let’s say we have a 2D image that requires 4832 blocks, each with 256 threads (16x16), which is far more than all the SMs can hold at once. Each thread represents a pixel, which carries an intensity value, and each thread is part of one of the 4832 blocks.
Now, if only some of these blocks are executing on the available SMs, then while the other blocks wait, where do they reside, and where is the mapping to their respective pixels stored?
I hope I have managed to express my confusion/doubt.

If not, I shall try again, but help is highly appreciated; it’s bugging me a lot.

Thanks

Blocks that are not yet executing do not reside anywhere. There are no register contents to hold yet, apart from the kernel parameters, which are identical for all blocks (and a single copy exists in constant memory), and the block and thread numbers, which can be generated.

Think of it as future loop iterations inside a sequential program: they don’t exist yet (other than as an abstract concept or as an intention) until code execution eventually gets there.

Regarding the mapping: Don’t think of it as a large array of block id->SM id mappings that needs to be stored somewhere, because we (as CUDA end-users) cannot predict it. It can be generated “on the fly” as SMs become available, and Nvidia’s engineers are free to choose the mechanism for that, as long as every block number in the requested range is generated exactly once.
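As a minimal sketch of that loop analogy (plain host-side C with a hypothetical run_block helper; a mental model, not the actual hardware scheduler):

```cpp
#include <stdio.h>

/* Hypothetical stand-in for executing one block: in the image example,
 * each "block" would process its own 16x16 tile of pixels. */
static void run_block(int bx, int by)
{
    printf("running block (%d, %d)\n", bx, by);
}

int main(void)
{
    const int grid_x = 8, grid_y = 4;   /* a toy 32-block grid */

    /* Waiting "blocks" occupy no storage: their indices are generated
     * on the fly when execution reaches that iteration. */
    for (int by = 0; by < grid_y; ++by)
        for (int bx = 0; bx < grid_x; ++bx)
            run_block(bx, by);
    return 0;
}
```

The GPU does the same thing, only in parallel waves: every block index in the requested grid is handed out exactly once, in whatever order the hardware finds convenient.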