A simple question about resident blocks per multiprocessor

The CUDA Programming Guide shows that the maximum number of resident blocks per multiprocessor is 16 for Compute Capability 3.0.

Here, my GTX 660 has 5 multiprocessors, so I simply thought I could launch 16 * 5 = 80 blocks at one time, and each multiprocessor would automatically be assigned 16 resident blocks. Is that right?

However, in fact, I can launch more than 80 blocks (e.g. 1000 blocks), and the kernel still launches successfully. Why?

Thread blocks may be executed out of order or even serially. So your GPU won’t have 1000 blocks running concurrently, but it will eventually execute all of them.
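To make that concrete, here is a minimal sketch (hypothetical kernel and data, not code from this thread) of exactly this situation: the grid has 1000 blocks, but on a GTX 660 at most 16 blocks/SM * 5 SMs = 80 of them are resident at any moment; the rest are fed to the SMs as resident blocks retire.

```cpp
#include <cuda_runtime.h>

// Hypothetical kernel: each thread doubles one element.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1000 * 256;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 1000 blocks launch fine even though only ~80 can be resident at once:
    // the scheduler hands waiting blocks to SMs as others finish.
    scale<<<1000, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```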

Thanks for your reply, it helps me a lot. I always wonder how many blocks per grid and how many threads per block I should assign to a kernel, since the hardware resources are limited.

Some programming guides indicate that an overflowing or improper grid configuration (number of blocks and threads) will cause the kernel launch to fail. Now that I know the number of blocks won’t result in a launch failure, I wonder what kinds of cases will lead to a launch failure, and what rules I should follow to avoid one?

Basically, don’t exceed the kernel launch bounds for your particular card. Launching more threads per block than supported will definitely cause a launch failure. See the tech specs table here:
https://en.wikipedia.org/wiki/CUDA
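You can also query those limits at runtime instead of reading them off a table. A minimal sketch (hypothetical dummy kernel): exceeding maxThreadsPerBlock (1024 on Compute Capability 3.0) makes the launch itself fail, which cudaGetLastError() reports.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() {}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid size x:       %d\n", prop.maxGridSize[0]);

    dummy<<<1, 2048>>>();               // 2048 > 1024: invalid configuration
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```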

Look at the profiling tools (Nsight or the Visual Profiler) if you want to see what is happening when your kernel runs.

Adding to @vacaloca’s suggestion, at GTC14 I learned that cuda-memcheck has a “--report-api-errors” option.

It’s the ultimate solution for lazy CUDA coders!
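For reference, the invocation looks like this (my_app is a placeholder for your own executable):

```
cuda-memcheck --report-api-errors all ./my_app
```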

@Gregory Diamos & all: Thanks for the reply. I had the same belief, but you clarified it. Also, we know that the GigaThread engine schedules the blocks serially or out of order. But until they are scheduled, where do the blocks reside? The active ones are held on the SM (which has registers, shared memory, etc.). What about the other blocks, where do they reside (in any memory location)?

My confusion is this: let’s say we have a 2D image that requires 4832 blocks, each with 256 threads (16x16), which is far more than all the SMs can hold at once. Each thread represents a pixel, which carries an intensity value, and each thread is part of one of the 4832 blocks.
Now, if only some of these blocks are executing on the available SMs, then while the other blocks wait, where do they reside, and where is the mapping to their respective pixels stored?
I hope I have managed to express my confusion/doubt.

If not, I shall try again, but help is highly appreciated; it’s bugging me a lot.

Thanks

Blocks that are not yet executing do not reside anywhere. There are no register contents to hold yet, apart from the kernel parameters, which are identical for all blocks (and a single copy exists in constant memory), and the block and thread numbers, which can be generated.

Think of it as future loop iterations inside a sequential program: they don’t exist yet (other than as an abstract concept or as an intention) until code execution eventually gets there.

Regarding the mapping: Don’t think of it as a large array of block id->SM id mappings that needs to be stored somewhere, because we (as CUDA end-users) cannot predict it. It can be generated “on the fly” as SMs become available, and Nvidia’s engineers are free to choose the mechanism for that, as long as every block number in the requested range is generated exactly once.
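As a minimal sketch of that loop analogy (plain host-side C with a hypothetical run_block helper; a mental model, not the actual hardware scheduler):

```cpp
#include <stdio.h>

/* Hypothetical stand-in for executing one block: in the image example,
 * each "block" would process its own 16x16 tile of pixels. */
static void run_block(int bx, int by)
{
    printf("running block (%d, %d)\n", bx, by);
}

int main(void)
{
    const int grid_x = 8, grid_y = 4;   /* a toy 32-block grid */

    /* Waiting "blocks" occupy no storage: their indices are generated
     * on the fly when execution reaches that iteration. */
    for (int by = 0; by < grid_y; ++by)
        for (int bx = 0; bx < grid_x; ++bx)
            run_block(bx, by);
    return 0;
}
```

The GPU does the same thing, only in parallel waves: every block index in the requested grid is handed out exactly once, in whatever order the hardware finds convenient.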