Deep understanding of how a block is actually processed on an MP

16k registers and 1024 threads per MP are available on compute capability 1.2–1.3, while the CC 1.1 MPs in my 9800 have 8k registers and 768 threads each.
Am I right that a resident thread on an MP is similar to a resident process on an ordinary CPU, and that a CUDA MP is able to keep resident threads from different blocks?

Not really, no. There is a vague similarity in that each thread has state (registers, local memory) that must be managed during context switching, but otherwise threads lack 99% of the “autonomy” of a typical OS process, or even of a lightweight thread. Threads execute in groups of 32 (“warps”), and every thread in a warp comes from the same block.

Sort of. An MP can keep more than one block resident if resource availability allows it. Blocks cannot be split across different MPs. So there can be resident threads from different blocks on an MP at a given time, but the thread is not the unit at which MP scheduling occurs; the block is. A given block is broken into warps of 32 threads, and the threads are executed in SIMD fashion on the MP’s cores. On your CC 1.1 device, each instruction is issued four times per warp across the MP’s 8 cores.
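
As a back-of-the-envelope illustration of “if resource availability allows it”, here is a small host-side sketch that considers only the CC 1.1 thread limit and the resident-block cap; per-block register and shared memory usage (not shown here) can lower the count further:

```
#include <cstdio>

int main()
{
    const int maxThreadsPerMP = 768;  // CC 1.1 thread limit
    const int maxBlocksPerMP  = 8;    // hardware cap on resident blocks per MP

    const int sizes[] = { 64, 128, 192, 256, 384 };
    for (int i = 0; i < 5; ++i) {
        int threadsPerBlock = sizes[i];
        int byThreads = maxThreadsPerMP / threadsPerBlock;
        int resident  = byThreads < maxBlocksPerMP ? byThreads : maxBlocksPerMP;
        printf("%3d threads/block -> up to %d resident block(s) per MP\n",
               threadsPerBlock, resident);
    }
    return 0;
}
```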

In other words, threads of the same “pack” that are resident at once can be switched seamlessly, but threads from different “packs” (resident and non-resident) cannot be switched that easily, right?
And is it right that a “pack” of resident threads is kept on the MP until they all finish their path, and another “pack” is not let in until then?

EDIT: no one looked at the attachment :)
From the profiler info, I get occupancy 1 for 2 blocks of 384 threads, 3 blocks of 256 threads, 4 blocks of 192 threads, etc. Does this actually mean that all these blocks operate on the same MP?
I tend to treat it as a profiler error, because a performance test definitely shows that for 1…14 blocks only 1 block is present per MP.

If you are going to continue asking questions about this topic, at least try to stick to the terminology that everybody understands: I have no idea what you are trying to refer to with “pack”.

thread -> warp (32 threads) -> block (one or more warps) -> grid (one or more blocks)
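
The same hierarchy expressed in code, as a minimal sketch (the kernel name and launch configuration below are made up purely for illustration):

```
__global__ void whoAmI()
{
    int threadInBlock = threadIdx.x;                              // thread within its block
    int warpInBlock   = threadInBlock / 32;                       // warp = group of 32 threads
    int globalThread  = blockIdx.x * blockDim.x + threadInBlock;  // thread within the grid
    (void)warpInBlock; (void)globalThread;                        // silence unused-variable warnings
}

// A grid of 4 blocks of 192 threads each = 4 * 6 = 24 warps in total:
// whoAmI<<<4, 192>>>();
```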

All of this is described and illustrated in Chapter 2 of the CUDA programming guide. Now might be a good time to acquaint or reacquaint yourself with its contents.

Sorry, by “pack” I mean the threads that are resident on an MP. As far as I know, this is not the same as a block (up to 8 blocks can be resident on an MP), and especially not a warp. For CC 1.1 the “pack” is 768 threads, for 1.2–1.3 it is 1024 threads.
I just don’t know how to name it properly, sorry again. I do know about grids, blocks, warps, etc. The programming guide contains little information about “resident warps/threads”, which implies that the “max number of resident warps/threads” should be obvious, but it’s not obvious to me :(

There is no such thing.

As I said in an earlier post, MPs are given complete blocks to run by the GPU scheduler, not threads. Any given block only ever runs on one MP, and it remains “resident” or “active” until all threads in the block have finished; then the block is retired. Blocks (and by extension all the threads in a given block) are never context switched out of the MP they run on. Once they become resident, they are executed until they finish. The number of active threads on an MP is always the number of active blocks * the number of threads per block. This is often less than the maximum-threads-per-MP figure you seem to be fixating on.
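
To make the last two sentences concrete, here is a small host-side sketch for a 192-threads-per-block kernel like the one in the question, assuming the thread limit is the only constraint on residency:

```
#include <cstdio>

int main()
{
    const int maxThreadsPerMP = 768;   // CC 1.1
    const int threadsPerBlock = 192;   // block size from the question

    // Active threads on one MP for 1..4 resident blocks of this kernel.
    for (int activeBlocks = 1; activeBlocks <= 4; ++activeBlocks) {
        int activeThreads = activeBlocks * threadsPerBlock;
        printf("%d active block(s) -> %3d active threads (max %d per MP)\n",
               activeBlocks, activeThreads, maxThreadsPerMP);
    }
    return 0;
}
```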

Thanks all, now I think I understand. Sorry again :)

But what about this?
From the profiler info, I get occupancy 1 for 2 blocks of 384 threads, 3 blocks of 256 threads, 4 blocks of 192 threads, etc. Does this actually mean that all these blocks operate on the same MP? The way I understood it, 4 blocks of 192 threads should give an occupancy of 0.25.
I tend to treat it as a profiler error, because a performance test definitely shows that for 1…14 blocks only 1 block is present per MP.

Yes. Occupancy is defined as the number of active warps divided by the maximum number of active warps per MP. So in each case you have the maximum number of active warps (24 warps = 768 threads per MP), and the occupancy is 1.

4 * (192/32) = 24 warps = 768 threads per MP, so the occupancy is 1.
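
The same arithmetic for all three launch configurations from the question, assuming only the CC 1.1 thread limit constrains residency (heavy register or shared memory use could lower the resident block count and therefore the occupancy):

```
#include <cstdio>

int main()
{
    const int maxWarpsPerMP = 24;  // 768 threads / 32 on CC 1.1
    const int configs[3][2] = { {2, 384}, {3, 256}, {4, 192} };  // {blocks per MP, threads per block}

    for (int i = 0; i < 3; ++i) {
        int warpsPerBlock = configs[i][1] / 32;
        int activeWarps   = configs[i][0] * warpsPerBlock;
        printf("%d blocks x %3d threads: %2d active warps -> occupancy %.2f\n",
               configs[i][0], configs[i][1], activeWarps,
               (float)activeWarps / maxWarpsPerMP);
    }
    return 0;
}
```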

It isn’t a profiling error. The consensus is that scheduling is done “round robin” over the free MPs. So for a GT200 with 30 MPs, 30 blocks = 1 block per MP, 60 blocks = 2 blocks per MP, and so on until the per-MP maximum is reached (so you would need 120 blocks of your 192-threads-per-block kernel to reach full occupancy on all MPs). Additional blocks beyond the number required to “fill” the GPU wait until MPs are free, and are then scheduled in the same fashion.
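
Roughly what that round-robin picture implies for the GT200 example (30 MPs, 192-thread blocks, so at most 4 resident blocks per MP by the thread limit); the even spread is an illustrative assumption, not a documented scheduling algorithm:

```
#include <cstdio>

int main()
{
    const int numMP          = 30;  // GT200
    const int maxBlocksPerMP = 4;   // 768 / 192 for this kernel

    const int totals[] = { 30, 60, 120, 150 };
    for (int i = 0; i < 4; ++i) {
        int perMP = (totals[i] + numMP - 1) / numMP;          // round-robin spread
        if (perMP > maxBlocksPerMP) perMP = maxBlocksPerMP;   // extra blocks wait for a free MP
        printf("%3d blocks -> up to %d resident block(s) per MP\n", totals[i], perMP);
    }
    return 0;
}
```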

I meant that a total of 4 blocks of 192 threads on a 14-MP GPU gives full occupancy according to the profiler, not 14*4 blocks of 192 threads.
Actually, beyond 7 blocks in total, the graph of “occupancy vs. number of threads” (or number of warps) stays the same, independent of the number of blocks.