What resources are needed for a block to run?

We know that for a block to be scheduled onto an SM there must be enough registers and shared memory, and the per-SM limits on threads and blocks must not be exceeded, but what else has to be satisfied? I have a kernel launched with 32 blocks of 256 threads each. Each thread uses 13 registers and each block uses only 2000 bytes of shared memory. I would expect all 32 blocks to run concurrently; however, when I inspect it with cuda-gdb, I find that only the first 16 blocks are active. When I try to switch to the last 16 blocks, cuda-gdb says "threads not known or not active". (A sketch of the configuration is below the questions.)

  1. Can anyone tell me what the problem is?
  2. I think it may have something to do with bar.sync. According to the PTX manual, barrier resources are limited, but I have never understood what that means. When I check the PTX file, it only uses bar.sync 0. Can anyone explain the limitation of bar.sync?
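For concreteness, here is a minimal sketch of the configuration I am describing. The kernel name and body are made up; what matters is the launch shape and the per-block resources. The actual per-thread register count depends on the compiler, so check it with `nvcc --ptxas-options=-v`.

```
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel reproducing the resources described above:
// 2000 bytes of static shared memory per block, 32 blocks of 256 threads.
__global__ void testKernel(float *out)
{
    __shared__ float buf[500];              // 500 * 4 B = 2000 B per block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x % 500] = (float)tid;    // touch shared memory so it stays live
    __syncthreads();                        // compiles to bar.sync 0 in PTX
    out[tid] = buf[threadIdx.x % 500];
}

int main()
{
    float *d_out;
    cudaMalloc(&d_out, 32 * 256 * sizeof(float));
    testKernel<<<32, 256>>>(d_out);         // 32 blocks of 256 threads
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```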

How many multiprocessors does your GPU have?

  1. I am using a 9800GX2.

New findings: it's because of the register limit. Each block contains 254 threads. When I squeeze register usage down to 10, all 32 blocks run concurrently. Looking at the cudaprof result, occupancy is 1.0, which means at least 3 blocks are running concurrently on an SM. Furthermore, cta_launched is 4, so some SMs are not used. Why not use all the SMs?

When I cut each thread's register usage down to 12, all blocks run concurrently, but 13 or more does not work. If 32 blocks run concurrently on 16 SMs, there should be 2 blocks on each SM, which means 254 * 2 = 508 threads per SM. 8192 / 512 = 16 (rounding 508 up to 512), so I would expect anything up to 16 registers per thread to be fine. Why doesn't it work?
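To make the arithmetic concrete, here is a naive back-of-the-envelope model (plain host code) that treats registers as the only limit and assumes they are allocated exactly per thread. Note that it predicts 2 blocks per SM even at 13 registers per thread, so the hardware must be rounding allocations up somehow; I can't confirm the exact rounding rule.

```
#include <cstdio>

// Naive model: how many blocks fit on one SM if registers were the only
// limit and were allocated exactly per thread (real hardware rounds up).
int blocksByRegs(int regsPerSM, int regsPerThread, int threadsPerBlock)
{
    return regsPerSM / (regsPerThread * threadsPerBlock);
}

int main()
{
    const int regsPerSM = 8192;      // G80/G92 (compute capability 1.0/1.1)
    for (int r = 10; r <= 16; ++r)
        printf("%2d regs/thread -> %d block(s) per SM\n",
               r, blocksByRegs(regsPerSM, r, 256));
    return 0;
}
```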

The per-thread register count I quote comes from the .cubin file. Is that accurate enough?
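As a runtime cross-check of what the .cubin reports, something like the following (using the CUDA runtime's cudaFuncGetAttributes; the kernel is just a stand-in) prints the per-thread register count and static shared memory the driver actually sees. `nvcc --ptxas-options=-v` reports the same numbers at compile time.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void testKernel(float *out) { out[threadIdx.x] = 0.0f; }  // stand-in

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, testKernel);
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory: %zu bytes\n", attr.sharedSizeBytes);
    printf("local memory:         %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```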

Hello sheepy13,

I've been reading about this issue and there is something I'm still not clear on: how the blocks are distributed across the SMs. I think you have this problem too. What can we deduce?

  • First approach: as you mentioned, the SPA could dispatch 1 block to each SM, so the first block goes to the first SM, ..., the 16th block to the 16th SM, the 17th block again to the first SM, and so on (blockIdx starts at 0, but this is just an example of how blocks could be dispatched). In this case each SM has 2 blocks, and 256 (threads) x 2 (blocks) x 16 (registers/thread) equals 8192, the register limit per SM.

  • Second approach: the SPA calculates how many blocks an SM can manage with the current configuration. With 256 threads per block, each SM can manage 3 blocks, so the SPA dispatches 3 blocks to the first SM, another 3 to the second, and so on. In this case the register limit per thread would be 10. As you mentioned, with this tactic only 11 SMs would be used (3 blocks per SM).

Whatever the case, since you tested that with 10 registers per thread all 32 blocks run concurrently while some SMs sit unused, it seems the second approach is what is happening... of course, this is just a theory from a guy with a slight cold :P
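Just to make the two hypotheses concrete, here is a toy host-side simulation of the two dispatch orderings. No claim that the hardware does either; the capacity of 3 blocks/SM is taken from your 10-register result. It reproduces the "11 SMs used" count for the second approach.

```
#include <cstdio>

// Toy model of the two hypotheses for 32 blocks on 16 SMs.
int main()
{
    const int numSMs = 16, numBlocks = 32, blocksPerSMCapacity = 3;
    int sm1[16] = {0}, sm2[16] = {0};

    // First approach: round-robin, one block at a time.
    for (int b = 0; b < numBlocks; ++b)
        sm1[b % numSMs]++;

    // Second approach: fill each SM to capacity before moving on.
    for (int b = 0, s = 0; b < numBlocks; ++b) {
        sm2[s]++;
        if (sm2[s] == blocksPerSMCapacity) ++s;
    }

    for (int s = 0; s < numSMs; ++s)
        printf("SM %2d: round-robin %d block(s), greedy %d block(s)\n",
               s, sm1[s], sm2[s]);
    return 0;
}
```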

Could anyone confirm/deny this approach?

Thanks in advance and Keep in touch!

Thanks for reading! However, I think there is some misunderstanding in your reply. Your second approach seems to assume that the register limit is determined by runtime scheduling; actually, it is register usage that determines block scheduling. So I don't know whether there is a way to prove your theory. The 2.2 driver provides a new register in PTX to indicate which physical SM a block is on, but its value is quite confusing.
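If the register you mean is %smid (I'm assuming that's the one), one way to probe the mapping directly is a tiny kernel with inline PTX. Note the PTX manual says %smid is volatile, i.e. it reports where a thread is at the moment it reads it, which might be why the values look confusing.

```
#include <cstdio>
#include <cuda_runtime.h>

// Each block records the SM it finds itself on via the %smid PTX register.
__global__ void whereAmI(unsigned int *smOfBlock)
{
    if (threadIdx.x == 0) {
        unsigned int smid;
        asm("mov.u32 %0, %%smid;" : "=r"(smid));
        smOfBlock[blockIdx.x] = smid;
    }
}

int main()
{
    const int numBlocks = 32;
    unsigned int *d_sm, h_sm[numBlocks];
    cudaMalloc(&d_sm, numBlocks * sizeof(unsigned int));
    whereAmI<<<numBlocks, 256>>>(d_sm);
    cudaMemcpy(h_sm, d_sm, sizeof(h_sm), cudaMemcpyDeviceToHost);
    for (int b = 0; b < numBlocks; ++b)
        printf("block %2d ran on SM %u\n", b, h_sm[b]);
    cudaFree(d_sm);
    return 0;
}
```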

It’s a mystery to me. I would expect it to work like you described, launching 2 blocks per SM with up to 16 registers per thread.

Here are a couple of thoughts. One is that occupancy might not mean the same thing when you have few enough blocks that they can all be launched at once: vacancy due to exhausted resources and vacancy due to simply having few blocks might be accounted for differently. Just speculation, though.

The other thought is that maybe gdb is interacting somehow with register use. I don't have any personal experience with cuda-gdb, so this is just a guess as well.

I see your point. I think it makes sense. But how about cta_launched?

I don't think so. When not using cuda-gdb, the situation is the same.

All the documentation I've read says the same thing (it all shares the same origin: © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009):

  • Thread Life Cycle in HW
  • Thread Blocks are serially distributed to all the SMs (potentially >1 Thread Block per SM).

We don't know how they are serialized: one block at a time across the SMs, or several blocks' worth of threads at once to each SM (that's what I meant in the example of 3 blocks of 256 threads going to the first SM).

  • SM Executes Blocks
  • Threads are assigned to SMs in Block granularity:
      - An SM in G80 can take up to 768 threads
      - This is 24 warps (occupancy calculator!!)
      - Could be 256 (threads/block) * 3 blocks
      - Or 128 (threads/block) * 6 blocks, etc.
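Those slide numbers are easy to check mechanically. A quick host-side sketch, assuming threads are the only limit (the per-SM cap of 8 blocks for this generation comes from the programming guide):

```
#include <cstdio>

// Check the slide's arithmetic: how many blocks fit on a G80 SM if
// threads were the only limit (the per-SM block cap of 8 also applies).
int main()
{
    const int maxThreadsPerSM = 768, maxBlocksPerSM = 8;
    int sizes[] = {128, 256, 512};
    for (int i = 0; i < 3; ++i) {
        int byThreads = maxThreadsPerSM / sizes[i];
        int blocks = byThreads < maxBlocksPerSM ? byThreads : maxBlocksPerSM;
        printf("%3d threads/block -> %d block(s)/SM (%d warps)\n",
               sizes[i], blocks, blocks * sizes[i] / 32);
    }
    return 0;
}
```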

In your test, you were able to launch 32 blocks of 256 threads/block using 10 registers/thread. I think the scheduler is trying to dispatch 3 blocks to each SM and then working out how many registers are available per thread.

Could you try a configuration of 48 blocks of 256 threads and see the result? If the second approach is right, each SM will get 3 blocks and all the SMs will be used.

Again, these are only thoughts. I'm not proficient with PTX and that stuff, sorry :)

Keep in touch!