Thread management with cuda-gdb

Hi
I have a couple of basic questions regarding cuda-gdb. I have launched a program with 1024 threads. According to info cuda sms, only SM 0 is active.

*  0 0x00000000ffffffff
...
  67 0x0000000000000000

1- Is there a way to manually dispatch the threads? I mean, for example, SM 0 and SM 1 each running 512 threads.
2- What is the meaning of the mask value? Does that 0xFFFFFFFF mean 64 bits set to 1?

Another thing is about the limit on the number of threads. When I launch 1024 threads, the active warps are shown like this:

(cuda-gdb) info cuda warps
  Wp Active Lanes Mask Divergent Lanes Mask Active Physical PC Kernel BlockIdx First Active ThreadIdx
Device 0 SM 0
*  0        0xffffffff           0x00000000 0x00000000000000f0      0  (0,0,0)                (0,0,0)
   1        0xffffffff           0x00000000 0x00000000000000f0      0  (0,0,0)               (32,0,0)
   2        0xffffffff           0x00000000 0x00000000000000f0      0  (0,0,0)               (64,0,0)
...
  30        0xffffffff           0x00000000 0x00000000000000f0      0  (0,0,0)              (960,0,0)
  31        0xffffffff           0x00000000 0x00000000000000f0      0  (0,0,0)              (992,0,0)
  32        0x00000000           0x00000000                n/a    n/a      n/a                    n/a
  33        0x00000000           0x00000000                n/a    n/a      n/a                    n/a
...
  46        0x00000000           0x00000000                n/a    n/a      n/a                    n/a
  47        0x00000000           0x00000000                n/a    n/a      n/a                    n/a

That is normal, and this output says there are 16 idle warps still available on SM 0. But if I increase the number of threads from 1024 to 1025, warp 32 is not activated and the program returns an error.
3- Is there any way to activate warps 32-47?

Hi @mahmood.nt,
Thank you for your questions! I will address some of them and redirect your post to a different forum branch to get detailed answers for the rest.

1- Is there a way to manually dispatch the threads? I mean, for example, SM 0 and SM 1 each running 512 threads.

This is more of a CUDA programming question (one for the CUDA Programming and Performance - NVIDIA Developer Forums branch), but I think it’s not possible (unless you tweak your kernel in such a way that only 512 threads fit into a single SM).

2- What is the meaning of the mask value? Does that 0xFFFFFFFF mean 64 bits set to 1?

This mask value shows that 32 warps (note that 0xFFFFFFFF has 32 bits set to 1) are active on SM 0.

But if I increase the number of threads from 1024 to 1025, warp 32 is not activated and the program returns an error.

Again, folks on CUDA Programming and Performance - NVIDIA Developer Forums can provide a more detailed answer, but this can be related to SM-level resource constraints (e.g. there might not be a sufficient number of registers available to run 33 warps at the same time).

I have also moved this topic to CUDA Programming and Performance - NVIDIA Developer Forums

No, you cannot manually dispatch threads.

1025 threads per block is illegal in CUDA.

If you want to see additional warps active on a particular SM, it’s necessary to get multiple blocks resident on that SM. As already indicated, these things depend on your actual source code and cannot be “forced”. It’s a question of occupancy and launch configuration. Since there are 16 idle warps on the SM (32 occupied), it suggests your GPU has a limit of 1536 threads per SM. So you would want to launch threadblocks of 512 threads each, and have enough of them to fully fill every SM (at least 3 times the number of SMs on your GPU). Then (assuming the occupancy calculation permits) you should see 48 active warps per SM.
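A minimal sketch of that launch strategy (the kernel here is a hypothetical placeholder; this assumes a CUDA-capable device and the 1536-threads-per-SM limit inferred above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread writes its global index.
__global__ void fill(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

int main()
{
    int numSMs = 0;
    // Query the number of SMs on device 0.
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);

    // 512 threads per block = 16 warps; 3 blocks per SM reaches the
    // 1536-thread (48-warp) per-SM limit discussed above.
    int blocks  = 3 * numSMs;
    int threads = 512;

    int *d_out = nullptr;
    cudaMalloc(&d_out, (size_t)blocks * threads * sizeof(int));
    fill<<<blocks, threads>>>(d_out);

    // A launch such as <<<1, 1025>>> would instead fail here with
    // cudaErrorInvalidConfiguration, since 1024 is the per-block limit.
    cudaError_t err = cudaGetLastError();
    printf("launch: %s\n", cudaGetErrorString(err));

    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```

With the debugger attached, info cuda sms and info cuda warps should then show all warps resident on every SM (occupancy permitting).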

Thanks for the explanations. With 68 SMs and a block of 512 threads, a launch of <<<68, 512>>> fills all SMs, and in each SM one third of the warps (16/48) are resident.
I then continued with <<<204, 512>>>, where 204/68 = 3, so all SMs and warps become full, and I can verify that with info cuda sms and info cuda warps.

When I launch <<<205, 512>>>, everything is the same as before; however, one block of 512 threads must be inactive. But I couldn’t find a command to see queued threads on an SM. Is there any way to see those threads waiting for resources? Or maybe inactive threads are not yet assigned to an SM. What do you think about that?

Correct. Initially, at least, the 205th block wouldn’t be assigned to an SM until resources on an SM become available. Regarding what you can inspect in cuda-gdb, that is a question for the cuda-gdb forum.