Hi
I have a couple of basic questions regarding cuda-gdb. I have launched a program with 1024 threads. According to info cuda sms, only SM 0 is active:
* 0 0x00000000ffffffff
...
67 0x0000000000000000
1- Is there a way to manually dispatch the threads? I mean, SM 0 and SM 1 each running 512 threads.
2- What is the meaning of the mask value? Is that 0xffffffff 64 one-bits?
Another thing is about the limit on the number of threads. When I launch 1024 threads, the active warps are shown like this:
(cuda-gdb) info cuda warps
Wp Active Lanes Mask Divergent Lanes Mask Active Physical PC Kernel BlockIdx First Active ThreadIdx
Device 0 SM 0
* 0 0xffffffff 0x00000000 0x00000000000000f0 0 (0,0,0) (0,0,0)
1 0xffffffff 0x00000000 0x00000000000000f0 0 (0,0,0) (32,0,0)
2 0xffffffff 0x00000000 0x00000000000000f0 0 (0,0,0) (64,0,0)
...
30 0xffffffff 0x00000000 0x00000000000000f0 0 (0,0,0) (960,0,0)
31 0xffffffff 0x00000000 0x00000000000000f0 0 (0,0,0) (992,0,0)
32 0x00000000 0x00000000 n/a n/a n/a n/a
33 0x00000000 0x00000000 n/a n/a n/a n/a
...
46 0x00000000 0x00000000 n/a n/a n/a n/a
47 0x00000000 0x00000000 n/a n/a n/a n/a
That is normal, but this output shows that there are 16 idle warps still available on SM 0. However, if I increase the number of threads from 1024 to 1025, warp 32 is not activated and the program returns an error.
3- Is there any way to activate warps 32-47?
Hi @mahmood.nt,
Thank you for your questions! I will address some of them and re-direct your post to a different forum branch for detailed answers for the rest.
1- Is there a way to manually dispatch the threads? I mean SM0 and SM1 each having 512 threads.
This is more of a CUDA programming question (for the CUDA Programming and Performance - NVIDIA Developer Forums branch), but I think it's not possible (unless you tweak your kernel in a way that only 512 threads would fit into a single SM).
2- What is the meaning of the mask value? Is that 0xffffffff 64 one-bits?
This mask value shows that 32 warps are active on SM 0 (note that 0xffffffff has 32 bits set to 1).
But if I increase the number of threads from 1024 to 1025, warp 32 is not activated and the program returns an error.
Again, folks on CUDA Programming and Performance - NVIDIA Developer Forums can provide a more detailed answer, but this can be related to SM-level resource constraints (e.g. there might not be a sufficient number of registers available to run 33 warps at the same time).
I have also moved this topic to CUDA Programming and Performance - NVIDIA Developer Forums
No, you cannot manually dispatch threads.
1025 threads per block is illegal in CUDA.
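A quick way to observe this at runtime is to check the error code after the launch. A minimal sketch (the empty kernel `dummy` is made up for illustration; the exact error string may vary by CUDA version):

```cuda
#include <cstdio>

__global__ void dummy() {}

int main() {
    // 1025 exceeds the 1024 threads-per-block limit, so the
    // launch is rejected before the kernel ever runs.
    dummy<<<1, 1025>>>();
    cudaError_t err = cudaGetLastError();
    printf("launch status: %s\n", cudaGetErrorString(err));
    return 0;
}
```

On a failed launch this prints a configuration error rather than cudaSuccess, which matches the error the original poster saw at 1025 threads.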
If you want to see additional warps active on a particular SM, it's necessary to get multiple blocks resident on that SM. As already indicated, these things depend on your actual source code and cannot be "forced". It's a question of occupancy and launch configuration. Since there are 16 idle warps on the SM (32 occupied), this suggests your GPU has a limit of 1536 threads per SM. So you would want to launch threadblocks of 512 threads each, and launch enough of them to fully fill every SM (at least 3 times the number of SMs on your GPU). Then (assuming the occupancy calculation permits) you should see 48 active warps per SM.
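The sizing rule above can be sketched as follows. This is not the poster's actual kernel; the kernel here just reads the %smid special register so each block can report which SM it landed on, which can be cross-checked against info cuda sms:

```cuda
#include <cstdio>

__global__ void whereAmI(int *smids) {
    // %smid is the ID of the SM executing this thread.
    unsigned smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    if (threadIdx.x == 0)
        smids[blockIdx.x] = (int)smid;
}

int main() {
    int numSMs = 0;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);

    // 3 blocks of 512 threads per SM = 48 warps, the per-SM limit
    // assumed in the discussion above (1536 threads per SM).
    int numBlocks = 3 * numSMs;
    int *smids;
    cudaMallocManaged(&smids, numBlocks * sizeof(int));

    whereAmI<<<numBlocks, 512>>>(smids);
    cudaDeviceSynchronize();

    for (int b = 0; b < numBlocks; ++b)
        printf("block %d ran on SM %d\n", b, smids[b]);
    cudaFree(smids);
    return 0;
}
```

If occupancy permits, each SM ID should appear three times in the output.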
Thanks for the explanations. With 68 SMs and a block of 512 threads, a launch of <<<68, 512>>> puts one block on each SM, and on each SM 1/3 of the warps (16/48) are resident.
I then continued with <<<204, 512>>>, where 204/68 = 3, so all SMs and warps become full, and I can verify that with info cuda sms and info cuda warps.
When I launch <<<205, 512>>>, everything is the same as before; however, one block of 512 threads must be inactive. But I couldn't find a command to see queued threads on an SM. Is there any way to see those threads waiting for resources? Or maybe inactive threads are not yet assigned to an SM. What do you think about that?
Correct. Initially, at least, the 205th block wouldn't be assigned to an SM until resources on an SM become available. Regarding what you can inspect in cuda-gdb, that is a question for the cuda-gdb forum.
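Whether 3 blocks of 512 threads can actually be co-resident on one SM can also be checked with the occupancy API rather than inferred from cuda-gdb output. A minimal sketch (the empty `myKernel` stands in for the real kernel; a real kernel's register and shared-memory usage may lower the result):

```cuda
#include <cstdio>

__global__ void myKernel() {}  // placeholder for the real kernel

int main() {
    int maxBlocksPerSM = 0;
    // Reports how many 512-thread blocks of myKernel can be
    // resident on one SM, given its resource usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, myKernel, /*blockSize=*/512, /*dynamicSMemBytes=*/0);
    printf("blocks of 512 resident per SM: %d\n", maxBlocksPerSM);
    return 0;
}
```

If this reports 3, then <<<204, 512>>> on 68 SMs fully occupies the device and the 205th block must wait.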