I have a large number of independent calculations that are done in groups. Each group can have, say, 14 independent runs (a kernel with 14 blocks). With that, I should be able to run two or three such groups at the same time, provided there are sufficient resources (enough shared memory, for example).
For some reason, my kernels keep getting launched sequentially, even though the test calculations are small and two of them should definitely fit on a single MP. Right now I’m using two streams, with all H2D copying done before the two kernel launches and all D2H copying afterward. The memory operations are synchronous, since copying takes a very small share of the time compared to execution.
Are there any requirements for truly concurrent kernel execution, other than available resources? Ideally, I’d like to have 28 blocks from two kernels running on a single Tesla 2050. According to the Best Practices 4.2 guide, it should be possible: “When choosing the block size, it is important to remember that multiple concurrent blocks can reside on a multiprocessor…”. How can I get there? Or do they refer to multiple blocks belonging to the same kernel?
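For context, here is a minimal sketch of the two-stream pattern described above (kernel names, sizes, and buffer names are placeholders, not the actual code). Note that for copy/compute overlap the copies would also need to be asynchronous from pinned host memory, and each chain must go into its own non-default stream:

```cuda
#include <cuda_runtime.h>

// Placeholder kernels standing in for the real per-group computations.
__global__ void kernelA(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}

__global__ void kernelB(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] + 1.0f;
}

int main(void)
{
    const int threads = 256, blocks = 14;
    const size_t bytes = blocks * threads * sizeof(float);

    // Pinned host memory is required for truly asynchronous copies.
    float *hA, *hB, *dA, *dB;
    cudaMallocHost(&hA, bytes);  cudaMallocHost(&hB, bytes);
    cudaMalloc(&dA, bytes);      cudaMalloc(&dB, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Issue each group's copy -> kernel -> copy chain in its own stream,
    // so the two chains are free to overlap if resources allow.
    cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(dB, hB, bytes, cudaMemcpyHostToDevice, s1);
    kernelA<<<blocks, threads, 0, s0>>>(dA, dA);
    kernelB<<<blocks, threads, 0, s1>>>(dB, dB);
    cudaMemcpyAsync(hA, dA, bytes, cudaMemcpyDeviceToHost, s0);
    cudaMemcpyAsync(hB, dB, bytes, cudaMemcpyDeviceToHost, s1);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);  cudaStreamDestroy(s1);
    cudaFree(dA);  cudaFree(dB);  cudaFreeHost(hA);  cudaFreeHost(hB);
    return 0;
}
```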
Thanks for any suggestions.
As far as I know, an SM can only run one kernel at a time, though it can run multiple thread blocks from that kernel. When you launch your first kernel with 14 blocks, each block is scheduled on a different SM to spread the load evenly, so the second kernel has to wait for the first to finish: all SMs are busy with the first kernel.
I don’t know of an easy way to force the 14 blocks of one kernel onto 7 SMs and the 14 blocks of the other kernel onto the other 7. The only approach I can think of right now is to make one big kernel containing the code of both (hint: device functions), launch 28 blocks, and decide which part to run based on the block ID or SM ID.
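A rough sketch of that combined-kernel idea (function names, arguments, and bodies are hypothetical):

```cuda
// The two original kernels become __device__ functions; the block index
// decides which one a given block executes.
__device__ void groupA(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;  // placeholder body
}

__device__ void groupB(float *data)
{
    int b = blockIdx.x - 14;                     // renumber blocks 14..27 as 0..13
    data[b * blockDim.x + threadIdx.x] += 1.0f;  // placeholder body
}

__global__ void combined(float *dataA, float *dataB)
{
    if (blockIdx.x < 14)
        groupA(dataA);   // blocks 0..13 do the first group's work
    else
        groupB(dataB);   // blocks 14..27 do the second group's work
}

// Launched with all 28 blocks at once:
// combined<<<28, threadsPerBlock>>>(dA, dB);
```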
Thanks for the suggestion, Gert-Jan.
Right now I’m getting sequential execution even with 28 blocks requested for a single kernel. For some reason, it just won’t fit more than one block on an SM. Each block requires about 17 KB of shared memory, no constant memory, and fairly limited global memory.
If you can think of any additional requirements for having more than one block per SM, please let me know.
What about registers? The deviceQuery shows the limit of 32768 registers per block. Perhaps, in reality, it’s 32768 registers per MULTIPROCESSOR? Since my kernel uses a lot of registers (close to the listed limit), that would explain why I don’t get two blocks per SM.
Can anyone clarify the registers/block or registers/SM issue?
Yes, Table F-2 in the CUDA Programming Guide indicates that 32K is the limit on the number of registers per multiprocessor for compute capability 2.x.
You could try using fewer threads per block and doing more work per thread instead, or you could instruct the compiler to limit its register use (which may lead to some register spills into local memory).
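For the second option, one mechanism is `__launch_bounds__`; the sketch below assumes 256 threads per block and a target of two resident blocks per SM (the kernel name and body are placeholders):

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) asks
// ptxas to cap register usage so that at least 2 blocks of 256 threads fit
// per SM: on compute capability 2.x that means at most
// 32768 / (256 * 2) = 64 registers per thread.
__global__ void __launch_bounds__(256, 2)
myKernel(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] + 1.0f;  // placeholder body
}
```

The equivalent file-wide switch is nvcc's `--maxrregcount=64`; either way, any values the kernel can no longer keep in registers get spilled to local memory.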
Thanks for your suggestions.
I reduced the number of threads per block from 512 to 256 and managed to run 28 blocks in parallel for a small-sized problem. The average case requires more shared memory, though, and execution fell back into serial mode (14 blocks followed by the other 14).
I guess I’m now past the register limitation and need to reduce shared memory consumption. By my estimates, though, I use only about 17 KB for an average-sized block, and two of those should definitely fit on an SM (48 KB available). However, it doesn’t happen.
Is there a reliable way to determine exactly how much shared memory is used? Can shared memory be used indirectly?
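Regarding indirect use: shared memory can be consumed both by static `__shared__` declarations and by the dynamic allocation requested in the launch configuration, and both count against the per-block total. A sketch (sizes and names are illustrative):

```cuda
__global__ void example(float *out)
{
    __shared__ float staticBuf[1024];      // 4 KB static, visible to the compiler
    extern __shared__ float dynamicBuf[];  // size set at launch time

    int i = threadIdx.x;
    staticBuf[i % 1024] = (float)i;
    dynamicBuf[i] = staticBuf[i % 1024];
    __syncthreads();
    out[blockIdx.x * blockDim.x + i] = dynamicBuf[i];
}

// The third launch-configuration argument requests an extra 8 KB of
// dynamic shared memory per block, on top of the 4 KB static buffer:
// example<<<blocks, 256, 8192>>>(d_out);
```

Also worth checking on compute capability 2.x: the 48 KB figure assumes the shared-memory-preferring cache configuration (see `cudaFuncSetCacheConfig`); with `cudaFuncCachePreferL1` only 16 KB of shared memory is available per SM, which would force exactly one 17 KB block per SM. And the hardware allocates shared memory in fixed-size granules, so the actual per-block charge can come out slightly above your estimate.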
Compile with `-Xptxas=-v` to see both the register and shared memory usage of your kernels.