I have a large number of independent calculations that are done in groups. Each group can have, say, 14 independent runs (a kernel with 14 blocks). With that, I should be able to run two or three such groups at the same time, provided there are sufficient resources (enough shared memory, for example).
For some reason, my kernels keep getting launched sequentially, even though the test calculations are small and two of them should definitely fit on a single MP. Right now I’m using two streams, with all H2D copying done before the two kernel launches and all D2H copying afterward. The memory operations are synchronous, since copying takes a very small share of the time compared to execution.
Are there any requirements for truly concurrent kernel execution, other than available resources? Ideally, I’d like to have 28 blocks from two kernels running on a single Tesla 2050. According to the Best Practices 4.2 guide, it should be possible: “When choosing the block size, it is important to remember that multiple concurrent blocks can reside on a multiprocessor…”. How can I get there? Or do they refer to multiple blocks belonging to the same kernel?
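For context, here is a minimal sketch of the two-stream pattern described above (kernel names, sizes, and buffer names are placeholders, not the actual code). Note that for copy/compute overlap the copies would also need to be asynchronous from pinned host memory, and each chain must go into its own non-default stream:

```cuda
#include <cuda_runtime.h>

// Placeholder kernels standing in for the real per-group computations.
__global__ void kernelA(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2.0f;
}

__global__ void kernelB(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] + 1.0f;
}

int main(void)
{
    const int threads = 256, blocks = 14;
    const size_t bytes = blocks * threads * sizeof(float);

    // Pinned host memory is required for truly asynchronous copies.
    float *hA, *hB, *dA, *dB;
    cudaMallocHost(&hA, bytes);  cudaMallocHost(&hB, bytes);
    cudaMalloc(&dA, bytes);      cudaMalloc(&dB, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Issue each group's copy -> kernel -> copy chain in its own stream,
    // so the two chains are free to overlap if resources allow.
    cudaMemcpyAsync(dA, hA, bytes, cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(dB, hB, bytes, cudaMemcpyHostToDevice, s1);
    kernelA<<<blocks, threads, 0, s0>>>(dA, dA);
    kernelB<<<blocks, threads, 0, s1>>>(dB, dB);
    cudaMemcpyAsync(hA, dA, bytes, cudaMemcpyDeviceToHost, s0);
    cudaMemcpyAsync(hB, dB, bytes, cudaMemcpyDeviceToHost, s1);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);  cudaStreamDestroy(s1);
    cudaFree(dA);  cudaFree(dB);  cudaFreeHost(hA);  cudaFreeHost(hB);
    return 0;
}
```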
Thanks for any suggestions.
As far as I know, an SM can only run one kernel at a time, though it can run multiple thread blocks from that kernel. When you launch your first kernel with 14 blocks, each block is scheduled on a different SM to spread the load evenly, so the second kernel has to wait for the first to finish: all SMs are busy with the first kernel.
I don’t know of an easy way to force the 14 blocks of one kernel onto 7 SMs and the 14 blocks of the other kernel onto the other 7. The only approach I can think of right now is to make one big kernel containing the code of both (hint: device functions), launch 28 blocks, and decide which part to run based on the block ID or SM ID.
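A rough sketch of that combined-kernel idea (function names, arguments, and bodies are hypothetical):

```cuda
// The two original kernels become __device__ functions; the block index
// decides which one a given block executes.
__device__ void groupA(float *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;  // placeholder body
}

__device__ void groupB(float *data)
{
    int b = blockIdx.x - 14;                     // renumber blocks 14..27 as 0..13
    data[b * blockDim.x + threadIdx.x] += 1.0f;  // placeholder body
}

__global__ void combined(float *dataA, float *dataB)
{
    if (blockIdx.x < 14)
        groupA(dataA);   // blocks 0..13 do the first group's work
    else
        groupB(dataB);   // blocks 14..27 do the second group's work
}

// Launched with all 28 blocks at once:
// combined<<<28, threadsPerBlock>>>(dA, dB);
```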
Thanks for the suggestion, Gert-Jan.
Right now I’m getting sequential execution even with 28 blocks requested for a single kernel. For some reason, it just won’t fit more than one block on an SM. Each block requires about 17 KB of shared memory, no constant memory, and fairly limited global memory.
If you can think of any additional requirements for having more than one block per SM, please let me know.
What about registers? The deviceQuery shows the limit of 32768 registers per block. Perhaps, in reality, it’s 32768 registers per MULTIPROCESSOR? Since my kernel uses a lot of registers (close to the listed limit), that would explain why I don’t get two blocks per SM.
Can anyone clarify the registers/block or registers/SM issue?
Yes, Table F-2 in the CUDA Programming Guide indicates that 32K is the limit on the number of registers per multiprocessor for compute capability 2.x.
You could try using fewer threads per block and doing more work per thread instead, or you could instruct the compiler to limit its register use (which may lead to some register spills into local memory).
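For the second option, one mechanism is `__launch_bounds__`; the sketch below assumes 256 threads per block and a target of two resident blocks per SM (the kernel name and body are placeholders):

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) asks
// ptxas to cap register usage so that at least 2 blocks of 256 threads fit
// per SM: on compute capability 2.x that means at most
// 32768 / (256 * 2) = 64 registers per thread.
__global__ void __launch_bounds__(256, 2)
myKernel(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] + 1.0f;  // placeholder body
}
```

The equivalent file-wide switch is nvcc's `--maxrregcount=64`; either way, any values the kernel can no longer keep in registers get spilled to local memory.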
Thanks for your suggestions.
I reduced the number of threads per block from 512 to 256 and managed to run 28 blocks in parallel for a small-sized problem. The average case requires more shared memory, though, and execution fell back into serial mode (14 blocks followed by the other 14).
I guess I’m now past the register limitation and need to reduce shared memory consumption. By my estimates, though, I use only about 17 KB for an average-sized block, and two of those should definitely fit on an SM (48 KB available). However, it doesn’t happen.
Is there a reliable way to determine exactly how much shared memory is used? Can shared memory be used indirectly?
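Regarding indirect use: shared memory can be consumed both by static `__shared__` declarations and by the dynamic allocation requested in the launch configuration, and both count against the per-block total. A sketch (sizes and names are illustrative):

```cuda
__global__ void example(float *out)
{
    __shared__ float staticBuf[1024];      // 4 KB static, visible to the compiler
    extern __shared__ float dynamicBuf[];  // size set at launch time

    int i = threadIdx.x;
    staticBuf[i % 1024] = (float)i;
    dynamicBuf[i] = staticBuf[i % 1024];
    __syncthreads();
    out[blockIdx.x * blockDim.x + i] = dynamicBuf[i];
}

// The third launch-configuration argument requests an extra 8 KB of
// dynamic shared memory per block, on top of the 4 KB static buffer:
// example<<<blocks, 256, 8192>>>(d_out);
```

Also worth checking on compute capability 2.x: the 48 KB figure assumes the shared-memory-preferring cache configuration (see `cudaFuncSetCacheConfig`); with `cudaFuncCachePreferL1` only 16 KB of shared memory is available per SM, which would force exactly one 17 KB block per SM. And the hardware allocates shared memory in fixed-size granules, so the actual per-block charge can come out slightly above your estimate.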
Compile with `-Xptxas=-v` to see both the register and shared memory usage of your kernels.