Does shared memory have the lifetime of a block?

Part1<<< grid1, threads1 >>>( d_X , d_B , d_r , p_sum , d_module);
Part2<<< grid1, threads1 >>>( d_X , d_B , d_r , p_sum , d_module);
Part3<<< grid1, threads1 >>>( d_X , d_B , d_r , p_sum , d_module);
Part4<<< grid1, threads1 >>>( d_X , d_B , d_r , p_sum , d_module);

In my .cu file, inside main(), I configured the kernels Part1, Part2, Part3 and
Part4 to execute on the device. Each of Part1 through Part4 declares its own
shared memory (e.g. 4 KB per block).

When I look at the .cubin file, I see:

name = Part1
lmem = 0
smem = 1XXX0 <== example
reg = 8

name = Part2
lmem = 0
smem = 1XXX1 <== example
reg = 8

name = Part3
lmem = 0
smem = 1XXX2 <== example
reg = 8

Why does the reported shared memory increase as I add more kernels?

There is currently a known bug with the shared mem calculation if you put more than one kernel in the .cu file. The toolkit update to come shortly should fix this.
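On more recent toolkits, one way to cross-check the numbers in the .cubin is to ask the runtime what each kernel actually uses. The sketch below assumes a toolkit that provides cudaFuncGetAttributes; KernelA and KernelB are hypothetical stand-ins for the Part kernels, each declaring 4 KB of static shared memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernels; each declares 1024 floats (4 KB) of shared memory.
__global__ void KernelA(float *x)
{
    __shared__ float buf[1024];
    buf[threadIdx.x] = x[threadIdx.x];
    __syncthreads();
    x[threadIdx.x] = buf[(threadIdx.x + 1) % blockDim.x];
}

__global__ void KernelB(float *x)
{
    __shared__ float buf[1024];
    buf[threadIdx.x] = 2.0f * x[threadIdx.x];
    __syncthreads();
    x[threadIdx.x] = buf[(threadIdx.x + 1) % blockDim.x];
}

// Print the per-kernel resource usage reported by the runtime.
template <typename Kernel>
static void report(const char *name, Kernel k)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, k);
    if (err != cudaSuccess) {
        printf("%s: %s\n", name, cudaGetErrorString(err));
        return;
    }
    printf("name = %s  lmem = %zu  smem = %zu  reg = %d\n",
           name, attr.localSizeBytes, attr.sharedSizeBytes, attr.numRegs);
}

int main()
{
    report("KernelA", KernelA);
    report("KernelB", KernelB);
    return 0;
}
```

If the two kernels report the same smem here while the .cubin shows a growing value, that would point to the accounting problem described above rather than to real usage.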


Is there no way to work around this problem for now?

If I use only one kernel (e.g. Part1), a multiprocessor can process more than two blocks concurrently. But when I add more kernels, the reported shared memory increases as well. The number of blocks that can run concurrently on a multiprocessor is limited by shared memory.
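To illustrate how shared memory per block limits the number of resident blocks, here is a minimal sketch that assumes a toolkit recent enough to provide cudaOccupancyMaxActiveBlocksPerMultiprocessor. The kernel Scale, the block size of 128 threads and the three shared memory sizes are hypothetical, not taken from the code in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel using dynamically sized shared memory,
// so the amount per block can be varied at launch time.
__global__ void Scale(float *x, int n)
{
    extern __shared__ float buf[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[threadIdx.x] = x[i];
        x[i] = 2.0f * buf[threadIdx.x];
    }
}

int main()
{
    const int threads = 128;                       // assumed block size
    const size_t sizes[] = {4096, 8192, 16384};    // shared memory per block, in bytes

    for (int k = 0; k < 3; ++k) {
        int blocks = 0;
        // Ask the runtime how many blocks of Scale fit on one multiprocessor
        // when each block requests sizes[k] bytes of shared memory.
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks, Scale, threads, sizes[k]);
        if (err != cudaSuccess) {
            printf("query failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("smem per block = %zu bytes -> resident blocks per multiprocessor = %d\n",
               sizes[k], blocks);
    }
    return 0;
}
```

The exact counts depend on the device, but the number of resident blocks should not go up as the shared memory per block grows.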

Put them into separate .cu files.


I have tried putting a kernel in another .cu file and calling it from a function
in the main .cu file. But I still have to "#include" that .cu file in the main .cu file.

How can I keep the two files separate?

You need a C (host) wrapper for each kernel invocation. Then you can simply call the kernels through the wrapper from a C function by including a forward declaration to the wrapper.
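A minimal sketch of that wrapper pattern follows. The kernel body, the simplified parameter list, and the names launch_Part1, part1.cu and main.cu are assumptions for illustration, since the real Part1 is not shown in this thread. The two files are shown in one listing, separated by comments.

```cuda
// ---------- part1.cu : holds the kernel and its host wrapper ----------
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel body; assumes at most 256 threads per block.
__global__ void Part1(float *d_X, int n)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = d_X[i];
        d_X[i] = 2.0f * tile[threadIdx.x];
    }
}

// Host wrapper with C linkage: the only symbol other files need to see.
extern "C" void launch_Part1(float *d_X, int n, int blocks, int threads)
{
    Part1<<<blocks, threads>>>(d_X, n);
}

// ---------- main.cu : does NOT #include part1.cu ----------
#include <cstdio>
#include <cuda_runtime.h>

// Forward declaration of the wrapper defined in part1.cu.
extern "C" void launch_Part1(float *d_X, int n, int blocks, int threads);

int main()
{
    const int n = 1024, threads = 256;
    float h_X[n];
    for (int i = 0; i < n; ++i) h_X[i] = (float)i;

    float *d_X = NULL;
    cudaMalloc((void **)&d_X, n * sizeof(float));
    cudaMemcpy(d_X, h_X, n * sizeof(float), cudaMemcpyHostToDevice);

    // Call the kernel through the wrapper; no <<< >>> syntax is needed here.
    launch_Part1(d_X, n, (n + threads - 1) / threads, threads);
    cudaError_t err = cudaGetLastError();

    cudaMemcpy(h_X, d_X, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_X);

    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("h_X[3] = %f\n", h_X[3]);
    return 0;
}
```

With the files split like this, something along the lines of "nvcc part1.cu main.cu -o app" should build both and link them. If the caller is a plain C file, the forward declaration would drop the extern "C" part.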


It works now. Thank you very much :)