In my .cu file, inside main(), I configured the kernels part1, part2, part3 and part4 to
execute on the device. Each of part1, part2, part3 and part4 declares its own shared memory
(e.g. 4 KB per block).
When I look at the .cubin, I see:
name = Part1
lmem = 0
smem = 1XXX0 <== example
reg = 8
name = Part2
lmem = 0
smem = 1XXX1 <== example
reg = 8
name = Part3
lmem = 0
smem = 1XXX2 <== example
reg = 8
…
Why does the reported shared memory increase as I configure more kernels?
There is currently a known bug in the shared memory calculation when you put more than one kernel in the same .cu file. The upcoming toolkit update should fix this.
If I use only one kernel (e.g. part1), a multiprocessor can process more than two of its blocks concurrently. But when I use more kernels, the reported shared memory also increases. The number of a kernel's blocks that can run concurrently on a multiprocessor is limited by shared memory.
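To make the shared memory limit concrete, here is a minimal sketch of the arithmetic, plus a runtime query that can be compared against the .cubin numbers. It assumes a later CUDA toolkit than the one discussed here, one that provides cudaFuncGetAttributes and the sharedMemPerMultiprocessor device property. The kernel body and the helper names blocksLimitedBySharedMem and reportPart1 are hypothetical stand-ins, not the poster's code.

```cuda
#include <climits>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's part1: 4 KB of static shared memory.
// Assumes it is launched with at most 1024 threads per block.
__global__ void part1(float *out)
{
    __shared__ float tile[1024];               // 1024 floats * 4 bytes = 4096 bytes
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}

// Upper bound on resident blocks per multiprocessor from shared memory alone.
// Ignores register, thread-count, and any per-block overhead limits.
int blocksLimitedBySharedMem(size_t smemPerMultiprocessor, size_t smemPerBlock)
{
    if (smemPerBlock == 0) return INT_MAX;     // shared memory imposes no limit
    return (int)(smemPerMultiprocessor / smemPerBlock);  // integer division rounds down
}

// Prints what the runtime reports for part1 on device 0; returns 0 on success.
int reportPart1(void)
{
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, part1) != cudaSuccess) return -1;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    printf("part1: smem = %zu bytes, reg = %d, lmem = %zu bytes\n",
           attr.sharedSizeBytes, attr.numRegs, attr.localSizeBytes);
    printf("shared-memory limit: %d blocks per multiprocessor\n",
           blocksLimitedBySharedMem(prop.sharedMemPerMultiprocessor,
                                    attr.sharedSizeBytes));
    return 0;
}
```

For example, with 16 KB of shared memory per multiprocessor, a block that needs exactly 4096 bytes allows at most 4 resident blocks, while a block reported at 4128 bytes (an illustrative figure, not taken from the .cubin above) allows only 3. This is why a small overcount in the reported smem can reduce the number of concurrent blocks.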
You need a C (host) wrapper for each kernel invocation. You can then call the kernels through the wrapper from a C function by including a forward declaration of the wrapper.
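The wrapper approach described above might look roughly like the following sketch. The kernel body, the names part1_kernel and launch_part1, and the launch configuration of 256 threads per block are all assumptions made for illustration; only the overall pattern (a host wrapper with C linkage in the .cu file, a forward declaration on the C side) comes from the advice above.

```cuda
// part1.cu (hypothetical file name), compiled with nvcc
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element of data.
__global__ void part1_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Host wrapper with C linkage so that a plain C file can call it.
// Returns 0 on success, -1 on any CUDA error.
extern "C" int launch_part1(float *host_data, int n)
{
    if (n <= 0) return 0;                      // nothing to do

    float *dev = NULL;
    size_t bytes = (size_t)n * sizeof(float);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return -1;

    cudaError_t err = cudaMemcpy(dev, host_data, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;                     // assumed block size
        int blocks = (n + threads - 1) / threads;
        part1_kernel<<<blocks, threads>>>(dev, n);
        err = cudaGetLastError();              // catches launch failures
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host_data, dev, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev);
    return err == cudaSuccess ? 0 : -1;
}

// In the C file (e.g. main.c) only a forward declaration is needed:
//   int launch_part1(float *host_data, int n);
```

The C file never sees the `<<<...>>>` launch syntax, so it can be built with an ordinary C compiler and linked against the object file that nvcc produces from the .cu file.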