I’m a newbie at CUDA programming, but I need to use it in a complex project, and I would really appreciate some help.
My question is: if I want to execute a child kernel 256 times concurrently, how can I do that with Dynamic Parallelism?
I read a blog post, https://devblogs.nvidia.com/cuda-dynamic-parallelism-api-principles/, which says: “By default, grids launched within a thread block are executed sequentially: the next grid starts executing only after the previous one has finished. This happens even if grids are launched by different threads within the block.”
So my idea is to set the block size to (1,1) and the grid size to (256,1) for the parent kernel, so that the child kernel is launched by 256 threads that each sit in a different block and the launches are not serialized within one block. Will this be very inefficient? Is there a better solution?
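To make the idea concrete, here is a minimal sketch of the launch configuration I have in mind. The names `childKernel`, `parentKernel`, and `out`, and the child launch size of one block of 32 threads, are placeholders I made up for illustration; the real child kernel would do the actual work. I am assuming a GPU that supports Dynamic Parallelism (compute capability 3.5 or higher) and a build with relocatable device code, for example `nvcc -arch=sm_35 -rdc=true sketch.cu -lcudadevrt`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder child kernel: records which parent block launched it.
__global__ void childKernel(int parentBlock, int *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        out[parentBlock] = parentBlock;
    }
}

// Parent kernel: launched with one thread per block, so every child
// launch originates from a different thread block.
__global__ void parentKernel(int *out)
{
    // Child launch size (1 block of 32 threads) is an arbitrary assumption.
    childKernel<<<1, 32>>>(blockIdx.x, out);
}

int main()
{
    const int numChildren = 256;
    int *dOut = nullptr;
    int hOut[numChildren];

    cudaMalloc(&dOut, numChildren * sizeof(int));
    cudaMemset(dOut, 0xFF, numChildren * sizeof(int)); // every entry becomes -1

    // Grid size (256,1), block size (1,1): 256 parent blocks of one thread each.
    parentKernel<<<numChildren, 1>>>(dOut);

    // The parent grid is not complete until all of its child grids finish.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hOut, dOut, numChildren * sizeof(int), cudaMemcpyDeviceToHost);

    // Count how many child grids actually wrote their slot.
    int ran = 0;
    for (int i = 0; i < numChildren; ++i) {
        if (hOut[i] == i) ++ran;
    }
    printf("child grids that ran: %d of %d\n", ran, numChildren);

    cudaFree(dOut);
    return 0;
}
```

As far as I understand, this only makes concurrent execution of the child grids possible; whether they actually overlap still depends on the hardware and on how many resources each child grid uses.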
Thank you for your help!