I’m fairly new to CUDA and, after playing around with a few simple examples, I want to try my first fairly complex algorithm port.
I’ve been looking at different examples and noticed a few approaches that different people take:
1. Implement the entire algorithm in a single CUDA kernel. This could include multiple loops and conditional branches.
2. Implement the algorithm “outline” on the host, and call multiple small CUDA kernels to handle the different loop-parallelism opportunities within the main algorithm.
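To make the two structures concrete, here is a minimal sketch using a made-up two-stage computation (scale each element, then add a per-element offset) standing in for the real algorithm; the kernel names and stages are just illustrative assumptions:

```cuda
// Approach 1: one monolithic kernel containing both stages.
__global__ void monolithic(float *data, const float *offsets, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= scale;       // stage 1
        data[i] += offsets[i];  // stage 2 (could branch/loop between stages)
    }
}

// Approach 2: the host drives the outline; each stage is its own small kernel.
__global__ void scaleKernel(float *data, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= scale;
}

__global__ void offsetKernel(float *data, const float *offsets, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += offsets[i];
}
```

With approach 2 the host would launch `scaleKernel` and then `offsetKernel` on the same device buffer, so each small kernel stays simple and uniformly parallel.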
I would think that the second approach would allow for better optimization on the GPU for memory and instruction coalescing (assuming the overhead of a kernel launch is small), but I’m not sure. Option 2 also assumes that memory copying between the different CUDA kernel calls is kept to a minimum.
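On the memory-copying point: intermediate results can stay resident in device memory between launches, so the multi-kernel structure doesn’t have to imply extra host/device transfers. A hedged sketch of that host outline, with hypothetical stage kernels standing in for the real ones:

```cuda
#include <cuda_runtime.h>

// Two hypothetical stage kernels from the multi-kernel decomposition.
__global__ void stageOne(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

__global__ void stageTwo(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void runPipeline(float *h_data, int n)
{
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // One host-to-device copy up front...
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // ...then each kernel reads and writes the same device buffer,
    // so nothing crosses the bus between launches.
    stageOne<<<blocks, threads>>>(d_data, n);
    stageTwo<<<blocks, threads>>>(d_data, n);

    // One device-to-host copy at the end.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```

Kernel launches on the same stream are serialized in order, so `stageTwo` sees `stageOne`’s results without any explicit synchronization here.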
I know this is a pretty general question, but I was hoping there were some guidelines for choosing between the two.