Strategies for implementing a large algorithm in CUDA

I’m fairly new to CUDA and, after playing around with a few simple examples, I want to try my first fairly complex algorithm port.

I’ve been looking at different examples and noticed a few approaches that different people take:

  1. implement the entire algorithm in a single CUDA kernel. This could include multiple loops and conditional branches.

  2. implement the algorithm “outline” on the host, and call multiple small CUDA kernels to handle the different loop parallelism opportunities within the main algorithm.

I would think that the second approach would allow for better optimization on the GPU, for example memory coalescing (assuming there isn’t much overhead in launching a kernel), but I’m not sure. Option 2 also assumes that data can stay resident on the device between kernel calls, so that host–device memory copies are kept to a minimum.
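For what it’s worth, here is a minimal sketch of what option 2 could look like (the stage kernels, sizes, and names are placeholders, not anything from a real algorithm). The point is that the host provides the sequential “outline” while the working array never leaves device memory between kernel launches:

```c
#include <cuda_runtime.h>

// Hypothetical sketch of option 2: the host drives the algorithm,
// launching one small kernel per parallel stage. The data stays in
// device memory the whole time; only the final result is copied back.
__global__ void stage1(float *d_data, int n) { /* first parallel loop  */ }
__global__ void stage2(float *d_data, int n) { /* second parallel loop */ }
__global__ void stage3(float *d_data, int n) { /* third parallel loop  */ }

void run_algorithm(float *h_data, int n)
{
    float *d_data;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Host-side "outline": sequential control flow, parallel stages.
    stage1<<<grid, block>>>(d_data, n);
    stage2<<<grid, block>>>(d_data, n);
    stage3<<<grid, block>>>(d_data, n);

    // Only one copy back at the end -- no round trips between stages.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```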

I know this is a pretty general question, but I was hoping there were some general guidelines to follow.

This may go without saying, but you may want to examine your algorithm to see if there are any computational loops that can be “unrolled”: if you are looping through arrays and multiplying by a value, for example, it might be faster to use matrix multiplication through cuBLAS for that part of your algorithm.
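As one (hypothetical) illustration of that idea, scaling an array by a constant can be handed off to cuBLAS with a single SSCAL call instead of a hand-written loop; this sketch assumes the array is already in device memory:

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical example: replace "loop over array, multiply by a value"
// with a single cuBLAS call. d_x is assumed to already be in device memory.
void scale_on_gpu(float *d_x, int n, float alpha)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    // x = alpha * x, done in parallel by the library.
    cublasSscal(handle, n, &alpha, d_x, 1);

    cublasDestroy(handle);
}
```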

I haven’t gotten around to writing any actual CUDA kernels yet (my research is mostly in linear algebra/optimization theory), but I think you would want to take approach #2 as you’ve outlined above. Again, if you can identify parts of the algorithm where repetitive (but independent) calculations are running, those can probably be parallelized through CUDA. Also, from what I’ve seen so far, CUDA seems to work just as well on many small batches of data as it does on a few large batches, but the small-batch route is probably better if you will be running the program on a variety of different cards.
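To make the “independent calculations” point concrete, here is a hedged sketch of an element-wise kernel written with a grid-stride loop (the operation itself is just a placeholder); this pattern is one common way to let the same kernel cover any problem size, which helps when it has to run on cards with different numbers of multiprocessors:

```c
// Hypothetical element-wise kernel: each output element depends only on
// the matching input element, so every thread can work independently.
// The grid-stride loop lets the same launch configuration cover any n,
// regardless of how many multiprocessors the card has.
__global__ void square_elements(const float *in, float *out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        out[i] = in[i] * in[i];
    }
}
```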

This depends on various factors. First, how much time does the whole thing really take to execute (on a CPU)? That leads into the next question: are the pieces of your algorithm meaty enough to push separately into CUDA? Finally, is the whole algorithm run once, or is it executed multiple times? There’s no point pushing the entire algorithm into a single kernel if that leaves you with no remaining parallelism to exploit.
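One way to answer the “meaty enough” question is simply to time the candidate pieces. A rough sketch using CUDA events, so the GPU number can be compared against a stopwatch on the CPU version, might look like this (some_kernel, d_data, and n are placeholders for whatever stage you are measuring):

```c
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical timing harness: time one candidate stage with CUDA events
// so it can be compared against the equivalent CPU loop.
__global__ void some_kernel(float *d_data, int n) { /* stage under test */ }

float time_stage(float *d_data, int n)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    some_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);             // wait for the kernel to finish

    cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```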

If your code is sufficiently large, then option 2 will almost always be the best answer. The biggest problem with large kernels is register usage. In order to hide memory latency, you want at least 192 active threads per multiprocessor, which means that each thread of your kernel should use at most 42 registers on compute 1.0 and 1.1 devices. If your kernel uses far too many registers, you get the extra bonus of local variables spilling into local memory, which is as slow as global memory. That delivers an even bigger performance hit.
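If it helps, you can see per-kernel register usage at compile time with the ptxas verbose flag, and optionally hint a launch configuration to the compiler so it limits register usage; a small sketch (the 192 is just an illustrative block size, not a recommendation) could look like this:

```c
// Report register and local-memory usage per kernel at compile time:
//   nvcc -Xptxas -v mykernel.cu
//
// Optionally, tell the compiler the block size you intend to launch with
// so it can cap register usage accordingly. Note that too tight a cap
// forces spills into local memory, which is exactly what you want to avoid.
__global__ void __launch_bounds__(192) my_kernel(float *d_data, int n)
{
    // kernel body here
}
```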