Strategies for implementing a large algorithm in CUDA

I’m fairly new to CUDA and, after playing around with a few simple examples, I want to try my first fairly complex algorithm port.

I’ve been looking at different examples and noticed a few approaches that different people take:

  1. implement the entire algorithm in a single CUDA kernel. This could include multiple loops and conditional branches.

  2. implement the algorithm “outline” on the host, and call multiple small CUDA kernels to handle the different loop parallelism opportunities within the main algorithm.

I would think that the second approach would allow for better optimization on the GPU, for example memory coalescing (assuming there isn’t much overhead in launching a kernel), but I’m not sure. Option 2 also assumes that data can stay resident on the device between kernel calls, so that host–device memory copies are kept to a minimum.
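For what it’s worth, here is a minimal sketch of what option 2 could look like (the stage kernels, sizes, and names are placeholders, not anything from a real algorithm). The point is that the host provides the sequential “outline” while the working array never leaves device memory between kernel launches:

```c
#include <cuda_runtime.h>

// Hypothetical sketch of option 2: the host drives the algorithm,
// launching one small kernel per parallel stage. The data stays in
// device memory the whole time; only the final result is copied back.
__global__ void stage1(float *d_data, int n) { /* first parallel loop  */ }
__global__ void stage2(float *d_data, int n) { /* second parallel loop */ }
__global__ void stage3(float *d_data, int n) { /* third parallel loop  */ }

void run_algorithm(float *h_data, int n)
{
    float *d_data;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    // Host-side "outline": sequential control flow, parallel stages.
    stage1<<<grid, block>>>(d_data, n);
    stage2<<<grid, block>>>(d_data, n);
    stage3<<<grid, block>>>(d_data, n);

    // Only one copy back at the end -- no round trips between stages.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
}
```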

I know this is a pretty general question, but I was hoping there were some general guidelines to follow.

This may go without saying, but you may want to examine your algorithm to see if there are any computational loops that can be “unrolled”: if you are looping through arrays and multiplying by a value, for example, it might be faster to use matrix multiplication through cuBLAS for that part of your algorithm.
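As one (hypothetical) illustration of that idea, scaling an array by a constant can be handed off to cuBLAS with a single SSCAL call instead of a hand-written loop; this sketch assumes the array is already in device memory:

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical example: replace "loop over array, multiply by a value"
// with a single cuBLAS call. d_x is assumed to already be in device memory.
void scale_on_gpu(float *d_x, int n, float alpha)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    // x = alpha * x, done in parallel by the library.
    cublasSscal(handle, n, &alpha, d_x, 1);

    cublasDestroy(handle);
}
```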

I haven’t gotten around to writing any actual CUDA kernels yet (my research is mostly in linear algebra/optimization theory), but I think you would want to take approach #2 as you’ve outlined above. Again, if you can identify parts of the algorithm where repetitive (but independent) calculations are running, those can probably be parallelized through CUDA. Also, from what I’ve seen so far, CUDA seems to work just as well on many small batches of data as it does on a few large batches, but the small-batch route is probably better if you will be running the program on a variety of different cards.
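To make the “independent calculations” point concrete, here is a hedged sketch of an element-wise kernel written with a grid-stride loop (the operation itself is just a placeholder); this pattern is one common way to let the same kernel cover any problem size, which helps when it has to run on cards with different numbers of multiprocessors:

```c
// Hypothetical element-wise kernel: each output element depends only on
// the matching input element, so every thread can work independently.
// The grid-stride loop lets the same launch configuration cover any n,
// regardless of how many multiprocessors the card has.
__global__ void square_elements(const float *in, float *out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        out[i] = in[i] * in[i];
    }
}
```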

This depends on various factors. First, how much time does the whole thing really take to execute (on a CPU)? That leads into the next question: are the pieces of your algorithm meaty enough to push separately into CUDA? Finally, is the whole algorithm run once, or is it executed multiple times? There’s no point pushing the entire algorithm into a single kernel if that leaves you with no remaining parallelism to exploit.
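One way to answer the “meaty enough” question is simply to time the candidate pieces. A rough sketch using CUDA events, so the GPU number can be compared against a stopwatch on the CPU version, might look like this (some_kernel, d_data, and n are placeholders for whatever stage you are measuring):

```c
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical timing harness: time one candidate stage with CUDA events
// so it can be compared against the equivalent CPU loop.
__global__ void some_kernel(float *d_data, int n) { /* stage under test */ }

float time_stage(float *d_data, int n)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    some_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);             // wait for the kernel to finish

    cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```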

If your code is sufficiently large, then option 2 will almost always be the best answer. The biggest problem with large kernels is register usage. In order to hide memory latency, you want at least 192 active threads per multiprocessor, which means that each thread of your kernel should use at most 42 registers on compute 1.0 and 1.1 devices. If your kernel uses far too many registers, you get the extra bonus of local variables spilling into local memory, which is as slow as global memory. That delivers an even bigger performance hit.
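If it helps, you can see per-kernel register usage at compile time with the ptxas verbose flag, and optionally hint a launch configuration to the compiler so it limits register usage; a small sketch (the 192 is just an illustrative block size, not a recommendation) could look like this:

```c
// Report register and local-memory usage per kernel at compile time:
//   nvcc -Xptxas -v mykernel.cu
//
// Optionally, tell the compiler the block size you intend to launch with
// so it can cap register usage accordingly. Note that too tight a cap
// forces spills into local memory, which is exactly what you want to avoid.
__global__ void __launch_bounds__(192) my_kernel(float *d_data, int n)
{
    // kernel body here
}
```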