Best way to program nested loops in CUDA

Many of the example programs in the CUDA software show how to convert loops in C into CUDA kernels; the matrix multiplication example is a good one. However, in some of the code that I am looking to speed up, the loops are nested. How does one handle those?

I know the best approach is to unravel the nested loops and then program each resulting loop in CUDA, as shown in the examples that ship with the software.

But sometimes that may not be possible. So how does one handle nested do-loops in CUDA?



It is really hard to generalize; every application is different. I mostly solve transient partial differential equations with CUDA, usually by the method of lines. A typical code has several nested levels of iteration which need to be broken down into parallel stages, for example:

  1. Cell- or element-level iteration for the spatial discretization, which remains serialized within each cell/element inside a set of CUDA kernels
  2. The spatial discretization loop across all elements/cells in the continuum, which gets parallelized by the CUDA grid-block-thread hierarchy to produce a set of linear or algebraic equations to solve
  3. The temporal integration loop, which stays on the host and calls a sequence of kernels and CUBLAS routines that implement the solver in parallel

I don’t know much about your genetic optimization and search algorithms, so I can’t suggest anything specific. Try a literature search - a surprising number of conference papers describing CUDA-based solutions, across a very broad range of topics, have appeared in the last couple of years.