It is really hard to generalize; every application is different. I mostly solve transient partial differential equations with CUDA, usually by the method of lines. A typical code has a number of nested levels of iteration that need to be broken down into parallel stages, for example:

- Individual cell- or element-level iteration for the spatial discretization, which remains serialized within each cell/element inside a set of CUDA kernels
- The spatial discretization loop across all elements/cells in a continuum, which gets parallelized by the CUDA grid-block-thread hierarchy to produce a set of linear or algebraic equations to solve
- The temporal integration loop, which stays on the host, issuing a sequence of kernel launches and cuBLAS calls that implement the solver in parallel

I don’t know much about your genetic optimization and search algorithms, so I can’t suggest anything specific. Try a literature search: a surprising number of conference papers describing CUDA-based solutions, across a very broad range of topics, have appeared in the last couple of years.