Speeding up genetic algorithms A clueless beginner looking for some advice :-)

Hi all

I’m eager to get involved in the world of GPU-computing but need some advice before I take the plunge. I have done a little bit of reading but remain unsure about a few issues.

At the moment I am working on a fairly large academic project, part of which entails developing/implementing a genetic-algorithm to solve a nasty optimisation problem. My initial hope is to be able to exploit the power of GPU computing to speed up the genetic algorithm.

Without going into any concrete detail, one of the key characteristics of my genetic algorithm is that it runs for very many iterations where, at each iteration, there are many (let’s say no more than 200) sets of mathematically identical, CPU-intensive (e.g. involving difficult function evaluations) but independent calculations that need to be carried out. In addition to this, at the end of each iteration, some fairly minor processing needs to be applied to the combined results from all the independent calculations. Therefore the bulk of the algorithm seems amenable to a parallel implementation. I am using MATLAB.

My questions:

i.) if I use a card with, say, “200 CUDA cores”, does this mean in practical terms that I’ll readily be able to exploit each core to carry out a set of the aforesaid independent calculations at each iteration (suggesting an approx. 100-fold speedup in the algorithm), or is this wishful thinking?

ii.) my budget for the project will probably allow me to purchase a Tesla card - would this be worthwhile compared to buying, say, something like the GTX 580? The faster I can carry out each set of (in principle, fully parallel/independent) calculations, the better, but I’ll not need to carry out more than about 200 sets of calculations simultaneously. As for the amount of data involved in each independent calculation - probably several kb at most.

iii.) MATLAB’s Parallel Computing Toolbox doesn’t seem to support some of the functions I’ll want to run on the GPU (e.g. eig, if etc.) - is it easy to write my own overloaded functions or should I use Jacket, which seems to support rather more MATLAB functions?

iv.) any general comments on whether either Jacket or MATLAB’s Parallel Computing Toolbox would be better suited to my work?

I’d really appreciate any comments as I am totally new to the world of GPU computing. Thanks in advance.

200 parallel task is rather low for a GPU and will not even saturate the low-end cards, which are no faster than a (quad-core) CPU. If you cannot extract more parallelism from your problem (e.g., extract parallelism from each individual task, or do several iterations in parallel), your problem does not seem a good fit for computation on a GPU.

As a rule of thumb, you want at least about 24x as many threads (tasks) as there are cores on the GPU, in order to hide instruction latencies.

Tesla cards aren’t faster than GTX consumer parts (apart from double-precision calculations, which under certain conditions can be up to almost 4x faster on Tesla cards). If your budget allows for a Tesla GPU, but you can’t extract more parallelism from the problem, you might be better off buying a computer with 8 or more CPU cores.

Hi Dr Boognish,

Thanks for this detailed post. Here are some thoughts (I work at AccelerEyes):

This code sounds like a prime candidate for Jacket’s GFOR loop. With GFOR, you can run many FOR-loop iterations simultaneously on the GPU. Learn more about how that works, here: http://wiki.accelereyes.com/wiki/index.php/GFOR_Usage

As long as you are not using the “combined results” from one iteration as input data to subsequent iterations, GFOR lets you do the combining at the end of the loop.

CUDA cores are very different than CPU cores. Generally speaking, CUDA cores are good at exploiting data parallelism while CPU cores are more focused on task parallelism, http://blog.accelereyes.com/blog/2009/01/22/data-parallelism-vs-task-parallelism/. Jacket’s GFOR function is able to convert your FOR-loop iterations (task-parallelism) into a big data-parallel problem (i.e. running lot’s of data through a computation - just with different values in the loop iterators for each tile of data). So rather than thinking about each CUDA core handling an iteration of its own, you should simply think about throughput and that the CUDA cores will work in concert to get the total execution finished faster than otherwise.

The more CUDA cores, the more throughput you will be able to achieve. One major difference between Tesla and GeForce is the size of memory available on the card.

For a comparsion, see http://www.accelereyes.com/products/compare

Tera speaks about needing more threads than cores. This is accurate. However, 200 parallel iterations of your FOR-loop likely translate to way more than 200 parallel tasks (depending on what you have in the body of your loop and depending upon your matrix sizes). As I mentioned above, CPU threads (wherein one thread might be assigned to handle one loop iteration of a loop) are nothing like GPU threads and should not be equated.

Good luck and let us know if there is anything else we can do to help!

Thank you both for the replies.

@Tera, I don’t understand what you mean when you say I need to extract more parallelism from the task. Indeed, how is it possible to achieve more parallelism than having N chunks of calculations that can run entirely independently of each other? Perhaps I should add that each one of the N (say N=200) sets of calculations will in itself be quite complicated. For example, each independent “calculation” in the set will involve solving a system of hundreds of polynomial equations. So my idea would be that each CUDA core can work to solve, say, such a system of equations. Sans a GPU basically I’d be solving hundreds of systems of equations (amongst other things), one after the other, when in fact (in principle) I could be solving all of them independently of each other.

@Melonakos - thank you for the very detailed response. The GFOR loop sounds like exactly what I need. I’m still struggling to understand what you say about CUDA cores and task vs. data parallelism. If you wouldn’t mind, perhaps you could clarify this with regard to my specific problem. To simplify things, let’s say that I simply want to solve 200 independent systems of equations (forgetting about the for loop, which basically means that I’d want to solve 200 systems at every iteration). The methods I use to solve the equations are identical for every system; only, the numbers involved are different. So, to what extent is there task and to what extent is there data parallelism involved here?

Thank you both once again; I appreciate all the help.

When I’m talking about more parallelism, I just mean a larger N. GPUs prefer having tens of thousands of individual tasks (threads). As John already mentioned, there are two main directions to achieve this for your problem:


Modify your genetic optimization algorithm so that no generation depends on the previous K generations, so that you can run K generations in parallel.

Parallelize each of the 200 tasks so that M threads can work on it collectively. The matrix inversion seems to be an ideal candidate for this.

This will then allow you to use K×M×N parallel tasks, getting closer to the tens of thousands of tasks needed for a GPU.

Sounds like your problem is completely data-parallel (i.e. this is the definition of data-parallel, “The methods I use to solve the equations are identical for every system; only, the numbers involved are different”). In other words, you could simply tile the data from each iteration into one big data structure and then pipe that full data structure through the processor in step with the body of the loop.

That is what GFOR does for you automatically.

Task parallel would be to have separate methods to be used on the data (which is not the case here).

Hope this helps. Feel free to shoot me an email (see signature) if I can be useful to you.

As others said, to achieve full load on all 200 cores, you need a lot more than 200 threads. Maybe not tens of thousands, but at least 2000.

Once you resolve that, it comes down to the nature of calculations. 100-fold speedup is almost certainly wishful thinking, especially with only 200 cores (assuming that your CPU code was optimized for SIMD), but it really depends on what you do in that iteration.

The GPU hardware is tailored to perform basic arithmetics and certain transcendental functions (sin, cos, sqrt, log, exp) on single precision (32-bit) floats, and arithmetics (except divisions) on 32-bit integers. Each core of a GTX 580 can do 1.5G basic floating-point operations per second. Each core of an Intel Core i7 can do about 4G such operations per second. So a 512-core GPU would be about 50 times faster on these tasks than a 4-core CPU.

The speedup will be bigger if you rely heavily on transcendentals. 32-bit floating-point cosine is 8 clocks per core on a GTX 580 (throughput 200M ops/second/core). Native FPU cosine in an Intel Core i7 is ~100 clocks (throughput 30-40M ops/second/core) (but there may be optional libraries that speed up parallel cosines through SIMD).

The speedup will be smaller if you need to use any tools that the device wasn’t originally intended for, such as double precision (64-bit) floating point, especially on non - compute capability 2.0 chips (in other words, on anything except 470/480, 570/580, and their Tesla/Quadro counterparts).

If you are OK with 32-bit floating point, and you don’t have enough parallelism to fully load all cores at the same time, you might actually be better off with a cheaper card such as a 560, rather than a 580. The 560 is a smaller chip, and therefore it can work at a higher frequency at the same level of power dissipation. 560’s come with higher core frequencies out of the box, and they can handle more overclocking. Reportedly, 560’s are quite stable at 1000 and even 1050 MHz, whereas 580 does not OC much higher than 850 MHz.