Genetic Algorithm

Anyone know where i can get my hands on a Simple Genetic Algorithm in Cuda?


I`m planing to build universal kernel for GA using CUDA. If you found any implementation of GA with CUDA please let me know :)

This is interesting to me. Isn’t the GA (or its performance anyway) strongly dependent on the particular system being evolved? To create the phenotype from the genotype and evaluate the fitness of the genotype, I would expect that you would have to implement those specific functions for cuda. Certainly possible, but how do you intend to make it universal?

It all depends on your fitness function. If it takes time to calculate, run the GA on the CPU with parallel evaluations of the fitness function on the GPU. The genetic algorithm itself isn’t computationally demanding and is essentially serial in nature (per generation). So unless you have a heavyweight fitness function, no point in using CUDA really.

You might want to read some of the papers at GPGPGPU

What’s wrong with just running a bunch of ‘genes’ through the fitness function in parallel? Since you could run so many at once “for free”, it would also be a good way to avoid local minima or find other saddle points in the function, etc. Anyway, that’s the way I’ve always done it (just not on the GPU, which does make it more interesting).

Nothing wrong with it at all, in fact I think it’s a natural fit. Though I’m skeptical about a universal cuda kernel, because there is no universal fitness function, and the fitness function is exactly what you would want to port to the GPU.

I agree…I was a bit skeptical about that claim as well. However, it wouldn’t be out of the question to build a generic, universal framework that you could plug your fitness function into (as well as the data type for your ‘gene’), and thus be able to quickly port genetic algorithms to the GPU (since you’d really only need to deal with the fitness function itself).

By universal kernel i meant building a framework that implements common selection algorithms ale genetic operations for basic genotype structures. In most cases user of that framework will only have to write fitness function and set algorithm parameters (population size, genetic operations (plus probability), selection algorithm (plus pressure), …). It will also be a possibility to write own genetic operands and apply own genotype structures. What do you thing about that conception?

jjtapiav tanks for link :)

My 1 cent :-)

The conception is perfect, however, looks like you are a bit optimistic about the amount of work to be done.

I’m continuously working on the genetic solver (genetic programming problem solver based on tree genome) for 1.5 years starting from fundamental Koza’s books and very basic but well-structured open source GPC++. This work has been started for my particular needs, however, whole application design is made as universal as possible in order to be able to use it for almost any task.

At this moment, overall size of source codes involved into the project is greater than 1 megabyte (to say nothing about various useful 3rd party tools, for example, nedmalloc memory manager), genetic solver is actually able to try to solve some tasks, however, the To Do list is still huge.

So, if this project is ‘just for fun’ thing - may be it’s better to play with GPC++ for the beginning (

No you can’t do any universal GA on GPU.

Right now I’m trying to make a GPU accelerated Genetic Algorithm that optimizes neural networks. Since neural networks are all about matrix multiplication then Cuda is ideal tool for this task. In my case i think it would be great if I will be able to pull the following parts of GA implementation to GPU:

  1. Fitness function evaluation
  2. Crossover and Gray code implementation (in this case it is even hard to imagine how kernel invocation call should look like)
    First part is easier to accomplish since it only requires multiple matrices multiplication and error estimation. The second part is much harder since it will require dealing with random numbers. In this case it will be extremely hard to do any optimizations. But this part is crucial, if it will be accomplished then it will be possible to do all the computations on GPU without any copies back from device to host (excluding final result copying). But providing a general GA implementation on GPU will be a struggle between universality and performance. The same as it would be with CPU version of such algorithm. General versions of GA are useful only for educational purposes but not for real world applications.

I think that nobody expects to move whole genetic process to GPU … the main goal is to speed it up and to make it as general as possible.

In my case, evaluation of each tree on each test case is done on GPU (the output of GPU calculations is a set of float numbers, one per each input test case). All other things (genetic operations as well as fitness evaluation based on the output from GPU) are done on the host side.

I know it is big amount of work to be done. I will decide to start this project only if a subject of my Thesis will be strongly correlated to it. Now I’m on stage of choosing subject and estimating time for Universal GA for CUDA. Im also aware that main issue is to find best compromise between flexibility and performance. I thing the best way to do that is by experimenting :). Unfortunately in case of heavy evaluation process (e.g. NN) probably better idea is to parallelize execution of fitness function only and perform evaluation of population sequentially. This leaves no place for universality of created solution :( Only that case give conclusion that creation of truly universal GA framework for parallel environment is not possible :( In spite of that Im still enthusiastic about this subject because that framework still can find many applications and improve work of many researchers :))

To Kex:
It is very valuable remark that dealing with random numbers in parallel system can be time consuming especially when working with algorithm that hardly depends on RNG. I will also consider this issue to estimate entire effort. I already found some topics on forum where some solutions for good RNG are presented so I think this is barrier to overcome.
I`m very interested in use of GA for neural networks evolution. Can you write something more about purpose of your work? (i hope it wont be considered as off topic ;] )

I am currently looking into making evolutionary strategies work, especially for multiobjective optimization, but i am quite unexperienced with CUDA.

The problem that i am facing is: with the limitations in register and shared memory size, is it possible to calculate a fitness functions based on many decision variables with a significant speed-up using CUDA? Let’s say a fitness function is using 30-50 variables that have to be read, in this case access to global memory can probably not be avoided and could lead so a significant performance drop-down?

It shouldn’t be a big deal unless you are repeatedly reading from and writing to global memory. You could store the ‘genes’/decision variables in global memory, reading them into the kernel, performing the fitness function, then writing the fitness value out to another array in global memory. With just the read in the beginning and the write at the end, you should still get a pretty good speedup if you’ve got a good number of ‘sets’ of decision variables to evaluate (or a particularly long fitness function).

I wrote one to evolve a wavelet for video compression (actually 3 1-dimensional wavelets, which is far less superior, but I couldn’t find accessible theory for non-tensor-product constructions). For me, the fitness function was much more expensive, so I didn’t optimize much of the GA code. I also generalized a linear crossover scheme so I could evolve good solutions in 15 iterations, which is pretty atypical for genetic algorithms. I pretty much lifted nVidia’s mersenne random kernel, though for most cases you’ll probably be able to copy random data from the host.

I wrote it in Octave first, which took a while, then reimplemented it on the GPU in about 10 days. Having the Octave source made development a lot faster, because I was pretty much just retiling and parallelizing it.


What about the consideration of data transfer between the host and device? Wouldn’t it be preferable then to do a complete implementation on the GPU to eliminate data movement? I am also trying to implement something similar and right now, I’m getting stuck with how to best use device memory for this problem. Any suggestions?

Hi all-

Just stumbled across this post. Guess what? I coded a generic CUDA genetic algorithm as a testcase for a more sophisticated genetic algorithm which I’m now using for high-level data analysis. Each thread block is dedicated to calculating the fitness of a single individual, while each thread within a block computes a single contribution to the fitness ( my fitness function is a metric describing differences between absorption spectral lines over many wavelength bins)

This testcase CUDA GA is basically a simple analytical function optimizer, in which you the user can specify the dimension and functional form of the fitness function. It evaluates the fitness of the entire population in parallel.

I’m not sure, but what do you guys mean by a “universal” GA?

If anyone is interested, I’d be glad to share the code. Cheers!

Hello Sundog,

I’m also considering using a CUDA genetic algorithm to perform global optimization, but I’m quite a newbie in CUDA programming. So I would be highly interested in your code, it would be perfect for me to see how this can be done :)

Would it be possible to share your code ?

Cheers !

I am also very interested in this code, if you can share it. The obvious way to make a “generic” kernel would involve passing a function pointer to some fitness function defined in host code. From a programming stand point, it doesn’t seem so hard to require one to define this on the device instead.