Genetic algorithm: need a strategy


I need advice from people with more experience with CUDA than me (it won't be difficult :whistling:). I'm trying to port a genetic algorithm that does molecular docking to CUDA. It can be summed up in 5 steps:

For the moment, the code is pretty heavy and it's written in C++. I've tried different approaches with smaller and much simpler genetic algorithms to see what I can do. I think the best way to begin the port to CUDA is to recode the evolution step. Then, for example, each member of the population would evolve in parallel, and I wouldn't need a for loop over the whole population.
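A minimal sketch of that idea in plain C++ (the struct, the toy mutation, and the toy fitness are mine, not from the real docking code). The point is that the per-individual body is exactly the unit of work that becomes one CUDA thread:

```cpp
#include <cstddef>
#include <vector>

// Toy genome: one float per individual (hypothetical; a real docking GA
// would carry a full molecular conformation per member).
struct Individual { float gene; float fitness; };

// Evolve ONE member of the population. This is the unit of work that
// maps to a single CUDA thread.
void evolve_one(std::vector<Individual>& pop, std::size_t i) {
    pop[i].gene += 0.1f;                          // toy "mutation"
    pop[i].fitness = -pop[i].gene * pop[i].gene;  // toy fitness
}

// Serial driver: this is the for loop that disappears on the GPU.
// In CUDA the same body would live in a kernel, roughly:
//   __global__ void evolve_kernel(Individual* pop, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) { /* same body as evolve_one */ }
//   }
void evolve_population(std::vector<Individual>& pop) {
    for (std::size_t i = 0; i < pop.size(); ++i)
        evolve_one(pop, i);
}
```

The key design question is then only whether each member's evolution is independent of the others within one generation; if it is, the mapping to one-thread-per-member is direct.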

Another solution is to port the whole code to CUDA and run multiple molecular dockings at once. This solution seems stupid to me because a single GPU core is slower than a CPU core, but I want to hear your opinion.

Thanks for the advice.

Check out the gpuAutoDock project. But I believe they only GPU-ized the "eintcal" and "trilinearInterp" functions.

They also have results with larger population sizes. As far as I remember, they did not get impressive speedups.

Moving the entire code onto the GPU should be a good option because you have lots and lots of cores… if you can fit things into shared memory, you are a winner.

We are sailing in the same boat.


Wow, your project excites me a lot, as my background is in bioengineering. I feel porting the whole code will be better, but start with small steps. Let me know if I can be of help.

First, thanks for your suggestions.

I don't know; I have a bad feeling about porting the entire code to the GPU. By doing that, I'd only be able to run one instance of the program per GPU thread. A GPU core runs slower than a CPU core, and my algorithm takes about 5 minutes to run on the CPU. If I port the whole code, I don't want to end up with an execution time of an hour (for example). Furthermore, I don't know if everything can be contained in shared memory; as I said, the algorithm is heavy. On the other hand, if I do it, I'll be able to solve a LOT more molecular dockings at the same time.

I think I will start by coding only the critical part, the evolution of the population. If that's not enough, I'll port the entire code by calling a kernel for each step. And then, if it's still not enough, I'll do the whole porting.

Is that a good strategy, or won't I be able to gain speed by calling a kernel for each critical step?

It's completely up to you how much or how little you wish to port to the GPU, and naturally, if porting a piece of code makes things slower than on the CPU, there's no point in doing it. Launching CUDA kernels to offload hot spots in compute-intensive code is a widely accepted programming model.

Kernel launches have an overhead of around 10 microseconds, so it should be fine to call them only where needed.

Hi helmoz-

I have been working with a CUDA-ized genetic algorithm for some time now, and I can say the best way to start is to offload just the evaluation of the fitness function for the entire population to the GPU. I've profiled many of my own genetic algorithms, and an overarching commonality is that almost 90% of the actual wallclock runtime is spent evaluating the fitness function for the population. Pull this out of a serial loop and kernelize it, and you'll see some nice speedups. Then you can worry about kernelizing the genetic operators later. On my most heavily used genetic algorithm, I see regular speedups of 18x to 20x just from offloading the fitness function evaluation. Hope this helps.
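The structure this describes looks roughly like the sketch below (plain C++; the placeholder fitness function and names are mine, not from any real docking code). Only the evaluation loop is data-parallel and kernel-shaped; selection, crossover, and mutation stay on the host for now:

```cpp
#include <cstddef>
#include <vector>

// Placeholder fitness: in molecular docking this would be the expensive
// scoring function; here just a toy polynomial peaking at gene == 3.
float fitness(float gene) { return -(gene - 3.0f) * (gene - 3.0f); }

// The hot spot (~90% of wallclock time per the profiling above).
// This is the loop to replace with one kernel launch per generation:
// each i becomes a CUDA thread, genes[] and scores[] live in device
// global memory.
void evaluate_population(const std::vector<float>& genes,
                         std::vector<float>& scores) {
    for (std::size_t i = 0; i < genes.size(); ++i)
        scores[i] = fitness(genes[i]);
}

// The genetic operators stay serial on the host; per generation:
//   evaluate_population(genes, scores);   // -> becomes a kernel launch
//   then select/crossover/mutate on the CPU using scores.
std::size_t best_index(const std::vector<float>& scores) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < scores.size(); ++i)
        if (scores[i] > scores[best]) best = i;
    return best;
}
```

The appeal of this split is that only `genes` goes to the device and only `scores` comes back each generation, so the host-device traffic stays small even before the operators are kernelized.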

That really depends on how much parallelism you can get from parallel population evaluation. If the population size is 50 and each specimen's evaluation must be sequential, then that's not enough for CUDA. 50 threads is nothing; think 50,000. You either need huge populations or the ability to break each specimen's evaluation down into many threads.

Provided the algorithm is parallel, the GPU will run faster, if properly programmed. As you rightly said, you can start with the docking and then think about the remaining parts.

If your CPU takes around 5 minutes, the GPU should take less than 5 minutes (above conditions applied).


Since my last post, I have done some work! I made two versions of a dummy GA that run on CUDA. The first does an evolution of the population and calculates the fitness of each individual. The second runs entirely on the GPU, and I'm able to run multiple instances of the same GA at the same time. That's pretty cool, because for a large amount of data I get a good speedup.

Now I have to port an existing GA written in C++. I'm currently running into problems because the code seems too complex (logical operators and loops), and I use a lot of C structs. As a result, the compiler crashes with a "Memory allocation failed" message. I know CUDA is made for simpler code than mine. So what I'm going to try is to convert the current state of all the objects into one big array and load it onto the GPU. That way, everything is stored in global memory. Honestly, this feels like a desperate attempt; I'm sure that constantly reading and copying between the host and the device will just make the program run slower.
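For what it's worth, flattening the structs is not necessarily a desperate attempt; it's the standard array-of-structs to struct-of-arrays conversion. A sketch (the `Member` fields are hypothetical, standing in for whatever the real structs hold) of the flat layout you would copy to the device once per generation rather than per access:

```cpp
#include <cstddef>
#include <vector>

// Host-side object, as it might exist in the C++ GA (hypothetical fields).
struct Member { float x, y, z; float energy; };

// Device-friendly layout: one contiguous array per field (struct of
// arrays). Adjacent threads then read adjacent addresses, giving
// coalesced global-memory access; one cudaMemcpy per array suffices.
struct FlatPopulation {
    std::vector<float> x, y, z, energy;
};

FlatPopulation flatten(const std::vector<Member>& pop) {
    FlatPopulation f;
    for (const Member& m : pop) {
        f.x.push_back(m.x);
        f.y.push_back(m.y);
        f.z.push_back(m.z);
        f.energy.push_back(m.energy);
    }
    return f;
}
```

The copy cost stays manageable as long as the arrays go to the device once per generation, all the work on them happens in kernels, and only the small results (e.g. the energies) come back, rather than shuttling individual objects back and forth inside the loop.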

Any advice?