Genetic / Neural Implementation: I'm new to CUDA and I'm totally stuck

I’ve been developing code for a neural network in C++ for a while now. Most of my problems involve 750-1000 variables, and I’ve had a tough time getting local optimizers to find a global minimum.

I’d love to rewrite the code to use a genetic algorithm in CUDA to speed up convergence. I haven’t implemented the code in C++ yet, as the ideas to go GA and CUDA came at (almost) the same time.

The problem is, I’m totally stuck. I’ve been reading all sorts of suggestions about memory usage: keeping as much as possible in shared memory, and so on. That’s only about 512 elements, and with a problem that may need a population matrix of approximately [100, 25000], it clearly will not fit in shared memory. I’ve become totally confused about how to approach the problem. Initially, I wanted to move EVERYTHING to the GPU: generate random numbers for the initial guess for the population matrix and, from then on, keep it there.

Can someone please give me some direction on how to approach this code? I can explain the math if necessary. I’m just totally stuck.


Pushing everything onto the GPU is a good idea. With 25,000 variables you shouldn’t have problems, since you have something like 256 MB-1 GB of global memory (depending on your card model).

I played with neural networks a long, long time ago and I don’t remember everything, so I’m not sure of the exact requirements of your problem. In general, though, if every computation may depend on any variable, pushing them into shared memory won’t help much. However, if you access global memory in a coherent (coalesced) way and you have enough threads running in parallel, you should be able to hide global memory access latencies well.

One could also use texture memory if shared memory is not the right fit.

Hi, I’m developing a project of the same kind: neural networks + genetic algorithm.
Until now I was using XMM (SSE) instructions to optimize the function that calculates the state of a layer. Now I’m going to use CUDA. The only thing I want to run on the GPU is the neural network part. The evaluation function of the GA calls the functions that get the output of the neural network for a given input A LOT OF TIMES; that’s why I thought it wasn’t necessary for me to implement anything else on the GPU. I made a post about it here. For the entire network and the GA I’m using C++.

As I’ve developed this further, I’ve also come to approach it in a similar way. I was afraid of the host/device transfer bottleneck. Frankly, I was having trouble conceptualizing the genetic algorithm in parallel, and rather than waste time on that (figuring out permutations, etc. on the GPU), I’m planning to do all of that work on the CPU while the GPU solves the output of the network. A side benefit is that I can have a much larger population without worrying about running out of global memory on the video card. I haven’t written the code yet, but the idea is now to work with a population on the order of 50,000 vectors. The size of the training set is still undetermined, but probably somewhere between 5,000 and 15,000. I have no idea how fast a GTX 260 will crunch through this; depending on that, I will decide whether to transfer the trial population synchronously or asynchronously. I have not yet looked into the implications of doing it either way.


Maybe this project on neural networks on CUDA could be of some help…
I found it pretty helpful -


What I do is basically:

  1. Generate population host side

  2. Push a generation device side

  3. Evaluate fitness on the device (major speed increase here thanks to CUDA)

  4. Push fitness values back

  5. Mutate and such

  6. Repeat 2-5 until satisfied