GA & NN & CUDA

Hi, everyone

I am wondering whether a GA or a neural network would be a good fit for CUDA programming.

My idea for the neural network is that I can treat each thread as a neuron, so I can simulate the neurons with GPU threads. At present, I am trying to build a project that combines the GPU with a GA, or maybe an NN, but I am not sure whether they are compatible with each other. If they are not, what is the barrier?

I would appreciate it if someone could give me some advice.

Sounds like it could work.

However, having one neuron per thread might not be a good idea: individual neurons are quite simple, so it will probably just be faster to iterate through them. Maybe multiple neurons per thread?
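
Something along these lines, as a rough sketch only (the dense layer layout, the array names, and the tanh activation are my assumptions, not anything from the posts above): each thread walks over several output neurons with a grid-stride loop.

[code]
// Sketch: each thread computes several output neurons of one fully
// connected layer; weights, inputs and outputs are dense float arrays
// already resident on the device.
__global__ void layer_forward(const float *weights, const float *inputs,
                              float *outputs, int num_inputs, int num_outputs)
{
    // Grid-stride loop: one thread handles neuron i, i + stride, i + 2*stride, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < num_outputs;
         i += gridDim.x * blockDim.x)
    {
        float sum = 0.0f;
        for (int j = 0; j < num_inputs; ++j)
            sum += weights[i * num_inputs + j] * inputs[j];
        outputs[i] = tanhf(sum);   // activation chosen arbitrarily for the sketch
    }
}
[/code]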

We have implemented a GA on CUDA. Many others have done it before as well.

Essentially, a GA has a “loop-carried dependence” between generations. So you need to look for parallelism within a “generation” - mainly in the fitness function evaluation for the entire population.

If that has sufficient parallelism, you can work it out on CUDA.
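
As an illustration only (the genome layout and the fitness function here are placeholders, not our actual GA), the per-generation evaluation could look like this, with one thread per individual, while selection, crossover and mutation stay in a serial loop on the host between generations:

[code]
// Sketch: evaluate the whole population in parallel, one thread per
// individual. An "individual" is a fixed-length float genome; the
// fitness function is a stand-in (sum of squares).
__global__ void evaluate_population(const float *genomes, float *fitness,
                                    int genome_len, int pop_size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= pop_size)
        return;

    const float *genome = genomes + idx * genome_len;
    float f = 0.0f;
    for (int g = 0; g < genome_len; ++g)
        f += genome[g] * genome[g];   // placeholder fitness
    fitness[idx] = f;
}
[/code]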

Hi, I’m doing the same thing. It’s a free software project that you can download by typing (you need Subversion):

svn co [Parallel Reinforcement Evolutionary ANN download | SourceForge.net] preann

It’s under development, so I’m changing a lot of things right now. You may find it hard to read since it’s not documented yet.
I have parallelized the calculation of the state of a layer. It gets called many times during the fitness evaluation, but the fitness evaluation itself is called for each individual serially.
Right now I have different versions, but I’m not happy with the performance yet, because I get better times with an assembly SSE2 (XMM registers) function on the CPU. I think I’m doing something wrong.

In one of them, each thread is an output neuron and the inputs are shared within the block. I’m getting shared-memory conflicts with this version, but I don’t know why. I asked for help in this post, but no one answered my questions.
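
Roughly, that version works like this (a sketch of the idea only, not the actual preann code; the names and the dense float layout are assumptions):

[code]
// Sketch of the "one thread per output neuron" version: the block first
// stages the input vector in shared memory, then each thread computes
// one output neuron's weighted sum.
__global__ void layer_per_thread(const float *weights, const float *inputs,
                                 float *outputs, int num_inputs, int num_outputs)
{
    extern __shared__ float s_inputs[];      // sized to num_inputs at launch

    // Cooperative load of the inputs into shared memory.
    for (int j = threadIdx.x; j < num_inputs; j += blockDim.x)
        s_inputs[j] = inputs[j];
    __syncthreads();

    int neuron = blockIdx.x * blockDim.x + threadIdx.x;
    if (neuron < num_outputs) {
        float sum = 0.0f;
        for (int j = 0; j < num_inputs; ++j)
            sum += weights[neuron * num_inputs + j] * s_inputs[j];
        outputs[neuron] = sum;               // activation is applied in a later kernel
    }
}
[/code]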

In the other one (which is faster right now, though I think it shouldn’t be), each block is an output neuron. Each thread within the block computes a partial sum, and the partial sums are then accumulated in a similar way to the reduction example.
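
Again just a sketch of the scheme (not the real code), with the block size assumed to be a power of two:

[code]
// Sketch of the "one block per output neuron" version: each thread sums a
// strided slice of the dot product, then the block reduces the partial
// sums in shared memory, much like the SDK reduction example.
__global__ void layer_per_block(const float *weights, const float *inputs,
                                float *outputs, int num_inputs)
{
    extern __shared__ float partial[];       // one slot per thread
    int neuron = blockIdx.x;                 // this block's output neuron

    float sum = 0.0f;
    for (int j = threadIdx.x; j < num_inputs; j += blockDim.x)
        sum += weights[neuron * num_inputs + j] * inputs[j];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction within the block (blockDim.x assumed a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        outputs[neuron] = partial[0];        // activation applied elsewhere
}
[/code]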

The activation is calculated in another kernel.

My networks are allowed to have recurrent connections, and a Layer can take any number of other Layers as inputs.

One thing you might find strange is that I’ve implemented the Layer so that its values can be bits {0, 1} or signs {-1, 1} (also represented as bits in memory) instead of floats. This saves memory and can lead to better times. In those cases I use byte weights (instead of floats) in order to reduce the search space for the GA.
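
For the {0, 1} case, the inner loop can look roughly like this (a sketch with an assumed packing and assumed names, not the preann implementation):

[code]
// Sketch of the bit representation: inputs packed 32 per unsigned int,
// weights stored as signed bytes. Each thread computes the raw sum for
// one output neuron; only the {0, 1} case is shown.
__global__ void layer_bits(const unsigned int *packed_inputs,
                           const signed char *weights,
                           float *outputs, int num_inputs, int num_outputs)
{
    int neuron = blockIdx.x * blockDim.x + threadIdx.x;
    if (neuron >= num_outputs)
        return;

    int sum = 0;
    for (int j = 0; j < num_inputs; ++j) {
        // Unpack input bit j: 1 means the input fired, 0 means it did not.
        int bit = (packed_inputs[j / 32] >> (j % 32)) & 1;
        sum += bit * weights[neuron * num_inputs + j];
    }
    outputs[neuron] = (float)sum;
}
[/code]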

I hope some of this is helpful for you.