Help optimizing simple GA

Hello, I’m struggling with a very simple GA. The idea is to make a performance test with some of my previous work.
What I’m trying to find is a radius-3 cellular automaton “capable of solving” the majority problem.
Even though it’s a very simple test, I’m struggling with the performance. I can do the hard part of the evolution (the fitness calculation for the whole population) in around 680 ms on a GeForce GTX580, but I think this is way too long for such a simple task (all in all, there are only 100 CAs trying to solve 100 ICs with a lattice length of 149).
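For reference, the fitness evaluation described above can be sketched on the CPU roughly like this. It is only a minimal sketch, not the code from the repository: the names (`Rule`, `Lattice`, `step`, `solves`, `fitness`), the periodic boundary and the "all cells equal the initial majority after `maxSteps`" success test are assumptions made for illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <vector>

// All names below are hypothetical; they are not taken from the linked source.
constexpr int kRadius = 3;
constexpr int kNeighborhood = 2 * kRadius + 1;  // 7 cells per neighborhood
constexpr int kRuleSize = 1 << kNeighborhood;   // 128 rule-table entries

using Rule = std::bitset<kRuleSize>;            // one output bit per neighborhood
using Lattice = std::vector<std::uint8_t>;      // each cell is 0 or 1

// One synchronous update of the whole lattice (periodic boundary assumed).
Lattice step(const Rule& rule, const Lattice& cur) {
    const int n = static_cast<int>(cur.size());
    assert(n >= kRadius);  // keeps (i + d + n) non-negative below
    Lattice next(cur.size());
    for (int i = 0; i < n; ++i) {
        unsigned idx = 0;
        // Pack the 7 neighboring cells into a rule-table index.
        for (int d = -kRadius; d <= kRadius; ++d)
            idx = (idx << 1) | cur[(i + d + n) % n];
        next[i] = rule[idx] ? 1 : 0;
    }
    return next;
}

// True if, after maxSteps updates, every cell equals the initial majority bit.
bool solves(const Rule& rule, Lattice cells, int maxSteps) {
    assert(!cells.empty());
    int ones = 0;
    for (std::uint8_t c : cells) ones += c;
    const std::uint8_t majority =
        (2 * ones > static_cast<int>(cells.size())) ? 1 : 0;
    for (int t = 0; t < maxSteps; ++t) cells = step(rule, cells);
    for (std::uint8_t c : cells)
        if (c != majority) return false;
    return true;
}

// Fitness of one CA: fraction of initial configurations it classifies correctly.
double fitness(const Rule& rule, const std::vector<Lattice>& ics, int maxSteps) {
    if (ics.empty()) return 0.0;
    int solved = 0;
    for (const Lattice& ic : ics) solved += solves(rule, ic, maxSteps) ? 1 : 0;
    return static_cast<double>(solved) / ics.size();
}
```

With 100 CAs and 100 ICs this is 10,000 independent `solves` calls per generation, which is the part that gets moved to the GPU.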
What I would like to know is whether anyone can help me improve the performance, even with just some pointers on what I can do or on whether I’m doing something wrong. And yes, I’ve been using the Visual Profiler for some time now, but honestly, I don’t know what else to try.
The full source code can be found at http://trac.geekvault.org/cuda/browser/trunk/cuSimpleDCT/src.
Any help will be greatly appreciated.

Thank you

Hmm… I haven’t gone through your code thoroughly, but here is what I can say…

  • I’m just wondering: 10 CUDA streams on a single device? Is that good? More streams doesn’t always mean better, so you should check the optimal number of streams. In my experience so far, my code’s performance got worse when I used more than 2 or 3 streams…

  • On Fermi, atomic operations are really time-consuming; if possible, it’s better to find another strategy… :D Or, if you’re able to run it on Kepler, I heard NVIDIA improved atomic operations there… :D

  • What about using the fast-math option? It tends to make the code faster but less accurate…

  • Have you tried setting global memory to be cached in L2 only? That leaves more L1 cache space for local memory…

First of all, thanks for the help ;)

Now, by using 10 streams I got a speedup of ~100 ms. I know 10 streams is maybe waaaaaaaaay too much, so I’m still fiddling around to find the right number of streams.

About the atomic operations: I’m not running this on Kepler, it’s really Fermi only, and without creating semaphores and related stuff to avoid race conditions, I really don’t know how to do the density and fitness evaluation and still be sure the result is correct.

Where would you use the fast-math options? Could you clarify that a little more?

How can I make global memory be cached in L2 only?

Once again, thanks!

Well, I REALLY need to thank you!
Changing powf() to __powf() gave me an ENORMOUS increase! Now it runs in under ~200 ms (~160 ms), and if I use exp2f() instead it runs at ~150 ms without losing precision. This means the whole program now runs in under 17 seconds, and a full test (100 executions) will run in under 30 minutes! Compared to the 1 hour and 30 minutes I had before… just… WOW!

Thank you soo much!

Thanks mteguhsat, however enabling the compiler flags made the performance worse :(, both for fast-math and for L2-only caching.
What really made the difference was the change from powf to exp2f…
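To illustrate why that swap is safe, here is a small host-side sketch. It assumes the power of two was being used to turn a 7-cell neighborhood into a rule-table index, which the thread does not actually state, and all function names are made up for the example. For integer exponents, `powf(2.0f, k)` and `exp2f(k)` describe the same value, and a plain bit shift gives it exactly with no floating point at all.

```cpp
#include <cassert>
#include <cstdint>
#include <math.h>

// Hypothetical helpers; the real kernel may compute its index differently.
constexpr int kCells = 7;  // 2 * radius + 1 for a radius-3 CA

// Index built with the general power function.
inline unsigned indexWithPowf(const std::uint8_t* cells) {
    float idx = 0.0f;
    for (int k = 0; k < kCells; ++k)
        idx += cells[k] * powf(2.0f, static_cast<float>(k));
    return static_cast<unsigned>(lroundf(idx));  // round in case of tiny errors
}

// Same index built with the dedicated base-2 exponential.
inline unsigned indexWithExp2f(const std::uint8_t* cells) {
    float idx = 0.0f;
    for (int k = 0; k < kCells; ++k)
        idx += cells[k] * exp2f(static_cast<float>(k));
    return static_cast<unsigned>(lroundf(idx));
}

// Same index built with integer shifts only.
inline unsigned indexWithShift(const std::uint8_t* cells) {
    unsigned idx = 0;
    for (int k = 0; k < kCells; ++k)
        idx |= static_cast<unsigned>(cells[k]) << k;
    return idx;
}

// Checks every possible 7-cell neighborhood; returns how many were checked.
inline int checkAllNeighborhoods() {
    int checked = 0;
    for (unsigned pattern = 0; pattern < (1u << kCells); ++pattern) {
        std::uint8_t cells[kCells];
        for (int k = 0; k < kCells; ++k) cells[k] = (pattern >> k) & 1u;
        assert(indexWithShift(cells) == pattern);
        assert(indexWithPowf(cells) == pattern);
        assert(indexWithExp2f(cells) == pattern);
        ++checked;
    }
    return checked;
}
```

If the index really is an integer sum of powers of two, the shift version might be worth timing on the GPU as well, since it avoids the float conversion entirely.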