2600x speedup for GA? Is this fake?

In a paper published by two researchers from Brno University of Technology, the researchers claim that they’ve achieved speedup of a Genetic Algorithm on GeForce8800GTX against Core i7 920 to be up to 2600x. At first sight, I was like: WOW!!! But then come to think about it, is this thing really possible? Could this kind of 1000+ speedup really be achieved? So I’m in doubt and am looking for some clues from people who have experience with GA on CUDA.
the paper can be found here: http://www.gpgpgpu.com/gecco2009/7.pdf
Actually I’ve googled the names of the researchers and didn’t find anything meaningful. The university is not a small one, though, with 22000+ students.

If this thing is real I’m gonna be so excited :w00t: !!! I’m only starting to play around with CUDA and my card is only GeForce 103M so cannot really confirm this myself.

Hi

one can achieve whatever factor you whish - just depends on how bad your cpu code is. Id say typical speed ups for reasonably well written codes are more in the region of 10-50 or so for comparing single gtx280 vs one core of a modern cpu. Much is depending her on the problem. If one would use a quad core nehalem (escpecially the better ones like Xeon 5550) and utilize all 4 cores (obviously the problem can be solved in parallel otherwise you would not talk about CUDA) the speedup vs a single 280GTX is probably more in line with 2-10. For comparing a single 8800gtx vs an i7 id exspect defintely speed ups of <10 if one compares against 4 cores of an i7. On the other hand I find it more easy to write good code with CUDA when say using MPI or so for parallel code on a cpu. Also you can have two GPUs in one chassis with realtively low extra costs, and two GPU programs will hardly slow each other down.

Here is a thread which discusses more of that topic:

http://forums.nvidia.com/index.php?showtop…t=0#entry997641

(its worth reading especially the fact that people could speed up the cpu code of the original poster by a factor of 1000 after optimizin the cpu code).

Best regards
Ceearem

A x2600 sounds indeed too high :) as ceearem said - it is probably because of a bad CPU code.

As for the other remark, of comparing one GPU vs the 4 cores of a quad - this is also misleading.
I think the compare should be a single GPU vs a single CPU core. You can easily put 4 or 8 GPUs in a system
making it equal to the numbers of CPU cores. so the speedup will remain the same.

eyal

It probably means they had a very crappy CPU reference implementation (which is not at all uncommon).

Depending on the nature of the compute task, the natural upper bound on speed-up should be the ratios of peak memory bandwidth and peak flops between the cpu and gpu. That should be about an order of magnitude for a current quad core CPU and a GT200 based GPU. Applications which perform a lot of interpolation or filtering can achieve higher speed-up if the code can exploit the GPU texture hardware well - that collapses (memory fetch + flops) on the CPU to just (memory fetch) on the GPU.

I agree that this is possible, on the other hand code often does not scale as well over multiple GPUs as over multiple cores of one cpu, since you can use shared memory in the latter case, while you have to do slow synchronization over PCIe in the former case. Thats why I added that one reason why a factor of 2-10 vs a Quad core CPU is not bad at all when you consider that you could run two or more instances of the programm when adding additional GPUs to the system which is not as costly as adding additional CPUs.

Best regards

Ceearem

I did a GA on CUDA and got 10 to 30x speedup. The core fitness was 55x though… This one is a TESLA against an AMD Athlon 2.41GHz

However, the CPU implementation can be improved a lot… In my case, it is a standard software available as free source on the net…

I see that the paper runs multiple GAs in parallel. Each GA happens within a block (island) – which is an interesting proposition - It may suit some kind of problems that require multiple populations to be grown and simulated.
Anyway, It does not explain 2600x… but I think it is a great idea to do that - providing the problem requires it.

Hello everyone!

I’m the autor of “paper” above. I’m working on department together with Lukas Sekanina and Jiri Jaros.

I’d like to point out several things first:

    []Work mentioned above is just a very brief summary and doesn’t contain several important details. Full paper about this topic will be released at EVO conference this April.

    Speedup is computed against single-threaded CPU application. As reference, we used GaLib library (originated from MIT) because we wanted to compare results quality with very well debugged code. This library isn’t very well optimised (our code was about 30% faster), but its often used in practical benchmarks.

    We managed to fit whole working set to shared memory within multiprocessors and there is very little data dependency as well as massive paralelism (thousands of independent threads running in parallel in addition running same code). Because of this, GPU is ideal for this task.

    Simple artificial benchmark functions were tested.

I agree this number seems to be very high, but as I said, in this particular case, GPU is ideal for the job and number is not fake. If you have any questions, Im here to answer them.

have a nice day,

Peter

Hi

ok I agree that it is at least a good approach to compare against established code. Reading your paper there are several points which come to my mind why this high factor can be achieved:

first: it looks as if you are totally compute bound so that should give already a factor of maybe 25 or so compared to a single threaded cpu library

second: you wrote that parameters are implemented inside of the code as constants instead of real parameters, this alone already should give a significant boost (especially if the code is relatively simple) but it would also suggest that the functionality of your code is significantly lower than the reference library. This is one point one should generally keep in mind: one can safe a lot of time if you reduce the scope of what is possible with your code.

third: how much time does the random number generator cost? In some simple codes (i.e. ising system) compute time is dominated by the random number generator. Depending on the used algorithm the time for the random number generator can vary easily by an order of magnitude.

best regards
Ceearem

Well, while it’s not a fake measurement, in my opinion it’s not fair to compare a general purpose, non-optimized, single-threaded library to narrow-scoped, highly tuned code. You know, you could’ve just taken an implementation written in bash script and it wouldn’t be “fake” either…

Now if you had compared general-purpose, extensible GPU GA code (good luck with that ;) ) with GaLib or alternatively your code with hand-tuned, multithreaded and vectorized CPU code, then that would be good science. Right now, IMHO, that’s just trying to prove how awesome GPUs are.

Hey Peca, your CUDA implementation seems like a good one. Maybe you could write a CPU version for that implementation and tell us the speedup against the GaLib library? You said your code is 30% faster than the GaLib library. I suppose the code you are referring to is not the CPU version of your CUDA implementation. Because if it were, the actual speedup would still be 2000x, which doesn’t seem to be theoretically possible.

I was discussing some time ago similar paper (maybe this one) with colleague at work and it is GaLib library, it is sluggish = slow CPU code. Just profile it and you will see.

You must use good CPU code to compare it with GPU.

Is this (CUDA) code published anywhere? I’d be interested in benchmarking it.

Well I’m not sure about 2600x speedup, and I agree with all the above, however here is simple test program which shows about 1000x speedup on gpu vs single core cpu (tesla vs Zeon 2.66).

The code executes a sigmoid function on every element of an array. I’ve not used shared memory so it’s not really optimised on the gpu, and the cpu code is good, so this is a very fair comparison. It is however a problem ideal for parallelisation so most code would not show such a huge speedup, but just to show whats possible…

Of course you could multithread the cpu code, but you could also multi-card the gpu…
sigArray.cu (2.02 KB)

I don’t want to hijack this thread any further, but if you think that CPU code is “good” (by which I presume you mean optimal), you have got a lot to learn. I don’t believe you could possibly call comparing that device and host code “fair”.

What I mean is that niether code is optimised, in fact as all it does is 1/1+e(-a[i]) in both cases, per thread in the GPU and per iteration in the CPU I think it is a fair comparison. There is virtually no difference between the cpu and gpu implementation. Perhaps you disagree?

Yeah, the CPU code is not even parallelized.

Also, you wouldn’t benefit from using shared memory here at all.

More to the point, is the CPU code using SSE? If not, I believe there’s a factor of four immediately. This would bring the ratio of single core CPU to GPU to about 250x, which is believeable for something totally compute-bound.

Exactly. The original code, compiled with gcc 4.3, takes 45ms to run on one core of my 3GHz Phenom II. This:

#include <xmmintrin.h>

#include <acml_mv_m128.h>

void incrementArrayOnHost2(float *a, int N)

{

  int Nser = N%4;

const __m128 ones = _mm_set1_ps( 1.f );

  const __m128 minusones = _mm_set1_ps( -1.f );

for (int i=0; i < N; i+=4) {

	__m128 expavals = __vrs4_expf( _mm_mul_ps( minusones, _mm_load_ps( &a[i] ) ) );

	__m128 result   = _mm_rcp_ps( _mm_add_ps( ones, expavals) );

	_mm_store_ps(&a[i], result);

  }

for (int i=Nser; i>0; i--) a[N-i] = 1.f /(1.f+fastexpf(-a[N-i]));

}

takes 10 millseconds for the same. My GTX275 runs the device code in 0.11 milliseconds, so the 1000x times figure just became 100x.

Using the AMD core math library here? When I google __vrs4_expf it’s all AMD ;)

So now when this code runs in 4 cores, the factor becomes 25. Still impressive, but much less impressive than 1000.

Yeah ACML has pretty good sse and sse2 vector implementations of trancendentals (plus it is free as in beer, unlike Intel’s MKL). Previous experience suggests that there is anything up to another order of magnitude speed up available if the transcendental functions can be inlined (which ACML can’t), but that was as good as I good do in 10 minutes…