I have a small app which solves a set of coupled differential equations in 1D space + time. Each iteration is quite small (around 100 points to advance in time), but each point requires a fair amount of computation, including image lookups. I have implemented it in two different ways: one using only global memory, and one using local memory plus a halo (5-point stencil, so 2 halo points on each side) for the finite differences. Here are the kernel execution times:
8600GT global memory version:
8600GT local memory version (local group size 32 is optimal):
GTX 285 global memory version:
GTX 285 local memory version (local group size 32):
GTX 285 local memory version (local group size 2 is optimal!):
For reference, a C++ version on a 2GHz Xeon takes 18.1s.
Both GPU versions use single precision, while the CPU version uses double precision.
Any ideas as to why the GTX 285 is SLOWER than the 8600GT when using local memory, and only just as fast when using global memory?
Everything runs on Linux with the latest Nvidia beta drivers.