I am porting a scientific code to the GPU, developing mainly on a laptop (8600M GT) and running production
simulations on a GTX 460.
So far this has not been an issue: results for the same executable were identical on both machines.
However, after using shared memory to optimize, the same binary, compiled on the laptop for compute capability 1.1 and using only
float to represent real numbers, gives very different results on the two architectures, as shown in the attached png.
The plot shows the sum of a scalar field on a lattice at each timestep of the algorithm for the same binary run
on the two chips. You can see that the difference is enormous.
The results for the program running on the GTX 460 (red line) are nearly indistinguishable from the results given by
the original CPU-only program, as was the case on both machines with the old GPU version of
the program that did not use shared memory.
This sounds suspiciously like a race condition when threads access the shared memory. I would check to make sure you have __syncthreads() in the appropriate places.
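For illustration only (a generic 1-D lattice stencil, not your kernel): the barrier has to sit between the cooperative stores into the shared tile and the first read of a neighbour's element, and if the tile is refilled inside a loop, a second __syncthreads() is needed before the refill.

// Minimal sketch: cooperative load of a tile plus halo cells, barrier, then the stencil read.
__global__ void stencil_sum(const float *in, float *out, int n)
{
    extern __shared__ float tile[];          // blockDim.x + 2 floats, sized at launch
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    // cooperative load: body element plus the two halo cells
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

    __syncthreads();   // no thread may read a neighbour's tile entry
                       // until every thread has finished writing its own

    if (gid < n)
        out[gid] = tile[lid - 1] + tile[lid] + tile[lid + 1];
}

Launched with the tile size passed as the dynamic shared memory argument, e.g. stencil_sum<<<blocks, threads, (threads + 2) * sizeof(float)>>>(d_in, d_out, n). Missing a barrier like that tends to show up exactly this way: harmless under one chip's scheduling, badly wrong under another's.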
That’s exactly the sort of behaviour one might see with a race condition - they’re tricky like that. Furthermore, 8600M to GTX 460 is a huge jump in hardware, leaving plenty of room for a latent race condition to be exposed. However, without seeing actual code, all anyone on this board can do is speculate wildly.
Are you checking the return codes of the CUDA calls? A GTX 460, and any other Fermi-class GPU, should generate an error when an out-of-bounds shared memory access occurs. (Older GPUs did not detect shared memory violations.)
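For example, something along these lines after every call and after each kernel launch (the macro and the dummy kernel are just an illustration, not your code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if any CUDA runtime call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

__global__ void dummy(float *p) { p[threadIdx.x] = 0.0f; }   // stand-in kernel

int main()
{
    float *d;
    CUDA_CHECK(cudaMalloc(&d, 64 * sizeof(float)));

    dummy<<<1, 64>>>(d);
    CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());   // catches errors raised while the kernel
                                           // ran, e.g. an out-of-bounds access

    CUDA_CHECK(cudaFree(d));
    return 0;
}

The synchronize after the launch is what actually surfaces errors like an out-of-bounds shared memory access, since the launch itself returns before the kernel has run.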
Some random speculation: this race condition may well have popped up because the Fermi card was able to run more than one block per MP (I don’t know your register usage or block size, so I’m just guessing). That would mean blocks could ‘step on each other’s toes’ in shared memory. On the older card, the kernel was restricted to one block per MP, so the extra smem usage didn’t affect anything.
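If you want to test that guess, the kernel’s register and shared-memory footprint can be queried at runtime and compared against what the device offers per block. A rough sketch along these lines (myKernel and its 256-element tile are placeholders, not your code):

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel: substitute the real one to get its actual resource usage.
__global__ void myKernel(float *field)
{
    __shared__ float tile[256];
    tile[threadIdx.x] = field[threadIdx.x];
    __syncthreads();
    field[threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);

    printf("kernel : %d registers/thread, %zu bytes static shared memory per block\n",
           attr.numRegs, attr.sharedSizeBytes);
    printf("device : %d registers, %zu bytes shared memory available per block (cc %d.%d)\n",
           prop.regsPerBlock, prop.sharedMemPerBlock, prop.major, prop.minor);

    // Dividing the device figures by the kernel figures (times the block size,
    // for registers) gives a rough upper bound on how many blocks can be
    // resident on one MP; the occupancy calculator does this properly.
    return 0;
}

Running that (or plugging the numbers into the occupancy calculator) for each card would show whether the Fermi part really is fitting several blocks per MP where the 8600M only fits one.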