float accuracy different chips

Hi All

I am porting a scientific code to the GPU, developing mainly on a laptop (8600M GT) and running production
simulations on a GTX 460.
So far this has not been an issue, as results for the same executable were the same on both machines.

However, after optimizing with shared memory, the same binary, compiled on the laptop for compute capability 1.1 and using only
float to represent real numbers, gives very different results on the two architectures, as shown in the attached png.

The plot shows the sum of a scalar field on a lattice at each timestep of the algorithm for the same binary run
on the two chips. You can see that the difference is enormous.
The results for the program running on the GTX 460 (red line) are nearly indistinguishable from the results given by
the original CPU-only program. The same was true on both machines for the old GPU version of
the program, which did not use shared memory.

I cannot understand this.
Please help

This sounds suspiciously like a race condition when threads access the shared memory. I would check to make sure you have __syncthreads() in the appropriate places.
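
As a rough illustration of where the barriers usually belong, here is a minimal sketch of a per-block sum in shared memory. It is not the poster's kernel (no code was shown); the names blockSum, tile and kBlock are made up for the example, and it assumes the kernel is launched with exactly kBlock threads per block.

```cuda
#include <cuda_runtime.h>

constexpr int kBlock = 256;  // hypothetical block size; must be a power of two

// Each block sums up to kBlock elements of `in` and writes one partial sum to `out`.
// Assumes the launch uses blockDim.x == kBlock.
__global__ void blockSum(const float* in, float* out, int n)
{
    __shared__ float tile[kBlock];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Threads past the end of the array contribute zero.
    tile[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();  // all writes must land before any thread reads another thread's slot

    for (int stride = kBlock / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();  // kept outside the `if` so every thread in the block reaches it
    }

    if (tid == 0)
        out[blockIdx.x] = tile[0];
}
```

The two things to look for in real code are a barrier between writing shared memory and reading a slot written by another thread, and no barrier hidden inside a branch that only some threads of the block take.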

Thanks for the suggestion.

AFAIK, __syncthreads is where it should be. Adding extra __syncthreads does not help either.

Actually, what happens is that if I remove __syncthreads(), the 8600M result does not change, as if __syncthreads() were not doing anything, while of course the code run on the 460 is then also incorrect.

How can this be?

That’s exactly the sort of behaviour one might see with a race condition - they’re tricky like that. Furthermore, 8600M to GTX 460 is a huge jump in hardware, leaving plenty of places for race conditions to become exposed. However, without seeing actual code, all anyone on this board can do is speculate wildly.

It seems I nailed it.

I use dynamically allocated shared memory. The sh_mem size parameter used at kernel launch was too small, for example 1928 instead of 2048 bytes.
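
One way to avoid this kind of mismatch is to derive the byte count from the same quantities the kernel indexes with, rather than typing a number at the launch site. The sketch below assumes a kernel that needs one float of dynamic shared memory per thread; scaleViaShared, sharedBytes and launchScale are hypothetical names, and the kernel body is only a placeholder.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel: uses one float of dynamic shared memory per thread.
__global__ void scaleViaShared(float* data, int n, float factor)
{
    extern __shared__ float buf[];  // size is set by the third launch parameter
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    if (gid < n) {
        buf[tid] = data[gid];            // each thread touches only its own slot,
        data[gid] = buf[tid] * factor;   // so no barrier is needed in this placeholder
    }
}

// Bytes of dynamic shared memory the kernel needs for a given block size.
inline size_t sharedBytes(int threadsPerBlock)
{
    return static_cast<size_t>(threadsPerBlock) * sizeof(float);
}

// The launch takes its shared memory size from the same helper the sizing logic lives in.
void launchScale(float* d_data, int n, float factor, int threadsPerBlock)
{
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleViaShared<<<blocks, threadsPerBlock, sharedBytes(threadsPerBlock)>>>(
        d_data, n, factor);
}
```

With this layout, 512 floats per block give 2048 bytes, while 1928 bytes would only cover 482 floats.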

Now the program works fine, but on a normal CPU that would have caused a SIGSEGV, and I still cannot understand why the superior hardware let it gently slip while the laptop instead acted as if no synchronization was happening.

Would cuda-gdb detect such an error, using set cuda memcheck on?

Are you checking the return codes of the CUDA calls? A GTX 460, and any other Fermi-class GPU, should generate an error when an out-of-bounds shared memory access occurs. (Older GPUs did not detect shared memory violations.)
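
For reference, a minimal sketch of what checking return codes can look like. The helper names cudaOk and kernelOk are made up for this example; cudaGetLastError, cudaGetErrorString and cudaDeviceSynchronize are CUDA runtime calls, with cudaDeviceSynchronize assuming a reasonably recent toolkit (older toolkits used cudaThreadSynchronize for the same purpose).

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Returns true on success; otherwise prints the CUDA error string and returns false.
inline bool cudaOk(cudaError_t err, const char* what)
{
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Call right after a kernel launch.
inline bool kernelOk(const char* name)
{
    // Launch-time errors, e.g. an invalid launch configuration.
    if (!cudaOk(cudaGetLastError(), name)) return false;
    // Errors raised while the kernel runs only surface at a synchronizing call.
    return cudaOk(cudaDeviceSynchronize(), name);
}
```

A typical use would be `myKernel<<<grid, block, smemBytes>>>(...);` followed by `if (!kernelOk("myKernel")) { /* stop and investigate */ }`, where myKernel stands in for whatever kernel is being launched.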

Some random speculation: this race condition may well have popped up because the Fermi card was able to run more than one block per MP (I don’t know your register usage or block size, so I’m just guessing). This meant that blocks could ‘step on each other’s toes’ in shared memory. On the older card, the kernel was restricted to one block per MP, so the extra smem usage didn’t affect anything.