I am porting a scientific code to the GPU, developing mainly on a laptop (8600M GT) and running production
simulations on a GTX 460.
So far this has not been an issue, as results for the same executable were the same on both machines.
However, after optimizing with shared memory, the same binary (compiled on the laptop for compute capability 1.1,
using only float to represent real numbers) gives very different results on the two architectures, as shown in the attached png.
The plot shows the sum of a scalar field on a lattice at each timestep of the algorithm for the same binary run
on the two chips. You can see that the difference is enormous.
The results from the GTX 460 (red line) are nearly indistinguishable from those of the original CPU-only program,
just as the results from both machines were for the old GPU version of the program that did not use shared memory.
I cannot understand this.
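For context, the kind of shared-memory pattern involved here is the standard per-block tree reduction used to sum a field over the lattice. Below is a minimal sketch of that pattern, not my actual kernel; the names `sumField` and `BLOCK` are illustrative only.

```cuda
// Minimal sketch of a per-block shared-memory tree reduction summing a
// scalar field over a lattice. Illustrative only, not the production kernel.
#define BLOCK 256

__global__ void sumField(const float *field, float *partial, int n)
{
    __shared__ float cache[BLOCK];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one lattice site (0 if past the end of the array).
    cache[tid] = (i < n) ? field[i] : 0.0f;
    __syncthreads();

    // Halve the number of active threads each step; every step needs a barrier
    // so all partial sums are visible before they are read.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            cache[tid] += cache[tid + s];
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum; the host adds the partials.
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}

// Launch example:
// sumField<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_field, d_partial, n);
```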