I implemented a dynamic (time-based) algorithm from the world of materials science, and I am running out of steam in the simulation due to floating-point precision.
System: SUSE 11.1, driver 180.11, Quadro FX 4600
If I compare (diff) the results after one time loop, I already find differences between Gold (CPU implementation) and CUDA (GPU implementation) like this:
Now, this was for a small system size and doesn’t affect the results much; however, depending on the system size, I also get loads of NaNs and run into problems because the precision is not correct.
This difference fluctuates somehow. I rely on my C code, which works rock-solid, and on the numerical/algorithmic correctness of the GPU implementation, which has been double-checked several times by now.
I checked every mathematical operation and made sure every constant is a float literal (X.Xf), with more digits where precision is needed, e.g. 1/6 = 0.166667f.
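For example (a minimal sketch with made-up names, not my actual kernel), constants are written so that no expression gets promoted to double:

    __device__ float averageOfSix(float a, float b, float c,
                                  float d, float e, float g)
    {
        // 1/6 written as a float literal (0.166667f), never as 1.0/6.0,
        // so the whole expression stays in single precision.
        return (a + b + c + d + e + g) * 0.166667f;
    }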
The algorithm itself doesn’t actually need double precision, which is why I implemented it in CUDA. The C version works perfectly using float! Again, the Gold version works perfectly in C for any number of time steps and any system size (all 2D) with a similar implementation. The kernel is executed in a for loop REPEAT times.
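The host-side time loop looks roughly like this (simplified sketch; the buffer and kernel names are placeholders, the real code differs):

    dim3 grid(nx / BLOCK_X, ny / BLOCK_Y);
    dim3 block(BLOCK_X, BLOCK_Y);

    for (int step = 0; step < REPEAT; ++step)
    {
        // one explicit time step per launch
        timeStepKernel<<<grid, block>>>(d_old, d_new, nx, ny);
        cudaThreadSynchronize();                        // CUDA 2.x-era sync
        float *tmp = d_old; d_old = d_new; d_new = tmp; // ping-pong the 2D fields
    }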
Is there anything I can do here besides buying sm_13 hardware?
Any compiler settings I should consider?
Do the kernel execution configuration settings dim3(x,x), dim3(y,y) influence this?
Within the algorithm only +, -, * and / are used, no special functions like sin, etc. (I am aware that I should use sinf in such cases.)
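One thing I am not sure about (please correct me if this is wrong): as far as I understand, on sm_1x the compiler may contract a*b + c into a single MAD with a truncated intermediate product, and single-precision division is not IEEE-rounded, which alone could explain a diff against the CPU. If so, a hypothetical rewrite of one update line like the following should show whether MAD contraction is the culprit:

    // Inside the kernel; a, b, c are placeholder names.
    // Default: the compiler is free to turn a * b + c into mad.f32,
    // whose intermediate product is truncated.
    float y_default = a * b + c;

    // Forced round-to-nearest multiply and add; these intrinsics are not
    // contracted into a MAD, so the result should be closer to the CPU.
    float y_rn = __fadd_rn(__fmul_rn(a, b), c);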
Any hints or help would be really appreciated, because a massive speedup can already be seen for large system sizes; however, I am not using shared memory yet, and the GPU output is still wrong. Thx!