time-dependent simulation problems with floating point precision

Hi all,

I implemented a dynamic (time-based) algorithm from the world of materials science and am running out of steam with the simulation due to floating point precision.

System: SUSE 11.1, driver 180.11, Quadro FX 4600

If I compare (diff) the results after one time loop, I already find differences between Gold (CPU implementation) and CUDA (GPU implementation) like this:

Now, this was for a small system size and doesn’t matter much; however, depending on the system size, I also get loads of NaNs and run into problems with the precision not being correct.

This difference fluctuates somewhat. I trust my C code, which works rock-solid, and also the numerical/algorithmic correctness of the GPU implementation, which has been double-checked several times now.

I checked every mathematical operation and made sure that float literals are written as X.Xf, or as X.YYYYZf where more precision is needed, e.g. 1/6 = 0.166667f.

The algorithm itself doesn’t actually need double precision, which is why I implemented it in CUDA. For the C version, everything works perfectly using float! Again, the Gold version works perfectly in C code for any number of time steps and any system size (all 2D) with a similar implementation. The kernel is launched from a for loop REPEAT times.

Is there anything I can do here besides buying sm_13 hardware?

Any compiler settings I should consider?

Do the kernel execution configuration settings dim3(x,x), dim3(y,y) influence this?

Within the algorithm, only +, -, * and / are used, no special functions like sin, etc. (because I am aware that I should use sinf in such cases.)

Any hints or help would be really appreciated: a massive speedup can already be seen for large system sizes, even though I am not yet using shared memory, but the GPU output is wrong. Thanks!

:-)

Cheers,

Lx

Hi!

There are always differences between the results of CPU and GPU codes, and there are two main reasons for those differences:

  1. The x87 floating point unit of the x86 CPU always uses 80-bit precision internally for any calculation; only storage is done using 32 or 64 bits. Your single-precision GPU uses 32 bits for storage AND arithmetic.

  2. Floating point addition isn’t associative, even though arithmetic on real numbers is. This is due to the finite granularity and rounding of floats. E.g.:

float sum = 0.0f;
for( int i = 0; i < 100000000; i++ )   // 10^N iterations, here N = 8
    sum += 1.0e-8f;                    // each term is 10^-N
sum += 1.0f;

Naively, we expect sum to come out near 2.

This is not necessarily the same as

float sum = 1.0f;
for( int i = 0; i < 100000000; i++ )   // the same 10^N iterations
    sum += 1.0e-8f;

Here sum stays exactly 1, because 1 + 1e-8 is rounded back to 1 in single precision.

Since the order of execution of all threads is nondeterministic, one can’t predict the exact results.

It is pretty hard to provide specifics given the complete lack of detail, but I will make a couple of observations:

  1. Your host CPU does floating point arithmetic using 80 bits internally, even on single-precision types. ULP differences between the same calculations done on the host and on the GPU are to be expected.

  2. The CUDA programming guide describes the error bounds and rounding behaviour of every intrinsic function, so the expected behaviour of a given calculation can be predicted a priori (and compensated for, if necessary, when it will deviate from IEEE 754 in a way which will upset your algorithm).

  3. I would never expect a well-implemented floating point algorithm to return NaN values, even at the onset of overflow or underflow. That alone suggests you are doing something wrong, whether you recognize it or not.

Absolutely. I second this statement.

It can happen more easily than you think with algorithms that solve time-dependent differential equations via small time steps. When the step size is too large, the solution becomes unstable and errors accumulate exponentially in time. After just a few steps, NaNs can easily result as solutions blow up to infinity.

However, the onset of instability comes from a truncation error which is much larger than round-off error differences between single and double precision. In my work, I’ve found that the limit of stability is the same for both single and double precision. That doesn’t necessarily mean that ludx’s system is the same… Still, a bug (e.g. the kernel 5s timeout, a memory read past the end of an array…) may be a more likely source for NaNs here. Is your system based on a delta-t timestep? What happens if you lower delta-t?
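As a toy illustration of that mechanism (not ludx’s algorithm, just a hypothetical scalar example): explicit Euler applied to dy/dt = -k*y is only stable for dt < 2/k; above that limit the iterates grow without bound and overflow, regardless of whether you use float or double.

#include <stdio.h>

int main(void)
{
    const float k  = 100.0f;
    const float dt = 0.03f;          /* stability limit for explicit Euler is dt < 2/k = 0.02 */
    float y = 1.0f;

    for (int step = 0; step < 200; step++)
        y += dt * (-k * y);          /* y_{n+1} = (1 - k*dt) * y_n, here |1 - k*dt| = 2 */

    printf("y after 200 steps: %g\n", y);   /* blows up to inf; dt below 0.02 behaves nicely */
    return 0;
}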

Of course CFL limits can come into play in time marching schemes, as can shocks in strongly hyperbolic problems and non-physical frequency components in elliptic problems. But that has nothing to do with the relative floating point accuracy of two processors, and everything to do with the properties of the PDEs being solved and the numerical methods used to solve them.

I stand by what I wrote. You aren’t going to get numerical blow up unless your code implements something which is mathematically inappropriate for the problem at hand, or the code itself contains algorithmic errors. And for that to occur on one platform and not on another either means your algorithm is relying on subtleties of IEEE 754 that might vary from platform to platform (floating point predicates for computational geometry are a great example of this), or there is something seriously wrong, and in my experience, 9999 out of 10,000 times it is the code and not the compiler or the hardware.

I just take this statement personally. As you have worded it, my code (along with every other MD package out there) is thus deemed “mathematically inappropriate”. The user chooses the step size and too large a step size can lead to NaNs. Moreover, it is not possible to know a priori where the stability threshold lies without first doing some test simulations to find out!

Did you miss the part where I agreed with you?

This is all getting off-topic, though. I was merely posing an alternative possibility to the OP for the source of the error. If the code is based on delta-t steps AND lowering it makes the problem go away, then the likely culprit is the region of instability. It would be a simple check that would take less time than debating over and over what is or isn’t mathematically appropriate, which won’t solve the OP’s problem anyway.

To get back on topic, I’ll give some more detailed answers:

Quadruple-check for bugs first. If you gained anything from the bickering in the previous posts, it should be that single vs. double precision is probably not the cause of your problems.

Not really. nvcc performs only non-dangerous optimizations by default. Don’t use -use_fast_math, as it can silently change sinf() to __sinf() and the like without you knowing about it (__sinf() and similar intrinsics are only accurate for a narrow input range).
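If you ever do want the fast intrinsic in one specific place, it is safer to call it explicitly there than to flip the global flag. A minimal sketch (hypothetical kernel, not from the OP’s code):

__global__ void phases(float *out, const float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = sinf(x[i]);       // accurate version; error bounds are in the programming guide
        // out[i] = __sinf(x[i]);  // fast intrinsic: opt in per call site, only for small |x|
    }
}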

Only in the order of operations. As you should know, floating point arithmetic is not associative, and the order in which operations are performed can change the results. The differences will only be in the 6th significant digit or smaller from one timestep to the next. If your system is chaotic, then the precise behavior over longer trajectories can diverge significantly as you are following different bifurcations in the chaotic system.

Edit: Oh, I should add that too large a setting for the block size might cause your kernel to fail to launch. This is one of the many important reasons to check for errors returned by CUDA…

One big thing I will add is that you should check for error return values from every CUDA call and add a cudaThreadSynchronize(); cudaGetLastError() after every kernel to see if any fail. If a call failed, your NaNs could be just garbage data. Of course, you’ll want to be able to disable the error checks after kernel calls with a flag for performance reasons once you get everything debugged.
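A minimal sketch of the kind of check described above (CUDA 2.x-era runtime API; wrap it in an #ifdef or a flag so it can be disabled once debugged):

#include <stdio.h>
#include <stdlib.h>

#define CHECK_LAUNCH()                                                    \
    do {                                                                  \
        cudaThreadSynchronize();                                          \
        cudaError_t err = cudaGetLastError();                             \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",                  \
                    __FILE__, __LINE__, cudaGetErrorString(err));         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// usage, right after each kernel launch:
//   myKernel<<<grid, block>>>(d_data);
//   CHECK_LAUNCH();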

Other likely culprit bugs are reading/writing past the end of arrays as I mentioned. Compiling in emulation mode and running in valgrind can be immensely helpful here, as can careful hand checks for problems.

Here are a few things to think about:

Maybe a few more details about your algorithm could help us guess? Are you trying to do any inter-block communication that would introduce race conditions? Exactly how many time steps does the solution take to start diverging to inf / NaN? Is it the same number every time? Are the values before you get to inf / NaN repeatable bit for bit every time you run your code on the GPU (they should be)?

This is getting back a little to the float vs double argument, but:

Is there a situation in the code where you are subtracting two numbers that are very near each other? That can cause serious problems in both single and double precision calculations.
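A tiny example of why that hurts in single precision (the inputs themselves are already rounded, so the difference keeps almost none of the true digits):

#include <stdio.h>

int main(void)
{
    float a = 1.000001f;         /* stored as 1 + 8 ulp, about 1.00000095 */
    float b = 1.000000f;
    float d = a - b;             /* exact answer would be 1e-6 */
    printf("%.9g\n", d);         /* prints ~9.5367e-07: several percent off */
    return 0;
}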

In my experience, the “almost IEEE-754” single-precision implementation is good enough for most practical cases, and the deviations from the standard can be ignored. Note that I’m not saying single precision is always sufficient; it is insufficient in many situations, including the FE stuff I am working on. My point is that for all practical cases, the differences between CPU and GPU single precision are in the noise.
Anyway, to rule out floating point issues, force the compiler to use true 32-bit single precision on the CPU. This is “easy”: just force it to use the SSE registers. Some compilers even have quite a few flags to control the precision; for Intel and gcc I believe it’s -msse2 or -msse3 plus -mfpmath=sse, read the man pages. If the compiler spits out tons of “LOOP WAS VECTORIZED” messages for the relevant bits of code, it succeeded in SSE’ing it (for gcc, you have to increase the verbosity of the -ftree-vectorize report).
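For what it’s worth, a small host-side test of this (the flags in the comment come from the gcc man page; exact results depend on compiler version and optimization level, so treat it as a sketch):

/* x87 (80-bit intermediates):  gcc -O2 -mfpmath=387 fptest.c -o fptest
 * SSE (true 32-bit floats):    gcc -O2 -msse2 -mfpmath=sse fptest.c -o fptest
 * The SSE build should match the GPU's single-precision arithmetic more closely.
 */
#include <stdio.h>

int main(void)
{
    volatile float in[4] = { 1e8f, 1.0f, 1.0f, -1e8f };  /* volatile: prevent constant folding */
    float s = 0.0f;

    for (int i = 0; i < 4; i++)
        s += in[i];              /* x87 may keep the running sum in 80 bits, SSE rounds to 32 */

    printf("sum = %.9g\n", s);   /* typically 2 with x87, 0 with SSE */
    return 0;
}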

Hi all!

Sorry for the delayed reply, but it took some more hours of development to sort things out before answering.

Once again, the algorithmic implementation was fully correct. No changes to the code (e.g. BC) were needed.

However, it turned out that having too many operations within one kernel messes up the complete simulation.

Splitting the kernel into several (4) sub-parts did the trick. As mentioned before, rounding differences like the one mentioned earlier in the thread still occur sometimes, but they stay “constant” over time, so the simulation after, for example, 10 time steps is just as correct as the simulation after 100000 time steps, and we can forget about the last significant digit.

Since splitting the kernel did the trick, can any one of you explain why that was needed, instead of doing most calculations within one kernel?

Maybe something was wrong with the arrays?

OK, let’s clean the code and put some make up on top of it.

Best regards,

Lx

Are you checking for error codes in your kernel calls? If you have X active on your FX 4600, then a watchdog timer will terminate kernels which run for more than a few seconds. It is possible that the single kernel runs too long and hits the watchdog timeout, but the separate kernels do not.

Hi seibert,

sorry for leaving your reply alone for a while.

What do you mean by having X active, or better: how do I set or unset it, and where?

Cheers,

Lx

He means whether you are running an X11 server on the card you are using for CUDA simulations. If you are, the NVIDIA X11 driver has a watchdog timer which will kill any kernel that runs for more than about 5 wallclock seconds on a GPU shared with X11. That could be the difference in behaviour between a single big kernel containing the whole algorithm and a sequence of smaller kernels which implement the algorithm in stages.
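A quick way to check whether the watchdog applies to the device you are using is to query the kernelExecTimeoutEnabled field of the device properties (a small sketch using the runtime API):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // device 0; adjust if you use another GPU
    printf("%s: run time limit on kernels is %s\n", prop.name,
           prop.kernelExecTimeoutEnabled ? "ENABLED (display attached)" : "disabled");
    return 0;
}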

OK, X11.

Well, there is X11 running yes.

The 4 kernel calls split the work into roughly 1/4 of the wallclock time each in my case. I am trying to debug why the single kernel still spits out NaNs, even though the algorithm is consistent.

I introduced temporary arrays to hold values, but somewhere there is “interference”.

Does it make sense to have different indices within one kernel?

Since this is a time-based simulation, I still wonder whether a time loop outside the kernel, versus inside the kernel, slows the simulation down drastically, since our algorithm itself is very fast at roughly 100 FLOPs only (rough guess).
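For reference, the “time loop outside the kernel” variant looks roughly like this (placeholder names, the actual update kernel is omitted); each launch also acts as a global synchronization point between blocks:

// Rough sketch (hypothetical names): one kernel launch per time step,
// ping-ponging between two device arrays so each step reads consistent data.
for (int step = 0; step < REPEAT; step++) {
    stepKernel<<<grid, block>>>(d_new, d_old, nx, ny, dt);
    cudaThreadSynchronize();             // optional here, but lets you check errors per step

    float *tmp = d_old;                  // swap buffers for the next step
    d_old = d_new;
    d_new = tmp;
}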

Cheers,

Lx

To paraphrase Don Knuth: “Premature optimization is the root of all evil”.

I would be focusing on producing an implementation that works rather than worrying about niceties like performance. It seems you have fundamental problems if your code is returning NaN values from what is supposedly innocuous and straightforward floating point arithmetic. But I will repeat my earlier complaint. You haven’t as yet provided a scrap of information about what algorithm you are talking about here, or described even a line of pseudocode, let alone anything concrete. It is pretty hard to answer abstract questions about why your code doesn’t work on the basis of such skimpy details.

I know that phrase from Knuth, mate.

Sorry for not publishing code here, and once more: the code works using 4 kernels (split).