Bizarre floating point division problem

Hello all. I’ve just spent several days trying to solve a problem in my code. I finally resolved it quite by accident and I don’t understand why, but I suspect it has something to do with the vagaries of floating point division. I’m not an expert at such matters, but perhaps someone here can explain it to me.

My code contains a device function calc_f_tilde_d that is called by a kernel on every timestep. It looks like this:

__device__ void calc_f_tilde_d(
      int dir
    , int thread
    , int block_size
    , double* f_temp
    , double usq)
{
    f_temp[thread + dir*block_size] *= (1. - 1. / tau_c);

    double vdotu = vx_c[dir]*f_temp[thread + (numdirs_c+1)*block_size]
                 + vy_c[dir]*f_temp[thread + (numdirs_c+2)*block_size];

    f_temp[thread + dir*block_size] += wt_c[dir]
        * f_temp[thread + numdirs_c*block_size]
        * ( 1. + 3.*vdotu
            + 4.5*vdotu*vdotu
            - 1.5*usq
          ) / tau_c;
}

If I run the code for certain values of tau_c (a constant memory object), the result looks like a numerical instability (the output is a density field, which blows up after a time), except that it has an unphysical dependence on the domain configuration (for example, a reflection of the domain makes the 'instability' disappear). I knew the problem had to be in this function, because it went away when the function was removed.

Then, by accident, I discovered that the following code works perfectly every time:

__device__ void calc_f_tilde_d(
      int dir
    , int thread
    , int block_size
    , double* f_temp
    , double usq)
{
    double invtau = 1. / tau_c;

    f_temp[thread + dir*block_size] *= (1. - invtau);

    double vdotu = vx_c[dir]*f_temp[thread + (numdirs_c+1)*block_size]
                 + vy_c[dir]*f_temp[thread + (numdirs_c+2)*block_size];

    f_temp[thread + dir*block_size] += wt_c[dir]
        * f_temp[thread + numdirs_c*block_size]
        * ( 1. + 3.*vdotu
            + 4.5*vdotu*vdotu
            - 1.5*usq
          ) * invtau;
}

Upon further experimentation, it seems that it is the replacement of the second occurrence of 1. / tau_c (the division at the end of the function) that determines whether the code behaves as it should.

Does anybody know why this is happening?

My guess is that this code:

(1. - 1. / tau_c);

is being evaluated as

((1. - 1.) / tau_c);

instead of

(1. - (1. / tau_c));

Which is why pulling out the (1. / tau_c) into a separate variable fixes it. I don't remember the operator precedence rules for C (so I can't say whether your code is at fault or it's a compiler bug), but it would be a good idea to use parentheses to force an explicit order of evaluation, for peace of mind.

Does this happen with the CUDA 4.1 toolchain? Would you be able to post a minimal, compilable and runnable repro code that demonstrates the result difference due to the code modification? Ideally it would run a single thread block with a single thread only; if there is a code generation issue the problem should reproduce with a simplistic configuration like that.

I’m running CUDA 3.0 - I’m not up to 4.x yet. I tried writing a simple code to reproduce the problem, but that failed. Actually, I think the problem is more complicated than the replacement of 1. / tau_c that I discussed: while going from 1. / tau_c to invtau fixed the problem in the specific test domain I was using, the problem still exists in other parts of larger domains.

This problem seems to occur only for very specific domain configurations. My domain consists of solids and open spaces, and I solve for various quantities in the open spaces, with the presence of adjacent solids affecting the solution. With my old code (with / tau_c), I noted that while a given domain might produce the instability, reflecting the domain horizontally resulted in normal behavior (this is why I think it is a bug and not a numerical instability).

Unfortunately, this code is fairly thoroughly tested on a number of problems, so if it is a simple ‘wrong-code’ bug, it must be very subtle and hard to find.

OK - I’m now convinced that this is an issue with bad code that leads to floating-point error accumulation. Forcing my velocity field to 0 (which it should be anyway) makes the instability go away.

Njuffa - I think I understand why you asked about version 4.1 of CUDA - it will allow me to prevent the compiler from converting to FMADs. Is this correct? Are there any other ‘optimizations’ I should be preventing the compiler from doing? FYI my cards are compute 1.3.

Compilation for 1.x targets uses the Open64 frontend, which does not support nvcc’s new -fmad={true|false} flag (see the documentation). The reason I inquired about the toolchain version is that it only makes sense to investigate issues in the most recent released compiler (at this time, CUDA 4.1) and/or the most recent release candidate (at this time, CUDA 4.2 RC). I would recommend updating your CUDA environment to release 4.1.