Solved: CUDA single-precision code accumulates error while GLSL/Vulkan works, why?

Hello, when I use single-precision floating point, my algorithm accumulates a large error after ~3000 iterations, but the same algorithm in double precision gives good results in CUDA.
However, the same algorithm implemented in GLSL/Vulkan gives good results even with single precision.

Here is the full code: shared vulkan cuda - Google Drive

__global__ void kernel_1(double *n_, double *m_, double *h_, double *I, double *V_YX, uint i, uint i_reg, double dt_sim, uint cellCountX, uint cellCountY) {
    uint x = threadIdx.x + blockIdx.x * blockDim.x;
    uint y = threadIdx.y + blockIdx.y * blockDim.y;
    const double g_Na_max = 4, g_K_add = 0.002, g_L = 0, E_Na = 40, E_L = -60, Area = 7.6e-3; // mm2 3e-8*1e6
    double alpha_n = 0, beta_n = 0, alpha_m = 0, beta_m = 0, alpha_h = 0, beta_h = 0, g_Na = 0, g_K1 = 0, g_K2 = 0, i_Na = 0, i_K = 0, i_Leak = 0, i_ion = 0;

    if (x < cellCountX && y < cellCountY) {
        uint adr = y*cellCountX + x;

        double V = V_YX[adr];
        double n = n_[adr];
        double m = m_[adr];
        double h = h_[adr];

        i_Leak = g_L*(V - E_L);
        //------------potassium
        alpha_n = .0003*(-V - 50) / (exp((-V - 50) / 10) - 1);
        beta_n = .002*exp((-V - 90) / 80);
        n = n + (alpha_n*(1 - n) - beta_n*n)*dt_sim;
        g_K1 = .012*exp((-V - 90) / 50) + .00015*exp((V + 90) / 60);
        g_K2 = .012*n * n * n * n;
        i_K = (g_K1 + g_K2 + g_K_add)*(V + 100);
        //------------ - sodium
        alpha_h = .166*exp((-V - 90) / 20);
        beta_h = 1 / (1 + exp((-V - 42) / 10));
        h = h + (alpha_h*(1 - h) - beta_h*h)*dt_sim;
        alpha_m = .1*(-V - 48) / (exp((-V - 48) / 15) - 1);
        beta_m = .12*(V + 8) / (exp((V + 8) / 5) - 1);
        m = m + (alpha_m*(1 - m) - beta_m*m)*dt_sim;
        g_Na = m * m * m * h * g_Na_max;
        i_Na = (g_Na + .0014)*(V - E_Na);

        i_ion = (i_Na + i_K + i_Leak);

        n_[adr] = n;
        m_[adr] = m;
        h_[adr] = h;

       if (x > 0 && x < cellCountX-1 && y > 0 && y < cellCountY-1 ) { 
           double dx1 = V_YX[cellCountX*y + x - 1] - V; 
           double dx2 = V_YX[cellCountX*y + x + 1] - V;
           double dy1 = V_YX[cellCountX*(y-1) + x] - V;
           double dy2 = V_YX[cellCountX*(y+1) + x] - V;
           double gjX = 125e-9f;
           double gjY = 125e-9f/4.f;
           I[adr] = -Area*i_ion + 1000 * ( gjX*dx1 + gjX*dx2 + gjY*dy1 + gjY * dy2 );
       }
    }
}

__global__ void kernel_2(double *I, double *V_YX, double *V_tYX_reg, uint i, uint i_reg, double dt_sim, double dt_reg, uint cellCountX, uint cellCountY) {
    uint x = threadIdx.x + blockIdx.x * blockDim.x;
    uint y = threadIdx.y + blockIdx.y * blockDim.y;

    const double Cm = .12, Area = 7.6e-3;

    if (x < cellCountX && y < cellCountY) {
        uint adr = cellCountX*y + x;

        V_tYX_reg[i_reg*cellCountX*cellCountY+adr] = V_YX[adr];

        double V = V_YX[adr];

        double up = 100 + 60;
        double I_ext = ( (i_reg > 100 && i_reg < 100 + 2) // I_ext protocol
                   ||  (i_reg > up && i_reg < up + 2) )
                      &&  (y > 32 - 2 && y < 32 + 2
                        && x > 64 - 2 && x < 64 + 2 ) ? 0.08 : 0;
        V_YX[adr] = V + dt_sim*(I_ext + I[adr]) / (Cm*Area);
    }
}

What you posted is the double-precision version, correct? I wonder what the single-precision version looks like. Is it “float clean”? Also, can you show the compiler invocation you are using to build the single-precision version?

I am not going to spend the time analyzing the numerical properties of this code, which would have to take into account the range of the input data. Just a few notes (these may or may not have an impact on overall results):

[1] I can see at least three instances of linear interpolation which are sub-optimal numerically:

    n = n + (alpha_n*(1 - n) - beta_n*n)*dt_sim;
    h = h + (alpha_h*(1 - h) - beta_h*h)*dt_sim;
    m = m + (alpha_m*(1 - m) - beta_m*m)*dt_sim;

Please see this CUDA tech tip on how to interpolate in a numerically advantageous manner, and abstract the interpolation into a function:

[url]https://devblogs.nvidia.com/lerp-faster-cuda/[/url]
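
For reference, a minimal sketch along the lines of that tech tip (the helper name is merely illustrative):

    __device__ __forceinline__ double lerp(double v0, double v1, double t)
    {
        // FMA-based linear interpolation: exact at t == 0 and t == 1, monotonic
        return fma(t, v1, fma(-t, v0, v0));
    }

If I am not mistaken, alpha_n*(1 - n) - beta_n*n equals (alpha_n + beta_n)*(n_inf - n) with n_inf = alpha_n/(alpha_n + beta_n), so the gating update could then be written as n = lerp(n, n_inf, (alpha_n + beta_n)*dt_sim); please verify that regrouping against your model before using it.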

[2] What I find irritating are the subtractions in the many calls to exp(), which open up the code to issues with subtractive cancellation. Note that exp() is an error-magnifying function which amplifies any error present in its argument. What is the origin of these terms? I notice that the exp() arguments involve only integer literals, which I find strange. Are these numbers taken from a publication? I would re-group the computation to avoid the subtractions, e.g.

exp((-V - 90) / 80) ===> exp(-V/80.0) * 3.246524673583497e-1, where the magic constant should be pulled out as a symbolic constant with the value exp(-(90.0/80.0)). One might also want to replace the division with a multiplication for performance: exp(-0.0125 * V) * 3.246524673583497e-1. Then
beta_n = 6.493049347166995e-4 * exp(-0.0125 * V). This should be well-behaved numerically.
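
In code, that regrouping could look like this inside kernel_1 (a sketch; the folded constants are the ones derived above and should be double-checked):

    const double k_exp_n = 3.246524673583497e-1;        // exp(-(90.0/80.0)), pulled out as a symbolic constant
    beta_n = .002 * k_exp_n * exp(-0.0125 * V);
    // or, folding both constants together:
    beta_n = 6.493049347166995e-4 * exp(-0.0125 * V);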

[3] Instead of using exp(expr)-1, it is numerically better to use expm1(expr), in particular if the result of exp() could be near unity. There are at least three instances of this if I looked correctly.
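
For example, the alpha_n term could be rewritten as follows (just a sketch; the same pattern applies to the alpha_m and beta_m denominators):

    // expm1(x) computes exp(x) - 1 accurately even when x is close to zero
    alpha_n = .0003 * (-V - 50) / expm1((-V - 50) / 10);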

[later:]

Based on the fact that potassium and sodium are involved, I wonder whether this is an implementation of the Hodgkin–Huxley model of action potential in neurons. The computation seems quite similar to the computation described in a fair amount of detail in Wikipedia ([url]https://en.wikipedia.org/wiki/Hodgkin–Huxley_model[/url]), although the magic numbers involved don’t seem to be the same.

Thank you for the reply, njuffa.

This is an implementation of the Noble equations (which are based on Hodgkin-Huxley, as they involve the same ion currents; only some constants differ) describing the electrical behavior of heart cells. As I am investigating pathological conditions, I changed the model constants to fit my needs.

Correct, this is the double-precision version; I change the precision by simply replacing all “double” with “float”.

I use NVIDIA Nsight; for the build it invokes:

make all 
Building file: ../src/nobleCudaLinux.cu
Invoking: NVCC Compiler
/opt/cuda/bin/nvcc -O3 -gencode arch=compute_52,code=sm_52  -odir "src" -M -o "src/nobleCudaLinux.d" "../src/nobleCudaLinux.cu"
/opt/cuda/bin/nvcc -O3 --compile --relocatable-device-code=false -gencode arch=compute_52,code=compute_52 -gencode arch=compute_52,code=sm_52  -x cu -o  "src/nobleCudaLinux.o" "../src/nobleCudaLinux.cu"
Finished building: ../src/nobleCudaLinux.cu 
Building target: nobleCudaLinux
Invoking: NVCC Linker
/opt/cuda/bin/nvcc --cudart static --relocatable-device-code=false -gencode arch=compute_52,code=compute_52 -gencode arch=compute_52,code=sm_52 -link -o  "nobleCudaLinux"  ./src/nobleCudaLinux.o   
Finished building target: nobleCudaLinux
  1. Thanks for the insight about interpolation, I will try this.
  2. Thanks for the insight about the constant divisions; I left them as divisions because I need to change some constants while tuning.

But I tried the same algorithm in a GLSL compute shader (using Vulkan 1.1.89) and it calculates good results even using floats!
Also, the CUDA kernel calculates well during the beginning and decay phase of the cardiac action potential, but after the first action potential, when the membrane voltage should recover to the resting potential, that variable suddenly becomes NaN.

That is not sufficient. To create a float-only version, you also need to change all floating-point literals to “float” by adding an ‘f’ suffix. Without a suffix they default to double precision by C++ rules. So in your code you would want to change e.g. .0003 to .0003f for the single-precision version. The use of double-precision literals causes mixed-precision computation and, via implicit C++ conversions, turns some operations into double-precision ones. Mixed-precision computation is difficult to reason about numerically.
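
To make that concrete with the beta_n line from your kernel (single-precision variants, shown for illustration only):

    // unsuffixed .002 is a double literal: the multiply is performed in double,
    // and the result is then truncated back to a float
    beta_n = .002 * exp((-V - 90.0f) / 80.0f);
    // float-clean variant: every floating-point literal carries the 'f' suffix
    beta_n = .002f * exp((-V - 90.0f) / 80.0f);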

What I would recommend is pulling out all literal constants into symbolic constants, and templatizing the kernel by type T. Then you can specify constants like so:

const T pi = (T)3.14159265358979323;
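
A minimal sketch of the idea, using the beta_n term as an example (the names are illustrative only; you would templatize the kernels themselves in the same way, e.g. template <typename T> __global__ void kernel_1(T *n_, ...)):

    template <typename T>
    __device__ __forceinline__ T beta_n_of_V(T V)
    {
        const T c_scale = (T)0.002;   // becomes a float constant in the float instantiation, a double in the double one
        const T c_shift = (T)90.0;
        const T c_slope = (T)80.0;
        return c_scale * exp((-V - c_shift) / c_slope);
    }
    // instantiated as beta_n_of_V<float>(V) or beta_n_of_V<double>(V)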

It is hard for me to gauge the overall numerical stability of this code by just looking at it. I pointed out three items above that have the potential to cause trouble (please also note item [3]). If numerical stability is poor, you may have just gotten lucky with the GLSL shader (i.e. “happens to work” vs. “works by design”). Or your GLSL and CUDA versions are not in fact equivalent.

Here is a quick experiment you can perform. Add -fmad=false to your compilation. This turns off the merging of floating-point multiply and dependent floating-point add into a single fused multiply-add. If this changes results significantly you should carefully examine the numerical aspects of the code, as the use of fused multiply-add (FMA) typically improves both accuracy (single rounding vs two roundings, protection against some cases of subtractive cancellation) and performance (single operation vs two operations).
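
Applied to the compile step from the build log you posted, that would look something like:

/opt/cuda/bin/nvcc -O3 -fmad=false --compile --relocatable-device-code=false -gencode arch=compute_52,code=compute_52 -gencode arch=compute_52,code=sm_52 -x cu -o "src/nobleCudaLinux.o" "../src/nobleCudaLinux.cu"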

Comparing the step-by-step output of a single-precision version versus a double-precision version for representative data should allow you to pinpoint which variables start diverging, and you can then work backwards to understand why they do that. You would want to especially watch out for the effective subtraction of two quantities of similar magnitude anywhere in the code (subtractive cancellation, sometimes also called catastrophic cancellation, although not all cases need be catastrophic).
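
One low-tech way to generate such step-by-step output is device-side printf for a single representative cell in kernel_1 (the cell coordinates below are just an example):

    // inside kernel_1, after i_ion has been computed:
    if (x == 64 && y == 32) {
        printf("step=%u V=%.17g n=%.17g m=%.17g h=%.17g i_ion=%.17g\n",
               i, V, n, m, h, i_ion);
    }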

Floating-point arithmetic is not math (e.g. it is not associative), and coding equations straight from textbooks rarely provides satisfactory numerical results, as can be demonstrated by computations as simple as solving a quadratic equation. The use of single-precision arithmetic then often highlights such numerical issues, which may remain hidden when computing in double precision.
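
For instance, here is the classic quadratic-equation illustration (unrelated to your model, just to demonstrate the point):

    // solve a*x^2 + b*x + c = 0, assuming a != 0 and real roots exist;
    // the textbook form (-b + sqrt(b*b - 4*a*c)) / (2*a) suffers catastrophic
    // cancellation when b > 0 and |4*a*c| << b*b
    __host__ __device__ void solve_quadratic(double a, double b, double c, double *r1, double *r2)
    {
        double q = -0.5 * (b + copysign(sqrt(b * b - 4.0 * a * c), b)); // no cancellation in this sum
        *r1 = q / a;   // root from the stable intermediate
        *r2 = c / q;   // second root via the product of the roots, r1*r2 = c/a
    }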

You would also want to run your code under cuda-memcheck to see whether it reports any issues. There may be non-numerical problems, such as race conditions, that manifest as numerical issues because the wrong data gets used.
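
For example, with the binary produced by your build:

cuda-memcheck ./nobleCudaLinux
cuda-memcheck --tool racecheck ./nobleCudaLinux

(racecheck specifically looks for shared-memory hazards, which your kernels do not appear to use, but it is cheap to run.)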

Your insight about adding the “f” suffix appears to be correct! Now it calculates correct results and runs twice as fast (and eight times faster than double precision)!
I am wondering why. If some constants were interpreted as double, why did the error appear? Double is more accurate than float; maybe the mixed-precision arithmetic caused an error that accumulated.
One observation: if I don’t add any of these flags (-prec-div, -prec-sqrt, -fmad), the simulated membrane voltage variable suddenly gets set to NaN after some simulation time. So your insight about -fmad was also interesting.

Thanks for your work writing detailed answers and sharing your experience!

I will also try optimizations [1], [2], and [3].
Very interesting!

The compiler defaults to -prec-div=true -prec-sqrt=true, and for scientific computations it is usually advisable to leave it that way, as this results in the use of correctly rounded divisions and square roots. Likewise, the compiler defaults to -ftz=false, and again, for scientific applications it is likely best to leave it as is. The flush-to-zero behavior of -ftz=true can lead to nasty numerical artifacts.

NaNs get generated for invalid operations according to the rules of the floating-point standard IEEE-754 (2008). Examples of invalid operations are 0/0, the square root of a negative number, and the logarithm of a negative number.

Note that -fmad=true is the default, and the resulting contractions of FMUL+FADD into FMA could result in spurious NaNs in very rare circumstances, due to the ambiguity of the contraction: should a*b - c*d map to fma(a, b, -(c*d)) or to fma(-c, d, a*b)? The results will differ and could produce a very small, unexpected negative result instead of zero. However, the only cases I have encountered in a decade of working with FMA were contrived examples.
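
A contrived illustration of the kind of case I mean (purely hypothetical, not something I expect in your code):

    // with p == q, p*p - q*q is exactly zero without contraction; one possible
    // contraction instead leaves the rounding error of q*q, which may be negative
    float p = 1.1f, q = 1.1f;
    float t = q * q;              // rounded product
    float r = fmaf(p, p, -t);     // exact p*p minus rounded q*q: tiny, possibly negative
    float s = sqrtf(r);           // NaN whenever r ends up negative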

In general one would want to keep -fmad=true for both accuracy and performance.