Float accuracy : OpenCL and CUDA

Hello,

I have a CUDA kernel that I just ported to OpenCL, and I get different results.

Precisely, the results are identical everywhere on a 100x50x50 3D matrix, except for two columns. Let me summarize :

This is what the input looks like (the values are not actually 1 and 0.1):

1 1 1 1 1 0.1 0.1 0.1 0.1 0.1

1 1 1 1 1 0.1 0.1 0.1 0.1 0.1

1 1 1 1 1 0.1 0.1 0.1 0.1 0.1

1 1 1 1 1 0.1 0.1 0.1 0.1 0.1

1 1 1 1 1 0.1 0.1 0.1 0.1 0.1

The output matrix is computed by finite differences (degree 1 in space):

1 1 1 1 X X 0.1 0.1 0.1 0.1

1 1 1 1 X X 0.1 0.1 0.1 0.1

1 1 1 1 X X 0.1 0.1 0.1 0.1

1 1 1 1 X X 0.1 0.1 0.1 0.1

1 1 1 1 X X 0.1 0.1 0.1 0.1

The X mark the places where CUDA and OpenCL give different values, with a relative error of about 10^-6.

I tried changing the compilation options (with and without --use_fast_math for nvcc, -cl-fast-relaxed-math for OpenCL…), but I always get different results.

The computation involves :

(CUDA <-> OpenCL)

    fmaxf <-> max

    fminf <-> min

    sqrtf <-> sqrt

    [], /, +

I wondered if this is normal behavior. Perhaps the sqrt implementation is different?

There are multiple sqrt variants in OpenCL; which one should I choose to get the same results?

I read that CUDA implements sqrt as 1/rsqrt. Should I explicitly write 1.0f/rsqrt(x) in my OpenCL code?

Does it come from the optimization step? (involving fmad or something like that)

Does it come from me? (but this is a simple kernel without any synchronization needs, so translating it to OpenCL is little more than a ‘sed’)

Any comment appreciated :)

Thank you !

PS : using CUDA 3.1, driver 256.40

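Single precision numbers have only about 7 significant decimal digits (machine epsilon is about 1.2 x 10^-7). So a 10^-6 relative error is a difference of a few units in the last place, which is essentially identical.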

Yes, obviously, but for an extensive non-regression test, having exactly the same values would be very helpful.

I’m not saying that it has to be the same, I just want to have control over those differences!

The fact is that, even with all math and other optimizations disabled in both CUDA and OpenCL, I get different values. So if there are actually different implementations of an operator, that is precious information to have!
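For a non-regression test like the one described here, one common option is to compare values by their distance in units in the last place (ULP) instead of requiring bit-for-bit equality. What follows is only a minimal C sketch, not code from this thread: the helper names ordered_bits and ulp_distance are made up for illustration, and it assumes finite, non-NaN IEEE 754 single-precision inputs.

```c
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern onto an integer line on which adjacent
 * floats differ by exactly 1 and +0.0f / -0.0f both map to 0.
 * Assumption: x is finite and not NaN. */
int64_t ordered_bits(float x)
{
    int32_t i;
    memcpy(&i, &x, sizeof i);            /* reinterpret the bits safely */
    if (i < 0) {
        /* Negative floats are sign-magnitude: convert to -magnitude. */
        return (int64_t)INT32_MIN - (int64_t)i;
    }
    return (int64_t)i;
}

/* Number of representable floats between a and b (0 means identical). */
int64_t ulp_distance(float a, float b)
{
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}
```

With these helpers, ulp_distance(1.0f, 1.000001f) is 8, so a 10^-6 relative error near 1.0 corresponds to roughly 8 ULP. A test could accept the CUDA and OpenCL outputs whenever the distance stays under a chosen threshold; which threshold is acceptable remains a judgment call.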

Getting exactly the same result is probably impossible. It’s enough that one compiler fuses a multiply and an add into a mad where the other does not, or changes something like the order of operations, and you will get different results.

Not sure if OpenCL supports it, but if you declare all your variables as volatile and use the intrinsic add and multiply functions (__fadd_rn and __fmul_rn in CUDA), which the compiler does not merge into a mad, you may manage to get the same results.
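To make the two effects mentioned above concrete, here is a small host-side C sketch (an illustration under stated assumptions, not the kernel from this thread). All function names are hypothetical. The "unrounded product" variant emulates a fused multiply-add by keeping the product exact in double precision; it is a stand-in for a GPU mad/fma, not the hardware instruction itself. The volatile temporaries force each intermediate to be rounded to single precision, in the same spirit as the volatile suggestion above.

```c
/* a*b + c where the product is rounded to float before the add. */
float mul_then_add(float a, float b, float c)
{
    volatile float p = a * b;      /* product rounded to float here */
    volatile float r = p + c;
    return r;
}

/* a*b + c where the product is NOT rounded to float first.
 * The product of two floats is exact in double, so this mimics
 * what a fused multiply-add does with the intermediate value. */
float mul_add_unrounded_product(float a, float b, float c)
{
    volatile float r = (float)((double)a * (double)b + (double)c);
    return r;
}

/* (a + b) + c, with the partial sum rounded to float. */
float sum_left(float a, float b, float c)
{
    volatile float s = a + b;
    volatile float r = s + c;
    return r;
}

/* a + (b + c), with the partial sum rounded to float. */
float sum_right(float a, float b, float c)
{
    volatile float s = b + c;
    volatile float r = a + s;
    return r;
}
```

For example, with a = 1.0f + 1.0f/8192.0f, mul_then_add(a, a, -1.0f) and mul_add_unrounded_product(a, a, -1.0f) return different values, and with t = 0.75f/16777216.0f, sum_left(1.0f, t, t) and sum_right(1.0f, t, t) differ by one ULP. Neither result is "wrong"; they are just different roundings of the same expression, which is why two compilers can legitimately disagree in the last digits. On the OpenCL side, the -cl-mad-enable build option and the FP_CONTRACT pragma are the documented knobs for this kind of fusion; whether they are enough to match a given CUDA build bit for bit is not something this sketch can show.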

Hmm, reading your answer I realize I wasn’t thinking as a GPU compiler developer :)

Thank you very much for opening my eyes !