"Emulating" emulation-mode float results? GPU float calculations giving very different results

Hi everyone,

What are the ways to bring the results of float operations on the GPU closer to those on the CPU? One thing I tried was the /arch:SSE compiler option, which does improve my results considerably.

Are there other such tricks? Any other things to keep in mind while performing float arithmetic on the GPU for closer results?

Any help will be highly appreciated.

TIA,
aj

Will using __fmul_rn/__fadd_rn help?
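
On that last question, here is a rough host-side C++ sketch of why those intrinsics can matter. The CUDA documentation states that __fmul_rn and __fadd_rn are never merged into a single multiply-add instruction, whereas a plain a * b + c in device code may be contracted into a fused multiply-add (FMA), which rounds once instead of twice. The sketch compares the two roundings on the CPU. The names MulAddResult and compare_mul_add are my own, made up for illustration, and this shows only the rounding effect, not the behaviour of any particular GPU.

```cpp
#include <cmath>

// Hypothetical helper type for this illustration only.
struct MulAddResult {
    float unfused;  // multiply rounded to float, then add rounded to float
    float fused;    // single rounding after the exact a * b + c
};

// Computes a * b + c both ways so the two results can be compared.
MulAddResult compare_mul_add(float a, float b, float c) {
    // volatile forces the product to be rounded and stored as a float,
    // so the host compiler cannot contract the two steps into an FMA.
    volatile float product = a * b;
    float unfused = product + c;

    // std::fma evaluates a * b + c with one rounding at the end,
    // which is what a contracted multiply-add does.
    float fused = std::fma(a, b, c);

    return {unfused, fused};
}
```

For example, with a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. The unfused path rounds the product to 1 + 2^-11 and returns 0, while the fused path keeps the low bit and returns 2^-24. If contraction is the source of the mismatch, writing the device expression as __fadd_rn(__fmul_rn(a, b), c), or compiling with nvcc's -fmad=false, should give the two-rounding result; whether that matches the CPU still depends on how the host code is compiled.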