How can I make the results of floating-point operations on the GPU match those on the CPU more closely? One thing I tried was the /arch:SSE compiler option, which improves my results considerably.
Are there other such tricks? Is there anything else to keep in mind when doing floating-point arithmetic on the GPU to get closer results?