You may want to ask this question over on the CUDA forum (https://forums.developer.nvidia.com/c/accelerated-computing/cuda/206), since this one is primarily for questions about the NV HPC Compilers, but I’ll do my best to help.
What you’re asking is whether you can get bit-for-bit reproducible results between a CPU and a GPU, and this may or may not be possible depending upon the algorithm you’re using. In general, the compiler can control the optimizations that it applies to ensure better conformance to IEEE 754, but things like rounding error accumulating differently due to the order of operations in a parallel context will rarely be bit-for-bit comparable.
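To illustrate the order-of-operations point, here is a minimal host-side C++ sketch (not taken from your code; the function names are just ones I made up). It sums the same values in two different orders, which is roughly what happens when a serial loop is replaced by a parallel reduction.

```cpp
#include <cstddef>
#include <vector>

// Sum from the first element to the last, as a simple serial loop would.
double sum_forward(const std::vector<double>& v) {
    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) total += v[i];
    return total;
}

// Sum from the last element to the first. This is only a stand-in for
// "some other order", such as the one a parallel reduction might use.
double sum_backward(const std::vector<double>& v) {
    double total = 0.0;
    for (std::size_t i = v.size(); i-- > 0;) total += v[i];
    return total;
}
```

With the input {1e16, -1e16, 1.0}, the forward sum gives 1.0 while the backward sum gives 0.0, because 1.0 is lost when it is added to -1e16 first. Real data is rarely this extreme, but the same effect shows up in the last digits. This assumes the code is built without value-changing options such as fast-math.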
Also, the types of operations used can affect accuracy. For example, FMA (Fused Multiply-Add) instructions fuse “x=A+B*C” type operations into a single instruction, rather than splitting them into a multiply followed by an add. There’s less rounding error with an FMA, since the result is rounded once instead of twice, but it may yield slightly different results than without FMA.
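The FMA effect can be reproduced on the host with std::fma from the C++ standard library. This is only a sketch with hypothetical helper names. The volatile temporary is my own addition to keep the compiler from contracting the separate multiply and add into an FMA by itself.

```cpp
#include <cmath>

// A + B*C with the product rounded to double before the add (two roundings).
// The volatile temporary is there to block compiler FMA contraction.
double mul_then_add(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// A + B*C computed as one fused operation (a single rounding at the end).
double fused_mul_add(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

For a = b = 1 + 2^-27 and c = -(1 + 2^-26), the exact product is 1 + 2^-26 + 2^-54. The separate multiply rounds away the 2^-54 term, so mul_then_add returns 0, while fused_mul_add keeps it and returns 2^-54. Both answers are valid IEEE 754 results; they just are not the same bits.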
Also keep in mind that an IEEE 754 double-precision value only carries about 15 to 17 significant decimal digits, and your difference starts after 15 places. Slight differences in the last place are not unusual. In general, it’s best to check whether two results are within an acceptable tolerance (absolute or relative) rather than check for bit-for-bit equality.
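A tolerance check along those lines might look like the following. Treat it as a sketch: nearly_equal is a hypothetical helper, and the default tolerances are placeholder assumptions that you would want to tune for your own data.

```cpp
#include <algorithm>
#include <cmath>

// Returns true when a and b agree to within a relative OR an absolute tolerance.
// The default tolerances are assumed placeholder values, not recommendations.
bool nearly_equal(double a, double b,
                  double rel_tol = 1e-12, double abs_tol = 1e-15) {
    if (a == b) return true;  // exact match, including equal infinities
    if (!std::isfinite(a) || !std::isfinite(b)) return false;  // NaN or mismatched inf
    const double diff = std::fabs(a - b);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    // The absolute tolerance covers values near zero, where a relative
    // tolerance alone would be too strict.
    return diff <= std::max(abs_tol, rel_tol * scale);
}
```

You would then compare the CPU and GPU results with nearly_equal(cpu_value, gpu_value) instead of ==, where cpu_value and gpu_value stand for whatever pair of results you are checking.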
And in a similar vein is there a way to force the CUDA code to use a different math library?
I’m not 100% sure what you mean, since device code typically doesn’t call libraries. Are you calling CUDA libraries from host code, or do you mean “math.h” functions like “cos” and “sin”, which are built-in operations in device code (no library call)?
If you have a reproducing example to share, that would be helpful.