Choosing known source code implementations of transcendental functions – such as those frequently posted by Norbert Juffa (njuffa) on this forum – could be a reasonable workaround for undocumented changes that NVIDIA has made to the official implementations.
Changes to optimizations affecting floating-point arithmetic are typically minor. There should be no expectation of bitwise identical results across different versions of a tool chain, on any platform. Have you read NVIDIA's floating-point whitepaper for background?
If relatively small changes in the tool chain’s handling of floating-point computation lead to significantly different final results, this is a pretty good indication that your software implementation lacks numerical stability, something you might want to investigate.
Orthogonal to that effort, in order to recommend mitigation steps, you would have to first narrow down which section of code is the root cause of the observed differences. Two common scenarios in the context of CUDA are: (1) compiler changes affecting contraction of FMUL followed by FADD into FMA (fused multiply-add); (2) accuracy improvements to transcendental functions in the standard math library.
If you use floating-point atomics, the order of the operations is indeterminate, and because floating-point addition is not associative, results may differ from run to run. I am pretty sure that is spelled out in the whitepaper I mentioned.