Dropping precision in CUDA?

I hope this question isn’t too silly. I have two versions of the same code that I’m trying to get to produce the same results, but I can’t quite figure out why they don’t. One thing I noticed is that the CUDA version is dropping precision.

For example, if I have
real*8 :: bla
real*8 :: bla_but_GPU

after doing math I get
bla_but_GPU = 4.7935972751360820
bla = 4.793597275136081

While I’ve read this part of the documentation

I can’t find anything on why this dropped digit is happening or how to fix it.
So I guess my question is: is there a way to prevent CUDA from dropping this last digit? And in a similar vein, is there a way to force the CUDA code to use a different math library?

There could be multiple reasons for a small difference:

  1. Try disabling FMA instructions on the GPU (-Mcuda=nofma).
  2. If you have reductions in your code, note that parallel reductions (the algorithm usually generated for the GPU) are typically more accurate than a serial loop, so the results will differ slightly.

Hi Cattaneo,

You may want to ask this question over on the CUDA forum (https://forums.developer.nvidia.com/c/accelerated-computing/cuda/206) since this one is primarily for questions about the NV HPC Compilers, but I’ll do my best to help.

What you’re asking is whether you can get bit-for-bit reproducible results between a CPU and a GPU, and this may or may not be possible depending upon the algorithm you’re using. In general, the compiler can control the optimizations it applies to ensure better conformance to IEEE 754, but things like a different accumulation of rounding error due to the order of operations in a parallel context will rarely be bit-for-bit comparable.

Also, the types of operations used can affect accuracy. For example, FMA (Fused Multiply-Add) instructions fuse “x=A+B*C” type operations into a single instruction rather than splitting them into a multiply followed by an add. There’s less rounding error with an FMA, but it may yield slightly different results than code compiled without FMA.

Also keep in mind that IEEE 754 double precision carries only about 15–17 significant decimal digits, and your difference starts after 15 digits. A slight difference in the last place is not unusual. In general, it’s best to check whether two results are within an acceptable tolerance (absolute or relative) rather than check for bit-for-bit equality.

And in a similar vein is there a way to force the CUDA code to use a different math library?

I’m not 100% sure what you mean, since device code typically doesn’t call libraries. Are you calling CUDA libraries from host code, or do you mean “math.h” things like “cos” and “sin”, which are built-in operations (no library call)?

If you have a reproducing example to share, that would be helpful.


Well, an example of the dropped digit: if I read from a file into a variable, the GPU version’s variable ends up with the last digit dropped, even though both variables have the same name and read from the same file.

I used the term “library” because the documentation says:

“The consequence is that different math libraries cannot be expected to compute exactly the same result for a given input. This applies to GPU programming as well. Functions compiled for the GPU will use the NVIDIA CUDA math library implementation while functions compiled for the CPU will use the host compiler math library implementation (e.g., glibc on Linux). Because these implementations are independent and neither is guaranteed to be correctly rounded, the results will often differ slightly.”

I assume this means that things like dsqrt or dexp may work a little differently, and I was wondering if there was a way to rectify that.

I am currently working on a reproducing example, but unfortunately the relevant code is buried fairly deep, so it may take me a while to isolate it.

You’re correct, that’s what they mean. I was just checking whether you were using something like cuBLAS or another CUDA library.

Also, as they state, this situation can occur between various implementations of math libraries, even between CPUs. For example, using IBM’s libmass library on a Power system may yield slightly different results than what you’d see with libm on an x86_64 system. In other words, it’s a general issue when switching between math libraries, not one specific to a GPU, and it’s why most validation of floating-point results is done using a tolerance.