Dropping precision in CUDA?

Cattaneo · December 14, 2020, 8:25pm

I hope this question isn’t too silly. I have two versions of the same code that I’m trying to get to produce the same results, but I can’t quite figure out why they don’t. One thing I noticed, though, is that the CUDA version seems to be dropping precision.

For example, if I have
real*8 :: bla
real*8 :: bla_but_GPU

after doing math I get
bla_but_GPU = 4.7935972751360820
bla = 4.793597275136081

While I’ve read this part of the documentation

I can’t find anything on why this dropped digit is happening or how to fix it.
So I guess my question is, is there a way to prevent CUDA from dropping this last digit? And in a similar vein is there a way to force the CUDA code to use a different math library?

mfatica · December 14, 2020, 10:22pm

There could be multiple reasons for a small difference:

Try disabling FMA instructions on the GPU (-Mcuda=nofma).
If you have reductions in your code, parallel reductions (the algorithm usually coded on the GPU) are usually more accurate than a sequential sum.
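The reduction point can be sketched on the CPU alone. The following Python snippet is only an illustration under assumptions: the data is made up, and `pairwise_sum` is a hypothetical stand-in for the tree-shaped order a parallel reduction typically uses, not the actual GPU algorithm.

```python
import math

def sequential_sum(values):
    """Left-to-right accumulation, as a simple serial loop would do."""
    total = 0.0
    for v in values:
        total += v
    return total

def pairwise_sum(values):
    """Tree-shaped accumulation, similar in spirit to a parallel reduction."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])

# Floating-point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, repr(left), repr(right))

# Illustrative data only (not from the thread).
data = [0.1] * 1000
reference = math.fsum(data)  # correctly rounded sum, used as a reference
seq = sequential_sum(data)
tree = pairwise_sum(data)
print(f"sequential error: {abs(seq - reference):.3e}")
print(f"pairwise error:   {abs(tree - reference):.3e}")
```

Both orders are valid IEEE 754 arithmetic. They just accumulate rounding error differently, which is why a serial CPU loop and a GPU reduction can disagree in the last digits.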

MatColgrove · December 14, 2020, 10:49pm

Hi Cattaneo,

You may want to ask this question over on the CUDA forum (CUDA - NVIDIA Developer Forums), since this one is primarily for questions about the NV HPC Compilers, but I’ll do my best to help.

What you’re asking is whether you can get bit-for-bit reproducible results between a CPU and a GPU, and this may or may not be possible depending on the algorithm you’re using. In general, the compiler can control the optimizations it applies to ensure better conformance to IEEE 754, but results affected by things like different accumulation of rounding error due to the order of operations in a parallel context will rarely be bit-for-bit comparable.

Also, the types of operations used can affect accuracy. For example, FMA (Fused Multiply-Add) instructions fuse “x=A+B*C” type operations into a single instruction, rather than splitting them into a multiply followed by an add. There’s less rounding error with an FMA, but it may yield slightly different results than without FMA.
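To make the FMA effect concrete, here is a small Python sketch. It does not use a hardware FMA. Instead, `fused_multiply_add` is a hypothetical helper that emulates one by computing `a*b + c` exactly with rational arithmetic and rounding once at the end. The input values are chosen purely for illustration.

```python
from fractions import Fraction

def separate_multiply_add(a, b, c):
    """Round after the multiply, then round again after the add."""
    return a * b + c

def fused_multiply_add(a, b, c):
    """Emulate an FMA: form a*b + c exactly, then round once to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Illustrative inputs: the exact product is 1 - 2**-60, which rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = separate_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print("multiply then add:", unfused)
print("fused (emulated): ", fused)
```

In this case the separate multiply loses the small term before the add can see it, while the fused version keeps it. Neither result is a bug. They are two different, legitimate roundings of the same expression.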

Also keep in mind that IEEE 754 double precision is only accurate to around 16 significant digits, and your difference starts after 15 places. Slight differences in the last place are not unusual. In general, it’s best to check whether two results are within an acceptable tolerance (absolute or relative) rather than check for bit-for-bit comparability.
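A tolerance check along those lines might look like the sketch below, using the two values quoted at the top of the thread. The helper name `within_tolerance` and the tolerance of 1e-12 are assumptions for illustration. An appropriate tolerance depends on the algorithm.

```python
import math

# Values as printed in the original post.
bla = 4.793597275136081
bla_but_gpu = 4.7935972751360820

def within_tolerance(x, y, rel_tol=1e-12, abs_tol=0.0):
    """True if x and y agree to a relative or absolute tolerance.

    The default tolerances are illustrative, not a recommendation.
    """
    return math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)

print("bit-for-bit equal:", bla == bla_but_gpu)
print("relative difference:", abs(bla - bla_but_gpu) / abs(bla))
print("within tolerance:", within_tolerance(bla, bla_but_gpu))
```

Set `abs_tol` to a nonzero value when the expected results can be at or near zero, since a purely relative test is too strict there.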

“And in a similar vein is there a way to force the CUDA code to use a different math library?”

I’m not 100% sure what you mean, since device code typically doesn’t call libraries. Are you calling CUDA libraries from host code, or do you mean “math.h” things like “cos” and “sin”, which are built-in operations (no library call)?

If you have a reproducing example to share, that would be helpful.

-Mat

Cattaneo · December 15, 2020, 9:27pm

Well, an example of the dropped digit: if I read a value from a file into a variable on both sides, the GPU one drops the last digit, even though both are called the same thing and read the same file.
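One thing that may be worth ruling out here (an assumption, since the thread does not show how the values are printed) is whether the two variables actually hold different bits, or whether the same value is just being printed with a different number of digits. A minimal Python sketch of that check, using an illustrative value:

```python
# Illustrative value; stands in for whatever is read from the file.
x = 4.793597275136081

# The same double printed with 16 and with 17 significant digits.
sixteen_digits = format(x, ".15e")
seventeen_digits = format(x, ".16e")
print(sixteen_digits)
print(seventeen_digits)

# Comparing the exact bit pattern avoids any dependence on print formatting.
print(x.hex())

# 17 significant digits are always enough to round-trip a double.
round_tripped = float(seventeen_digits)
print(round_tripped.hex() == x.hex())
```

If the exact representations match, the difference is only in the output format. If they differ, the values really are different and the rounding explanations above apply.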

I used the term library because in the documentation it said

“The consequence is that different math libraries cannot be expected to compute exactly the same result for a given input. This applies to GPU programming as well. Functions compiled for the GPU will use the NVIDIA CUDA math library implementation while functions compiled for the CPU will use the host compiler math library implementation (e.g., glibc on Linux). Because these implementations are independent and neither is guaranteed to be correctly rounded, the results will often differ slightly.”

I assume this means that things like dsqrt or dexp may work a little differently, and I was wondering if there was a way to rectify that.

I am currently working on a reproducing example, but unfortunately the code needed is buried fairly deep, so it may take me a while to isolate it as much as possible.

MatColgrove · December 15, 2020, 10:01pm

You’re correct, that’s what they mean. I was just checking whether you were using something like cuBLAS or another CUDA library.

Also, as they state, this situation can occur between various implementations of math libraries, even between CPUs. For example, using IBM’s libmass library on a Power system may yield slightly different results than what you’d see with libm on an x86_64 system. In other words, it’s a general issue when switching between math libraries, not one specific to a GPU, and it’s why most validation of floating-point results is done using a tolerance.
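When comparing results from two math libraries, it can help to express the difference in units in the last place (ULPs) rather than in decimal digits. The sketch below is one way to do that in Python. `ordered_bits` and `ulp_distance` are hypothetical helper names, and NaN inputs are not handled.

```python
import struct

def ordered_bits(x):
    """Map a double to an integer that increases monotonically with x."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative doubles have the sign bit set; flip them so ordering is preserved.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

def ulp_distance(x, y):
    """Number of representable doubles between x and y (0 means identical)."""
    return abs(ordered_bits(x) - ordered_bits(y))

# A well-known pair of adjacent doubles.
print(ulp_distance(0.1 + 0.2, 0.3))

# The two values quoted at the top of the thread.
print(ulp_distance(4.793597275136081, 4.7935972751360820))
```

A distance of a few ULPs between independent math library implementations is the kind of difference the documentation describes. A validation test can then accept results within a small ULP budget instead of requiring equality.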
