Strange behavior of the cosf function (possible bug?)

I found a very surprising behavior of the cosf function.

I execute the following kernel on a vector of 128*1024 elements.

__global__ void calc_device(float *data) {
	const int tid = blockIdx.x*blockDim.x + threadIdx.x;

	const float x = data[tid];
	data[tid] = 3.0f*x*x*cosf(x*x + x + 1);
}

All the components of the vector are in [0,1].

I compute the relative error between the GPU result and the same function computed on the CPU (see the sketch after the list) and got this:

  • when compiled with the -G option: relative error around 2×10^-7 (that's fine)
  • when compiled without the -G option: relative error around 0.001 (that's not acceptable)
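
For reference, a minimal sketch of this kind of host-side check (the real code is in the attached zip; the name calc_host and the use of the maximum error over all elements are only illustrative assumptions):

#include <math.h>
#include <stdio.h>

/* CPU reference, same formula as the kernel */
static float calc_host(float x)
{
    return 3.0f * x * x * cosf(x * x + x + 1.0f);
}

/* maximum relative error over all elements: relerr = |(cpu - gpu)/cpu| */
static float max_rel_error(const float *gpu, const float *input, int n)
{
    float maxerr = 0.0f;
    for (int i = 0; i < n; i++) {
        float ref = calc_host(input[i]);
        if (ref != 0.0f) {
            float err = fabsf((ref - gpu[i]) / ref);
            if (err > maxerr)
                maxerr = err;
        }
    }
    return maxerr;
}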

I did several tests with the 5.0 and 4.2 toolkits on compute capability 2.0 and 3.0, with and without the -O optimization flag, and got the same result, which depends exclusively on the -G flag.

I also tried to compile with --use_fast_math; in that case the -G flag has no influence and the relative error is around 0.05 (5% sounds surprisingly high to me).

I don’t understand this kind of behavior. Why does the generation of debuggable device code drastically modify the precision of cosf?

Can somebody give me the exact mathematical implementation of cosf and __cosf?

Any idea would be greatly appreciated.

In case it is needed, you will find the full code below.

code.zip (3.54 KB)

(1) Does your app check the return status of all CUDA API calls and kernel launches?
(2) Have you run the app with cuda-memcheck to see whether there are out-of-bounds accesses?

If no errors are reported by the CUDA runtime and cuda-memcheck reports no issues either, can you post a minimal, complete program that reproduces the problem, using the smallest array possible?

The -G switch controls whether the code is compiled in a way that is suitable for debugging. With -G, all optimizations normally applied to device code are turned off, as best I know. The -O flag of NVCC is passed through to the host compiler, for use with the host portion of the code, so it makes sense that changing it would not affect the behavior of the device code.

Yes.

Yes, no issues.

I just attached the code in my previous message. It’s the first example for my students, so it’s already minimal code.

Ok.

Thanks. I will try to look at this today. I notice that the inputs to cosf() can become quite large, on the order of 1e10. This should not be a problem as the trigonometric functions are designed to deliver correct results across the entire range (this is covered well by tests). Strike that, I misread “data[tid]” as just “tid” …

The error bounds for cosf() and __cosf() are documented in the CUDA C Programming Guide. You can inspect the source code for cosf(), it is contained in the header file math_functions.h which is installed together with all the other CUDA header files.

The issue may be with the argument passed to cosf(). If x can be negative, there may be subtractive cancellation in the computation of x*x + x + 1, and this may be mitigated by the use of FMA (fused multiply-add) when the code is compiled with optimizations on the GPU. As a consequence, the GPU results would be more accurate than the CPU results.

To do a quick check of this hypothesis, please add -fmad=false to your nvcc invocation (this flag turns off the automatic generation of FMAs). If adding this switch makes the problem disappear, I would suggest reading the following NVIDIA whitepaper: http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
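
In case it helps, the effect of FMA contraction can also be seen in isolation with a quick host-side sketch using C99 fmaf() (an illustration, not the actual kernel; compile the host code with contraction disabled, e.g. gcc -ffp-contract=off, so the plain expression stays a separate multiply and add; the input value is just an example):

#include <math.h>
#include <stdio.h>

int main(void)
{
    float x = 0.40597811f;                 /* example input from [0,1] */
    float arg_sep = x * x + x + 1.0f;      /* x*x rounded, then +x rounded, then +1 */
    float arg_fma = fmaf(x, x, x) + 1.0f;  /* x*x + x rounded once (like FFMA), then +1 */
    printf("separate: arg = %.9e  cosf(arg) = %.9e\n", arg_sep, cosf(arg_sep));
    printf("fused:    arg = %.9e  cosf(arg) = %.9e\n", arg_fma, cosf(arg_fma));
    return 0;
}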

Thanks for the link to that paper; I might use it as a reference in one of the CUDA algorithm papers I’m writing.

A core problem is you’re assuming that the relative error should be small. That assumption is unfounded.

When the value itself is near zero, the floating point errors may be tiny in absolute terms, but huge in relative terms.

As a specific example, for an input value of around 0.40597… the argument becomes very close to pi/2, the function value becomes very close to 0.0, and the relative error can become quite large.

I ran a quick test on the CPU, computing values in double and float precision.
The argument was chosen such that it was within your [0…1] range, and exactly representable as a float, so the computational difference is entirely from the function computation, not the argument representation.

Double precision: x=0.40597811341285705566 f(x) = -0.00000002161330137085 
Float precision:  x=0.40597811341285705566 f(x) = -0.00000000750988806082

Absolute error is 1.410341331003e-8

Relative error is 1.877979165042 (taken relative to the float result)
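
For anyone who wants to reproduce this, here is a sketch of the test, with the relative error taken relative to the float result as in the numbers above (the exact digits will depend on the platform's libm):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* x is exactly representable as a float, so the double and float
       computations start from the identical input value */
    float  xf = 0.40597811341285705566f;
    double xd = (double)xf;
    double fd = 3.0  * xd * xd * cos (xd * xd + xd + 1.0);
    float  ff = 3.0f * xf * xf * cosf(xf * xf + xf + 1.0f);
    printf("double: f(x) = %.20f\n", fd);
    printf("float:  f(x) = %.20f\n", (double)ff);
    printf("abs err = %e  rel err = %e\n",
           fabs(fd - ff), fabs((fd - ff) / ff));
    return 0;
}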

x is in [0,1], so x*x + x + 1 is in [1, 3] and < Pi. That’s not the worst situation for cos computation! Everything is positive, so no subtractive cancellation error.

Compiling with -fmad=false solves the issue. By the way, this also changes the code generated for the implementation of cosf.
I just made a quick test computing 3.0f*x*x*(x*x + x + 1) instead of 3.0f*x*x*cosf(x*x + x + 1). In that case (for 1M elements) the relative error is 0 with -fmad=false and 2×10^-7 otherwise. So I think there is something strange happening with the cosf implementation when compiled without -fmad=false.

Another strange thing: without -fmad=false, the relative error increases with the number of elements:

  • 10K -> 6×10^-5
  • 100K -> 0.001
  • 1M -> 0.02

but there is no reduction, no communication between threads …

The CUDA math library is a set of header files with inline functions that are subject to any compiler flags chosen by the user. However, I am reasonably sure I put in sufficient safeguards such that all documented accuracy targets are maintained regardless of -fmad={true|false}. I will first check whether I can reproduce the behavior you are seeing, and if so, will need to analyze a few of those cases in detail.

What GPU are you running on? I have sm_13, sm_21, and sm_35 available to me. I think the subexpression x*x + x leads to these differences. On sm_1x, with optimization, this will map to FMAD, which is less accurate than separate FMUL + FADD. For sm_20 and above, with optimization, this will map to FFMA, which is more accurate. In either case the results will differ from separate FMUL + FADD, which is what you get on the CPU (unless you are targeting x87, in which case the CPU may provide excess precision; see the whitepaper).

Anyhow, I am sure we can get to the bottom of this.

I ran the code on sm_20 and sm_30.
You are right about x*x + x + 1: I made a test with cosf(x) and x in [0, Pi[ instead of cosf(x*x + x + 1) and it’s OK (relative error on the order of 10^-7). So cosf is not the problem.
The question remains: how does FMA in a simple expression like x*x + x + 1 lead to this strange result?

SOLVED!

Here is the explanation:

  • without -fmad=false, x*x + x + 1 gives slightly different results on the CPU and the GPU
  • when x*x + x + 1 is near +/-pi/2, a small difference between CPU and GPU (7.58e-08 relative error) leads to a 0.00014 relative error in cos
  • the more random elements I have in my array, the bigger the probability of hitting such a situation

The devil is in the details.
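
As a side note, the critical input can be computed directly: solving x*x + x + 1 = pi/2 for the positive root gives x = (sqrt(2*pi - 3) - 1)/2 ≈ 0.405978, the value that appeared earlier in the thread. A quick check (just a sketch):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double PI = 3.14159265358979323846;
    /* positive root of x^2 + x + 1 = pi/2 */
    double x = (sqrt(2.0 * PI - 3.0) - 1.0) / 2.0;
    printf("critical x  = %.17f\n", x);               /* ~0.40597811... */
    printf("x*x + x + 1 = %.17f\n", x * x + x + 1.0);
    printf("pi/2        = %.17f\n", PI / 2.0);
    return 0;
}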

Thanks for your help.

I have to admit this error magnification effect doesn’t make sense to me at the moment. I will look at a specific case to gain clarity.

Since your target is sm_2x / sm_3x, the use of FMA to evaluate x*x + x + 1 as fadd(fma(x, x, x), 1) should provide a more accurate result, which is easily checked by using a double-precision computation as reference.

It’s not an error magnification, it’s only about the relative error: relerr = |(cpu - gpu)/cpu|.
When the result is close to 0, the relative error is quite big (0.0014) even with an absolute error around 10^-7.

By the way, the point remains: in a case like that, FMA leads to a 0.1% error with respect to the CPU.

It’s not an accuracy problem: both results (CPU and GPU) are accurate with respect to the way they are computed.

What surprised me is the order of magnitude of the difference, but after careful analysis it’s perfectly fine.

By the way, can you change the title of the subject to something like “FMA induced differences between CPU and GPU computation”?
Indeed, poor cosf has nothing to do with it…

OK, I got my mind wrapped around this now. The magnitude of the slope of cos() around pi/2 is approximately 1, so the absolute difference in the input to cos() is roughly equal to the absolute difference in the output of cos(). But since cos() approaches 0 in the vicinity of pi/2, a small absolute difference can result in a very large relative difference. A worked example:

x = 4.059810042e-01

cosine arg = 1.570801497e+00 (hex: 3fc91006) with FMA
cosine arg = 1.570801616e+00 (hex: 3fc91007) without FMA
The absolute difference in the cos arguments is about 1.19e-7 (or 1 ulp)

cos result = -5.169710676e-06 (hex: b6ad777a) with FMA
cos result = -5.288919965e-06 (hex: b6b1777a) without FMA
The absolute difference in the cos results is about 1.192e-7

The relative difference in the results is 2.253941074e-02

For this particular case the cos argument computed without the use of FMA is closer to the reference argument computed with double precision, 1.5708015800403281e+00. The argument computed with FMA has a relative error of -5.28e-8, while the argument computed without FMA has a relative error of 2.29e-8.
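
The magnification is easy to reproduce with the two arguments above (a host-side sketch; the host libm's cosf may differ from CUDA's cosf in the last bits, but the effect is the same):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* the two single-precision cos arguments from the example above */
    float arg_fma    = 1.570801497e+00f;  /* 0x3fc91006 */
    float arg_no_fma = 1.570801616e+00f;  /* 0x3fc91007, 1 ulp apart */
    float c1 = cosf(arg_fma);
    float c2 = cosf(arg_no_fma);
    printf("cosf(arg with FMA)    = %.9e\n", c1);
    printf("cosf(arg without FMA) = %.9e\n", c2);
    printf("abs diff = %e  rel diff = %e\n",
           fabs(c2 - c1), fabs((c2 - c1) / c1));
    return 0;
}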

I am afraid only moderators can change the titles of forum discussions, and I am not one of them.