I have found a very surprising behavior of the cosf function.

I execute the following kernel on a vector of 128*1024 elements.

```
__global__ void calc_device(float *data)
{
    // one thread per element; grid is sized to cover the whole vector
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    const float x = data[tid];
    data[tid] = 3.0f * x * x * cosf(x * x + x + 1.0f);
}
```

All the components of the vector are in [0,1].

I computed the relative error between the GPU result and the same function evaluated on the CPU and got this:

- when compiled without the -G option: relative error around 2×10^-7 (that's fine)
- when compiled with the -G option: relative error around 0.001 (that's not acceptable)

I ran several tests with the 5.0 and 4.2 toolkits on compute capability 2.0 and 3.0 devices, with and without optimization (-O), and got the same result, which depends exclusively on the -G flag.

I also tried compiling with --use_fast_math; in that case the -G flag has no influence and the relative error is around 0.05 (5% sounds surprisingly high to me).

I don’t understand this kind of behavior. Why does generating debuggable device code drastically change the precision of cosf?

Can somebody give me the exact mathematical implementation of cosf and __cosf?

Any idea would be greatly appreciated.

In case it helps, you will find the full code attached below.

code.zip (3.54 KB)