discrepancy between CPU and GPU after a division (accuracy issue)

Hello all

I am currently migrating some CPU code on GPU and I am facing accuracy issue. I have this simple code
on GPU

__device__ float Simu_ExpoReduitGPU(float const a,
		                    float const b,
		                    float const p1,
		                    curandState * pState){
  float random, x, C;
  C=p1/(1-expf(-p1*b));
  float test = 1-expf(-p1*b) ;
  float testp1 = p1/test ;
  random=0.5;
  x=-(log(1-random*p1/C))/p1;
  printf("EXPO %f %f %f %f %f %f %f\n",x,p1,b,C,p1*b,test,testp1) ;
  return x;
}

On CPU:

float Simu_ExpoReduit(float const a, 
                      float const b, 
                      float const p1){ 
  float random, x, C;
  C=p1/(1-expf(-p1*b));
  float test = 1-expf(-p1*b) ;
  float testp1 = p1/test ;
  //float random= (float) rand()/RAND_MAX; Generateur C++ [0,1]
  random=0.5;
  x=-(log(1-random*p1/C))/p1;
  printf("EXPO %f %f %f %f %f %f %f\n",x,p1,b,C,p1*b,test,testp1) ;
  return x;
}

This piece of code is supposed to simulate an exponential law.
When I run this code on CPU I got the following result

EXPO 0.249519 0.015359 0.500000 2.007695 0.007679 0.007650 2.007695

on GPU
EXPO 0.249523 0.015359 0.500000 2.007679 0.007679 0.007650 2.007679

the variable testp1 has different values between CPU and GPU. In fact I will have accpeted some differences on the last digit, but in this case the two last digit are different ?
Is it normal ?

I have read

and

http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

but I haven’t been able to decide if this diff is expected or not.

I am compiling with cuda 5.5.

My GPU is a GTX680 (CC 3.0)

My CPU is a Intel® Xeon® CPU E5-2670 0 @ 2.60GHz
My architecture is x86_64
My kernel is : 3.2.68-server

Thank you very much for your help

Its probably better if you provide a complete, simplified test case.

The very first value you calculate in your function is C, and you are printing out C, and it seems to be the same as what you are reporting for testp1, and it is different between host and device.

You should provide a complete executable code, not just a function. Also in addition to the other information you hvae provided, indicate the exact compile command you are using,.

When I compile the following code, on linux, with CUDA 7, and run on a K40c, I get numbers that seem to match between host and device:

$ cat t742.cu
#include <stdio.h>

#define V1 0.79f
#define V2 1.274f

__host__ __device__ float test(float const b, float const p1){

  return p1/(1-expf(-p1*b));
}

__global__ void mykernel(){

  float C = test(V1, V2);
  printf("GPU = %f\n", C);
}

int main(){

  printf("CPU = %f\n", test(V1, V2));
  mykernel<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -arch=sm_35 -o t742 t742.cu
$ ./t742
CPU = 2.007914
GPU = 2.007914
$

In general, simple comparison of CPU and GPU results to establish some sort of correctness measure is poor methodology, as either set of computations includes accumulated error, and thus may deviate from the true mathematical result, with no apparent indication which is closer. The standard adage about this is that “the man with one watch always know what time it is, the man with two watches can never be sure”.

A better way to gauge quality is frequently to compare to a higher precision reference. See also NVIDIA’s floating-point white paper for a discussion of the issues surrounding simple host to device comparisons: https://developer.nvidia.com/sites/default/files/akamai/cuda/files/NVIDIA-CUDA-Floating-Point.pdf

The first thing you would want to examine is your compilation settings. If you compile the CUDA code with nvcc -arch=sm_30, single-precision division delivers correctly rounded results according to the IEEE-754 floating-point standard. If you specify flags like -prec-div=false, or -use_fast_math, a faster, approximate division will be substituted. Likewise, the optimization flag settings of your host compiler may cause division to be approximate rather than correctly rounded.

The differences in the final results may have nothing to do with the division per se. The host compiler may use a precision other than single-precision to evaluate the expression. Likewise the generation of FMA (fused multipy-add instructions) may improve accuracy on the GPU relative to the CPU.

A very likely source of numerical differences in this computation are the expf() results, as the accuracy of expf() is not governed by C/C++ standards; you can find the error bounds for CUDA math functions documented in an appendix of the CUDA C Programming Guide. The single-precision math functions in CUDA favor performance over accuracy based on typical use cases, nonetheless they typically have a maximum error of a few ulps. Small numerical differences may be amplified through further computation, e.g. through subtractive cancellation.

If you want to try a more accurate but slower expf() implementation than that provided by CUDA, check out the one I posted to Stackoverflow a small time back:

http://stackoverflow.com/questions/29381117/which-exponentiation-algorithms-do-cpu-programming-languages-use/29382053#29382053

Hello

I know it is a little bit late but thank you for your help.
Finally, I have been able to run some few test and it looks like cuda is computing the good result in my case. The problem seems to be related to one of the CPU optimization.

Regards