Why accuracy CPU and GPU not equal?

Why accuracy CPU and GPU not equal?

#include <cuda.h>
#include <stdio.h>

__global__ void calcGPU() {
	float x = 0.000001f;
	float y = expf(x);
	printf("GPU: %.9lg\n", y);
}

void calcCPU() {
	float x = 0.000001f;
	float y = expf(x);
	printf("CPU: %.9lg\n", y);
}

int main() {

	calcGPU<<<1, 1>>>();
	cudaDeviceReset();
	calcCPU();
	return 0;
}

out:

GPU: 1.00000107
CPU: 1.00000095

GPU

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “Quadro K600”
CUDA Driver Version / Runtime Version 6.5 / 6.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 1024 MBytes (1073414144 bytes)
( 1) Multiprocessors, (192) CUDA Cores/MP: 192 CUDA Cores
GPU Clock rate: 876 MHz (0.88 GHz)
Memory Clock rate: 891 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.5, CUDA Runtime Version = 6.5, NumDevs = 1, Device0 = Quadro K600

CPU

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel® Core™2 Duo CPU E6850 @ 3.00GHz
stepping : 11
microcode : 0xba
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow vnmi flexpriority
bogomips : 5984.72
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel® Core™2 Duo CPU E6850 @ 3.00GHz
stepping : 11
microcode : 0xba
cpu MHz : 1998.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm tpr_shadow vnmi flexpriority
bogomips : 5984.72
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

System

Linux sneg 3.13.0-37-generic #64-Ubuntu SMP Mon Sep 22 21:28:38 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Neither the IEEE-754 (2008) floating-point standard nor the ISO C and ISO C++ language standards prescribe a particular accuracy for the exponential functions exp() or expf(). This means that library implementers are free to make their own choices regarding tradeoffs between accuracy and performance. As a consequence, no two standard math libraries are likely to deliver bit-identical results for a particular math function called with a particular argument. You will make similar observations when comparing results from different math libraries on the CPU. Some libraries even offer multiple selectable accuracy modes within the library itself, and the results differ based on mode.

In the case at hand both libraries deliver results with a small ulp error, which in the case of CUDA is within the math function error bound stated in an appendix of the Programming Guide:

CPU res: 1.00000095e+000 bit pattern: 3f800008 ulp error vs mathematical: 0.388612
GPU res: 1.00000107e+000 bit pattern: 3f800009 ulp error vs mathematical: 0.611388

The CPU delivers the correctly rounded single-precision result, while CUDA’s result differs from the correctly rounded result by 1 ulp.

Another reason could be cpu uses 80 bit floating point precision internally, not sure if this applies to floats as well.

Perhaps that function uses doubles internally which could explain the accuracy difference.

Your results are well within single precision accuracy (10^-7) ie neither the CPU or GPU result is more accurate than the other.

GPU  =1.00000107;
>> CPU = 1.00000095;
>> rel_diff = abs(CPU-GPU)/CPU

rel_diff = 1.2000e-007

Generally though GPUs were more quick to adopt IEEE-754-2008 than for example intel CPUs which meant that for example FMA operations GPUs would be more accurate (nowadays i believe all intel CPUs are 754-2008 compliant).

As I already pointed out in #2, the difference between CPU and GPU result here is just one single-precision ulp (unit in the last place), that is, the least significant bit differs by 1, which is the smallest possible difference between two finite-precision floating-point numbers.

In this example, the CPU result is the correctly rounded single-precision result. Since there is only a single test vector for comparison, we do not know how the CPU implementation compares to the CUDA implementation in general. If we assume that the CPU math function proves to be more accurate overall, that could be due to any number of design goals and implementation choices, including but not limited to the use of higher precision in intermediate computations.

For many of the simpler math functions such as exp(), log(), atan(), there are known methods to achieve correctly rounded results for all possible arguments, however this comes at significant cost to execution speed. Since math libraries have to serve the needs of a large and diverse set of use cases and applications, targeting correctly rounded results regardless of performance is rarely the appropriate choice, although it provides advantageous properties including bit-wise identical results across software and hardware platforms, which is why basic floating-point arithmetic was standardized to deliver correctly rounded results by IEEE-754 back in 1985.

I was the original designer of the CUDA math library and maintained it for many years. The design criteria for the single-precision math functions I used at the outset were roughly these:

(1) Most applications that are based mostly or completely on single-precision computation require performance first and foremost, but can tolerate small relative error. This applies specifically to use cases that could roughly be characterized as signal processing, including image processing. The reason is that sensor data often has only 12 or 16 bits of resolution, but there is a huge quantity of data and there are often soft real-time processing requirements.

(2) All intermediate computations must be performed in single precision, as double precision support may be slow or non-existent. The latter no longer applies now that CUDA has dropped support for platforms < sm_13, but double precision may still be very slow.

(3) Where ever possible, single-precision math functions should take advantage of hardware acceleration, specifically the functionality of the “special function units”.

(4) Special case handling shall be compliant with the ISO C99 standard if at all possible. For example some math functions are defined with considerations for a dynamic rounding mode which is not supported by GPU hardware. In these cases the CUDA math library functions behave as if that mode is permanently forced to round to-nearest-or-even.

(5) There are specific considerations that make it desirable for some functions to be more accurate than others, based on typical usage including the use as building block for other math functions. For example, the accuracy of log() is typically more important than the accuracy of exp().

The resulting error bounds for single precision math functions are documented in an appendix to the CUDA Programming Guide. Generally speaking they tend to be slightly larger than the error bounds for the corresponding double-precision functions which have their own, somewhat different set of design criteria.

Thanks for your help!

CUDA Programming Guide:
http://docs.nvidia.com/cuda-c-programming-guide/index.html#mathematical-functions-appendix
http://docs.nvidia.com/floating-point/index.html#mathematical-function-accuracy