precision

Hello,

I have a question on the precision of floating point operations. The following codes shows an example:

[codebox]#include <cuda.h>
#include <stdio.h>

__global__ void compute(float *data) {
    data[0] = data[0] + data[1];
}

int main() {
    unsigned int udata1[] = {817504365, 1065158090};
    unsigned int udata2[] = {1065158090, 817504365};
    float *fdata = (float *)udata1;
    float *dfdata;
    float debug1, debug2;

    cudaMalloc((void **)&dfdata, 2 * sizeof(*dfdata));

    cudaMemcpy((void *)dfdata, (void const *)udata1, 2 * sizeof(*fdata), cudaMemcpyHostToDevice);
    compute<<<1,1>>>(dfdata);
    cudaMemcpy((void *)&debug1, (void const *)dfdata, 1 * sizeof(*fdata), cudaMemcpyDeviceToHost);

    cudaMemcpy((void *)dfdata, (void const *)udata2, 2 * sizeof(*fdata), cudaMemcpyHostToDevice);
    compute<<<1,1>>>(dfdata);
    cudaMemcpy((void *)&debug2, (void const *)dfdata, 1 * sizeof(*fdata), cudaMemcpyDeviceToHost);

    float temp = fdata[0] + fdata[1];
    printf("CPU: %.20f GPU1: %.20f GPU2: %.20f\n", temp, debug1, debug2);
    printf("CPU: %u GPU1: %u GPU2: %u\n", *(unsigned int *)&temp, *(unsigned int *)&debug1, *(unsigned int *)&debug2);

    cudaFree(dfdata);
}[/codebox]

Output:

[codebox]CPU: 0.98836958408355712891 GPU1: 0.98836958408355712891 GPU2: 0.98836958408355712891
CPU: 1065158090 GPU1: 1065158090 GPU2: 1065158090[/codebox]

Everything is fine; commutativity holds (a+b == b+a).

[codebox]#include <cuda.h>
#include <stdio.h>

__global__ void compute(float *data) {
    data[0] = data[0] * data[1] + data[2] * data[3];
}

int main() {
    unsigned int udata1[] = {893268102, 989497344, 1065253733, 1065257002};
    unsigned int udata2[] = {1065253733, 1065257002, 893268102, 989497344};
    float *fdata = (float *)udata1;
    float *dfdata;
    float debug1, debug2;

    /* the arrays hold 4 elements, so allocate and copy 4 (not 6) floats */
    cudaMalloc((void **)&dfdata, 4 * sizeof(*dfdata));

    cudaMemcpy((void *)dfdata, (void const *)udata1, 4 * sizeof(*fdata), cudaMemcpyHostToDevice);
    compute<<<1,1>>>(dfdata);
    cudaMemcpy((void *)&debug1, (void const *)dfdata, 1 * sizeof(*fdata), cudaMemcpyDeviceToHost);

    cudaMemcpy((void *)dfdata, (void const *)udata2, 4 * sizeof(*fdata), cudaMemcpyHostToDevice);
    compute<<<1,1>>>(dfdata);
    cudaMemcpy((void *)&debug2, (void const *)dfdata, 1 * sizeof(*fdata), cudaMemcpyDeviceToHost);

    float temp = fdata[0] * fdata[1] + fdata[2] * fdata[3];
    printf("CPU: %.20f GPU1: %.20f GPU2: %.20f\n", temp, debug1, debug2);
    printf("CPU: %u GPU1: %u GPU2: %u\n", *(unsigned int *)&temp, *(unsigned int *)&debug1, *(unsigned int *)&debug2);

    cudaFree(dfdata);
}[/codebox]

Output:

[codebox]CPU: 0.98836958408355712891 GPU1: 0.98836958408355712891 GPU2: 0.98836952447891235352
CPU: 1065158090 GPU1: 1065158090 GPU2: 1065158089[/codebox]

a*b + c*d != c*d + a*b?

The kernel performs the first multiplication with a mul instruction and the second with a mad instruction. Both multiplications are still done, but in the second case the add is folded into the mad.

Can anyone explain why the results are different?

Moritz

Because it is floating point arithmetic. See http://docs.sun.com/source/806-3568/ncg_goldberg.html for more info.

Thank you for the link.

But I do not see how this applies in this case. In the first example I used the results of the multiplications from the second example and showed that commutativity holds for the numbers I used, so rounding alone cannot be the problem here.

…Maybe mad uses a higher precision result of its multiplication for the addition…

The point is, the IEEE 754 specification doesn't require that all implementations store intermediate results the same way; precision can therefore be lost between operations.

Thus a*b*c*d could be quite different on the GPU than on the CPU (and in fact, I know it is quite different) due to how the intermediate results of a*b / c*d / x*y are stored: units in the last place (ULPs) of each intermediate are lost, losing precision and changing the final result.

See http://developers.sun.com/solaris/articles/fp_errors.html for more details, and of course the one and only http://www.validlab.com/goldberg/paper.pdf (which I think was already posted above).

For what it's worth, there's no point even trying to make your CUDA and CPU calculations match up exactly: you might get your current GPU and current CPU to agree nicely, but run the same code on an older or newer CPU/GPU, or a different C/C++/CUDA runtime, and you will quite simply get different results.

There are ways to increase the precision of FP calculations, as noted in the articles linked above; however, they're not always practical for complex algorithms, and they will of course eat into the performance of your kernel.

Edit: Similarly, even when comparing FP results on the same architecture, a*b*c*d isn't guaranteed to be the same as b*c*d*a (for example), again due to how intermediate results are stored; ULPs differ from loss of precision, etc., etc.

Second Edit: The CUDA compiler also optimizes +'s and *'s into fmads (as you've guessed), which are actually lower precision than plain old muls and adds, as stated in the programming guide. Use __fadd_rn(a, b) and __fmul_rn(a, b) to keep the compiler from optimizing your code into fmads. (Sorry I didn't address this before; I kind of went off on a tangent.)
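As a sketch, the kernel from the second example could be rewritten with those intrinsics so that each multiply and the add are rounded individually and cannot be contracted into a mad (trading some speed for reproducibility):

```cuda
__global__ void compute(float *data) {
    /* __fmul_rn / __fadd_rn each round to nearest as a separate
       operation, so the compiler cannot fuse them into a mad. */
    float p0 = __fmul_rn(data[0], data[1]);
    float p1 = __fmul_rn(data[2], data[3]);
    data[0] = __fadd_rn(p0, p1);
}
```

With this version, both argument orderings in the example above should produce the same bit pattern, since the intermediate products are rounded identically in each case.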