precision

Hello,

I have a question on the precision of floating point operations. The following codes shows an example:

[codebox]#include <cuda.h>
#include <stdio.h>

__global__ void compute(float *data) {
    data[0] = data[0] + data[1];
}

int main() {
    unsigned int udata1[] = {817504365, 1065158090};
    unsigned int udata2[] = {1065158090, 817504365};
    float *fdata = (float *)udata1;
    float *dfdata;
    float debug1, debug2;

    cudaMalloc((void **)&dfdata, 2 * sizeof(*dfdata));

    cudaMemcpy((void *)dfdata, (void const *)udata1, 2 * sizeof(*fdata), cudaMemcpyHostToDevice);
    compute<<<1,1>>>(dfdata);
    cudaMemcpy((void *)&debug1, (void const *)dfdata, 1 * sizeof(*fdata), cudaMemcpyDeviceToHost);

    cudaMemcpy((void *)dfdata, (void const *)udata2, 2 * sizeof(*fdata), cudaMemcpyHostToDevice);
    compute<<<1,1>>>(dfdata);
    cudaMemcpy((void *)&debug2, (void const *)dfdata, 1 * sizeof(*fdata), cudaMemcpyDeviceToHost);

    float temp = fdata[0] + fdata[1];
    printf("CPU: %.20f GPU1: %.20f GPU2: %.20f\n", temp, debug1, debug2);
    printf("CPU: %u GPU1: %u GPU2: %u\n", *(unsigned int *)&temp, *(unsigned int *)&debug1, *(unsigned int *)&debug2);

    cudaFree(dfdata);
}[/codebox]

Output:

[codebox]CPU: 0.98836958408355712891 GPU1: 0.98836958408355712891 GPU2: 0.98836958408355712891
CPU: 1065158090 GPU1: 1065158090 GPU2: 1065158090[/codebox]

Everything is fine; commutativity holds (a+b == b+a).

[codebox]#include <cuda.h>
#include <stdio.h>

__global__ void compute(float *data) {
    data[0] = data[0] * data[1] + data[2] * data[3];
}

int main() {
    unsigned int udata1[] = {893268102, 989497344, 1065253733, 1065257002};
    unsigned int udata2[] = {1065253733, 1065257002, 893268102, 989497344};
    float *fdata = (float *)udata1;
    float *dfdata;
    float debug1, debug2;

    /* the arrays hold 4 elements, so allocate and copy 4 (not 6) floats */
    cudaMalloc((void **)&dfdata, 4 * sizeof(*dfdata));

    cudaMemcpy((void *)dfdata, (void const *)udata1, 4 * sizeof(*fdata), cudaMemcpyHostToDevice);
    compute<<<1,1>>>(dfdata);
    cudaMemcpy((void *)&debug1, (void const *)dfdata, 1 * sizeof(*fdata), cudaMemcpyDeviceToHost);

    cudaMemcpy((void *)dfdata, (void const *)udata2, 4 * sizeof(*fdata), cudaMemcpyHostToDevice);
    compute<<<1,1>>>(dfdata);
    cudaMemcpy((void *)&debug2, (void const *)dfdata, 1 * sizeof(*fdata), cudaMemcpyDeviceToHost);

    float temp = fdata[0] * fdata[1] + fdata[2] * fdata[3];
    printf("CPU: %.20f GPU1: %.20f GPU2: %.20f\n", temp, debug1, debug2);
    printf("CPU: %u GPU1: %u GPU2: %u\n", *(unsigned int *)&temp, *(unsigned int *)&debug1, *(unsigned int *)&debug2);

    cudaFree(dfdata);
}[/codebox]

Output:

[codebox]CPU: 0.98836958408355712891 GPU1: 0.98836958408355712891 GPU2: 0.98836952447891235352
CPU: 1065158090 GPU1: 1065158090 GPU2: 1065158089[/codebox]

a*b + c*d != c*d + a*b?

The kernel performs the first multiplication with a mul instruction and the second with a mad instruction. Both multiplications are still done, but in the second case the add is folded into the mad.

Can anyone explain why the results are different?

Moritz

Because it is floating point arithmetic. See http://docs.sun.com/source/806-3568/ncg_goldberg.html for more info.

Thank you for the link.

But I do not see how this applies in this case. In the first example I used the results of the multiplications from the second example and showed that commutativity holds for the numbers I used, so rounding alone cannot be the problem here.

…Maybe mad uses a higher precision result of its multiplication for the addition…

The point is, the IEEE 754 specification doesn't require that all implementations store intermediate results the same way; precision can therefore be lost between operations.

Thus a*b*c*d could be quite different on the GPU than on the CPU (and in fact, I know it is quite different) due to how the intermediate results of a*b / c*d / x*y are stored: units in the last place (ULPs) of each intermediate are lost, losing precision and changing the final result.

See http://developers.sun.com/solaris/articles/fp_errors.html for more details, and of course the one and only http://www.validlab.com/goldberg/paper.pdf (which I think was already posted above).

For what it's worth, there's no point even trying to make your CUDA and CPU calculations match up exactly: you might get your current GPU and current CPU to agree nicely, but run the same code on an older or newer CPU/GPU, or a different C/C++/CUDA runtime, and you will quite simply get different results.

There are ways to increase the precision of FP calculations, as noted in the articles linked above; however, they're not always practical for complex algorithms, and they will of course eat into the performance of your kernel.

Edit: Similarly, even when comparing FP results on the same architecture, a*b*c*d isn't guaranteed to be the same as b*c*d*a (for example), again due to how intermediate results are stored; ULPs differ from loss of precision, etc., etc.

Second Edit: The CUDA compiler also optimizes +'s and *'s into fmads (as you've guessed), which are actually lower precision than plain old muls and adds, as stated in the programming guide. Use __fadd_rn(a, b) and __fmul_rn(a, b) to keep the compiler from optimizing your code into fmads. (Sorry I didn't address this before; I kind of went off on a tangent.)
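As a sketch, the kernel from the second example could be rewritten with those intrinsics so that each multiply and the add are rounded individually and cannot be contracted into a mad (trading some speed for reproducibility):

```cuda
__global__ void compute(float *data) {
    /* __fmul_rn / __fadd_rn each round to nearest as a separate
       operation, so the compiler cannot fuse them into a mad. */
    float p0 = __fmul_rn(data[0], data[1]);
    float p1 = __fmul_rn(data[2], data[3]);
    data[0] = __fadd_rn(p0, p1);
}
```

With this version, both argument orderings in the example above should produce the same bit pattern, since the intermediate products are rounded identically in each case.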