I need an algorithm running on the host to produce the same results as one running on the GPU, but I'm getting odd results with the following. The code is identical on the CPU and the GPU; the segments and results are:

## GPU & CPU results identical:

```c
float x, y;
…
x *= y;
x += .01;
```

## GPU & CPU results differ:

```c
float x, y;
x *= y;
x += .01f;
```

Really, though, I want the additive term to be a variable rather than a constant. So I try:

## GPU & CPU results differ:

```c
float x, y, z;
z = .01;
x *= y;
x += z;
```

## GPU & CPU results identical:

```c
float x, y, z;
z = .01;
x = fmaf(x, y, z);
```

So, why not just use fmaf()? Well, it kills the performance on the GPU by more than 25% overall.

So, how can I simulate the built-in FMAD instruction on the host? Any other options I haven’t considered? Thanks!