I need an algorithm running on the host to produce the same results as one running on the GPU, and I’m getting odd results with the following. I have identical code running on the CPU and on the GPU; the code segments and their results are:
GPU & CPU results identical:
float x, y;
…
x *= y;
x += .01;
GPU & CPU results differ:
float x, y;
…
x *= y;
x += .01f;
Really, though, I want the additive portion to be a variable rather than a constant. So I try:
GPU & CPU results differ:
float x, y, z;
z = .01;
x *= y;
x += z;
GPU & CPU results identical:
float x, y, z;
z = .01;
x = fmaf(x, y, z);
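
For reference, here is a stripped-down, self-contained version of the comparison I’m running; the kernel name, launch configuration, and input values below are placeholders rather than my real application code:

#include <cstdio>
#include <cstring>
#include <cmath>
#include <cuda_runtime.h>

// Reinterpret a float's bit pattern so results can be compared exactly.
static unsigned bits(float f) { unsigned u; memcpy(&u, &f, sizeof u); return u; }

// out[0]: separate multiply and add (the compiler may contract this into FMAD/FFMA).
// out[1]: explicit fused multiply-add via fmaf().
__global__ void madKernel(float x, float y, float z, float *out)
{
    float a = x;
    a *= y;
    a += z;
    out[0] = a;
    out[1] = fmaf(x, y, z);
}

int main()
{
    // Placeholder inputs; the real code initializes x and y elsewhere.
    float x = 1.2345678f, y = 7.6543210f, z = 0.01f;

    // Host versions. Whether the host compiler contracts the first pair into
    // an FMA of its own depends on its flags (e.g. -ffp-contract).
    float h_plain = x * y;
    h_plain += z;
    float h_fma = fmaf(x, y, z);

    float h_out[2];
    float *d_out;
    cudaMalloc((void **)&d_out, 2 * sizeof(float));
    madKernel<<<1, 1>>>(x, y, z, d_out);
    cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    // Compare bit patterns, not rounded decimal prints.
    printf("host   mul+add : %.9g (0x%08x)\n", h_plain,  bits(h_plain));
    printf("host   fmaf    : %.9g (0x%08x)\n", h_fma,    bits(h_fma));
    printf("device mul+add : %.9g (0x%08x)\n", h_out[0], bits(h_out[0]));
    printf("device fmaf    : %.9g (0x%08x)\n", h_out[1], bits(h_out[1]));
    return 0;
}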
So, why not just use fmaf()? Because it reduces overall GPU performance by more than 25%.
So, how can I simulate the built-in FMAD instruction on the host? Any other options I haven’t considered? Thanks!
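
One direction that might be worth mentioning, assuming the __fmul_rn()/__fadd_rn() device intrinsics behave as documented (the CUDA compiler is not supposed to contract them into FMAD), is to go the other way and force separate rounding on the GPU instead of fusing on the host. A rough, unmeasured sketch:

// Sketch only: force separately rounded multiply and add on the device so it
// matches the plain CPU sequence. The kernel name is a placeholder, and I have
// not measured whether this is cheaper than fmaf() on my hardware.
__global__ void separateRoundingKernel(float x, float y, float z, float *out)
{
    float a = __fmul_rn(x, y);  // rounded single-precision multiply, not contracted
    *out    = __fadd_rn(a, z);  // rounded single-precision add, not contracted
}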