As GPU float division is not IEEE-compliant and accumulates significant error, I’d like to zero all fractional digits after the 6th one each time a division occurs (at least this should eliminate bad results that are hard to filter out).

For example: source value 0.123456111 should become 0.123456.
I think the fastest way is to manipulate the bits directly.

The programming guide says that division is accurate to within 2 ULP, so that shouldn’t cause any significant errors in the bits of the output. The non-IEEE part is mostly about handling division by 0, infinities, NaNs, etc.

Can you give an example where the divide gives the wrong answer?

The specific answer to your question is just a bit mask. Look at the IEEE FP format: you can reinterpret the raw float bit representation as an int, then mask out the last bits if you like. A C union, or just raw, ugly type casts, can do this.

float a=1234.34454343446f;
// mask out last 8 bits of FP value
*((unsigned int *)&a) &= 0xFFFFFF00;

Division itself is not too big a problem - the problem appears when divisions are nested (a / b * b / a * a / a is not always 1.0), and my task involves lots of such expressions. Sometimes the result is not 1.0 but, say, 1.00000012345, and then, when I subtract 1.0 from 1.00000012345, I get a very small value where I should get zero.

As my main goal is to get the same results for an expression evaluated on the CPU and on the GPU, I need a way to make division work identically on both (when my expressions use only + - * operations, without division, the results match perfectly). So I’m ready to truncate the result of division on both CPU and GPU in order to get the same quotient on both.

You propose zeroing 16 bits of the fraction part. Why 16?

I believe that zeroing won’t work at all … the fraction consists of bits, and each bit represents (1/2)^n, so zeroing binary bits does not correspond to zeroing decimal digits. Truncating the error would require rearranging the whole fraction, and the problem is the speed of that rearrangement.

So your problem isn’t a CUDA issue, it’s just the limited precision of IEEE single precision floating point? If your computation is so dependent on the very lowest bits of a floating point value, you’re going to run into problems on CPUs, GPUs, pretty much everywhere.

If CUDA’s 2 ULP accuracy really is the problem, you could improve on it by splitting your divide into a reciprocal (which is 1 ULP) followed by a multiply. And it’d be complete overkill, but a Newton iteration would probably bring the error down to 0 ULP.

// compute x/y three ways
float a, b, c;
a = x / y;              // plain divide: 2 ULP error
b = 1.0f / y;           // reciprocal: 1 ULP error
b = b * x;
c = 1.0f / y;
c = c * (2.0f - y * c); // one Newton iteration refines the reciprocal
c = c * x;

But I stress again, if you’re relying on those last bits of a float, that’s your real problem, not the divide accuracy! This rule is perhaps the most important practical guideline of all numeric computing.

That code snippet zeros out the last 8 bits, since you were asking how to zero out low fractional bits. You could change it to 4 bits by using a mask constant of 0xFFFFFFF0, or 10 bits by using 0xFFFFFC00, etc.

But again I don’t think this is what you want to do.

The main goal is not to achieve absolute accuracy, but to get the same (or approximately the same) results on CPU and GPU. The 1/x reciprocal and Newton iterations are helpful, thank you!