Hello
I have the following C++ code that runs in a single one-dimensional block.
Variant A:
[codebox]
double2 b;
double2 d;
…
__syncthreads();
if (threadIdx.x == 0)
{
    d.y = 1. / (b.x * b.x + b.y * b.y);
    b.x *= d.y;
    b.y *= -d.y;
}
__syncthreads();
…
[/codebox]
which compiles to the following PTX:
[codebox]
rcp.rn.f64 %fd42, %fd41;
.loc 2 1005 0
mul.f64 %fd5, %fd42, %fd5;
.loc 2 1006 0
mul.f64 %fd43, %fd42, %fd6;
neg.f64 %fd6, %fd43;
[/codebox]
Variant B:
[codebox]
double2 b;
double2 d;
…
__syncthreads();
if (threadIdx.x == 0)
{
    d.y = b.x * b.x + b.y * b.y;
    b.x /= d.y;
    b.y /= -d.y;
}
__syncthreads();
…
[/codebox]
which compiles to the following PTX:
[codebox]
div.rn.f64 %fd5, %fd5, %fd41;
.loc 2 1006 0
neg.f64 %fd42, %fd41;
div.rn.f64 %fd6, %fd6, %fd42;
[/codebox]
The two variants should be equivalent, right? But their results differ in sign.
Variant B returns a negative b.x even though the original b.x was positive and d.y is definitely positive (it is the squared modulus).
The behavior is the same on Windows XP 32-bit and 64-bit, with CUDA 2.2 and 2.3.
Variant A works as a workaround, but where is the bug?
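For reference, here is a minimal host-side sketch of what both variants are expected to compute, namely the complex reciprocal 1 / (b.x + i*b.y) = (b.x - i*b.y) / (b.x^2 + b.y^2). It is plain C++ for the CPU, not the kernel itself: the `Double2` struct stands in for CUDA's `double2`, and the function names `reciprocal_mul`, `reciprocal_div`, `same_sign` and `check_variants_agree` are hypothetical, chosen only for this illustration. The tolerance used is also an assumption.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical host-side stand-in for CUDA's double2.
struct Double2 {
    double x;
    double y;
};

// Variant A: one reciprocal of the squared modulus, then two multiplications.
Double2 reciprocal_mul(Double2 b) {
    const double inv = 1.0 / (b.x * b.x + b.y * b.y);
    b.x *= inv;
    b.y *= -inv;
    return b;
}

// Variant B: two divisions by the squared modulus.
Double2 reciprocal_div(Double2 b) {
    const double norm2 = b.x * b.x + b.y * b.y;
    b.x /= norm2;
    b.y /= -norm2;
    return b;
}

// True when both values carry the same sign bit.
bool same_sign(double a, double b) {
    return std::signbit(a) == std::signbit(b);
}

// Asserts that both variants agree in sign and, up to an assumed
// relative tolerance, in value for a given nonzero input.
void check_variants_agree(Double2 b) {
    const Double2 a = reciprocal_mul(b);
    const Double2 d = reciprocal_div(b);
    const double tol = 1e-12;  // assumed tolerance, not taken from the kernel
    assert(same_sign(a.x, d.x));
    assert(same_sign(a.y, d.y));
    assert(std::fabs(a.x - d.x) <= tol * std::fabs(d.x));
    assert(std::fabs(a.y - d.y) <= tol * std::fabs(d.y));
}
```

On the host, calling `check_variants_agree` with, for example, `{3.0, 4.0}` should pass, since both functions give a positive x component and a negative y component. Comparing the values thread 0 produces on the device against such a reference may help narrow down whether the sign flip appears in the PTX shown above or only later in the toolchain.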
Thanks in advance