PTX: setp troubles. Bug, or just me doing something wrong?

Edit 4: Not solved. I even converted to indexing by floats inside my application.

    ld.const.f32 v, [value];
    st.global.f32 [dbg], v;
    setp.gt.f32 p, v, 2.;
    @p bra END;

still returns {2., 0.}, meaning that ‘p’ was incorrectly set to true.

    mov.f32 v, 2.;
    st.global.f32 [dbg], v;
    setp.gt.f32 p, v, 2.;
    @p bra END;

returns {2., 1234.}, meaning that p was this time correctly set to false. Note, however, that the contents of dbg[0] are bit-for-bit identical in both cases, meaning that the contents of register ‘v’ should also be identical. Please help if you can!

Original post:

I’m writing some PTX by hand, and I’ve hit on a strange problem.

Let’s say I have the following. (This is a code snippet, and I haven’t set up the infrastructure to test it, but I have tested the code in the middle of a larger bit of PTX. I imagine the full context may be important; I’m happy to attach or email the code.)

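    .version 1.2
    .target sm_11, map_f64_to_f32

    .const .u32 value;
    .global .u32 dbg[2];

    .entry main
    {
        .reg .u32 v;
        .reg .pred p;

        ld.const.u32 v, [value];       // load the host-supplied constant
        st.global.u32 [dbg], v;        // dbg[0] = v, to confirm what was loaded
        setp.eq.u32 p, v, 2;           // p = (v == 2)
        @p bra.uni END;                // skip the marker store when p is true
        st.global.u32 [dbg+4], 1234;   // dbg[1] = 1234, reached only if p is false

    END:
        bar.sync 0;
        exit;
    }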
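To make the expected results explicit, here is a small host-side reference model of the kernel above. It is only a sketch of what the PTX is supposed to do, assuming ‘dbg’ is zeroed before launch as described in the post; the function name `expected_dbg` is made up for illustration and is not part of my actual code.

```c
#include <stdint.h>

/* Hypothetical reference model of the kernel (not the original host code).
 * Fills dbg[] with the values the kernel is expected to leave behind. */
static void expected_dbg(uint32_t value, uint32_t dbg[2])
{
    dbg[0] = value;      /* st.global.u32 [dbg], v                       */
    dbg[1] = 0;          /* assumption: host zeroed dbg before launch    */
    if (value != 2) {    /* setp.eq.u32 p, v, 2; @p bra.uni END          */
        dbg[1] = 1234;   /* st.global.u32 [dbg+4], 1234 (p was false)    */
    }
}
```

Calling `expected_dbg(2, dbg)` leaves `dbg` as {2, 0}; any other input gives {value, 1234}.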

I copy 2 into ‘value’ using cuMemcpyHtoD, zero ‘dbg’ on the device, run the kernel, then copy ‘dbg’ to host for reading. No matter what I do, the predicate appears to have the opposite value of what I expect. For instance, in the above code, ‘dbg’ contains {2, 1234}. Changing the setp and bra lines to:

    setp.eq.u32 p, v, 2;
    @!p bra.uni END;

gives {2, 0}.

    setp.ne.u32 p, v, 2;
    @p bra.uni END;

gives {2, 1234}.

    .reg .u32 t1;
    mov.u32 t1, 0xffffffff;
    setp.le.u32 p, v, t1;
    @p bra.uni END;

gives {2, 1234} - which is theoretically impossible no matter what the value of ‘v’. Similarly,

    setp.gt.u32 p, v, 0xffffffff;
    @p bra.uni END;

gives {2, 0}. Again, impossible.

But here’s the thing:

    mov.u32 v, 2;
    setp.eq.u32 p, v, 2;
    @p bra.uni END;

gives {2, 0}, which is what you’d expect. In fact, explicitly setting the value of ‘v’ suddenly makes every single test above, and many others I’ve tried, work correctly. Any ideas?

Edit: Aha! More info. Changing the ld. line to:

    mov.u32 v, 0;
    ld.const.u32 v, [value];

ALSO produces correct results! It should be noted that I manipulate a few u64s in the PTX before this code. Could it be possible that ld sets only the lower 32 bits of the register, leaving the upper bits unchanged, but that setp is comparing based on the full 64-bit maximum width? If the mov.u32 zeros the upper bits of the register, this would explain the behavior.
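To spell out that guess, here is a toy model in C. It only illustrates the hypothesis; it makes no claim about how the hardware or ptxas actually handles registers, and every name in it (`write_low32`, `mov_zero_extend`, `eq_full_width`, `eq_low32`) is invented for the sketch.

```c
#include <stdint.h>

/* Hypothetical: a 64-bit register where a 32-bit load replaces only the
 * low half and leaves the upper 32 bits untouched. */
static uint64_t write_low32(uint64_t reg, uint32_t v)
{
    return (reg & 0xFFFFFFFF00000000ull) | (uint64_t)v;
}

/* Hypothetical: a 32-bit mov that clears the whole register first. */
static uint64_t mov_zero_extend(uint32_t v)
{
    return (uint64_t)v;
}

/* Comparison that looks at all 64 bits (the suspected faulty behaviour). */
static int eq_full_width(uint64_t reg, uint32_t imm)
{
    return reg == (uint64_t)imm;
}

/* Comparison that looks only at the low 32 bits (the intended behaviour). */
static int eq_low32(uint64_t reg, uint32_t imm)
{
    return (uint32_t)reg == imm;
}
```

In this model, loading 2 into a register that still holds stale upper bits makes `eq_full_width` fail while `eq_low32` succeeds, and doing `mov_zero_extend(0)` before the load makes both agree.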

Thanks for your help.

Oh, and system info:

PTXAS version V0.2.1221, CUDA version 2.0 beta 2 (I think)

NVIDIA driver version 177.13

Gentoo Linux AMD64, running on c2d 6400 @ 2.35 GHz, 4G RAM, P965 chipset, GeForce 8800 GTS. Compiz is running but many other tests seem cool with that.

Just noticed there are new versions out. Will test again with those.

Edit 2: Don’t know if it was the new version, the reboot, or a flaw in my testing methodology, but the fix mentioned above - zeroing the destination register before loading - no longer produces correct results. I have no idea what the problem is.

Edit 3: Solved, I think. It seems that you can’t follow an ‘ld’ with a ‘setp’ on the same register directly; you must perform an operation on that register first. The trick is finding one that can’t be optimized out. In my case, ‘and.b32 v, v, 16;’ between the ld and the setp made it behave correctly. Independent confirmation would be appreciated!

How did you fill the address of value with 2.0 (float)? Or rather, how do you think your compiler fills it? If you pass the value as 0x00000002, that is 2 as an unsigned integer; it must be loaded with ld.u32 into an integer register and then converted with cvt into a float register. If you pass 0x40000000 (the hexadecimal representation of 2.0 as a float), then it should be OK. You didn’t provide your host code, so I’m guessing.
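The difference between the two bit patterns can be checked on the host with a few lines of C. This is a minimal sketch assuming IEEE 754 single precision for `float`; the helper names `float_bits` and `bits_to_float` are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw 32-bit pattern of a float (assumes a 32-bit IEEE 754 float). */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* well-defined way to reinterpret the bits */
    return u;
}

/* Reinterpret a raw 32-bit pattern as a float, as ld.const.f32 would. */
static float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

`float_bits(2.0f)` is 0x40000000, whereas `bits_to_float(2)` is a tiny denormal, nowhere near 2.0. Converting the integer by value, `(float)2u`, is the host-side analogue of loading with ld.u32 and then using cvt.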

Second, about the parts where you played with integers: if you need comparisons of UNSIGNED types, always use LO, LS, HI, HS. (LT, LE, GT, GE do signed comparison, meaning the value 0xFFFFFFFF from your example is negative.)
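A short C sketch of what a signed reading of those operands would look like, assuming 32-bit two's complement; the helper names are invented for illustration. Under a signed reading, 2 <= 0xFFFFFFFF is false and 2 > 0xFFFFFFFF is true, which would be consistent with the two "impossible" results in the original post, though it would not by itself account for the setp.eq result.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the same 32 bits as a signed value (0xFFFFFFFF becomes -1). */
static int32_t as_signed(uint32_t u)
{
    int32_t s;
    memcpy(&s, &u, sizeof s);
    return s;
}

/* Unsigned comparisons: what LS and HI are meant to compute. */
static int ls_unsigned(uint32_t a, uint32_t b) { return a <= b; }
static int hi_unsigned(uint32_t a, uint32_t b) { return a > b; }

/* Signed comparisons applied to the same bit patterns. */
static int le_signed(uint32_t a, uint32_t b) { return as_signed(a) <= as_signed(b); }
static int gt_signed(uint32_t a, uint32_t b) { return as_signed(a) > as_signed(b); }
```

With a = 2 and b = 0xFFFFFFFF, the unsigned helpers give the mathematically expected answers, and the signed ones give the opposite.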

I did make sure that the value of 2.0f was 0x40000000 on both host and device. WRT .lo and friends,