Temporary uint8_t's... CUDA not giving right answers unless a temporary is used

The following code doesn’t give the right answer for tmp1 (I’m expecting ‘true’):

__device__ void funcA(const uint8_t* p) // p[0] = 0x80;
{
    bool tmp1 = true;
    tmp1 = tmp1 && (*(p) & 0x80); // gives tmp1 = false (incorrect)
}

BUT, if I use a temporary, it does give the right answer:

__device__ void funcB(const uint8_t* p) // p[0] = 128;
{
    bool tmp1 = true;
    uint8_t tmp = *(p); // tmp = 128;
    tmp1 = tmp1 && (tmp & 0x80); // gives tmp1 = true (correct)
}

In case you’re wondering whether *p was being promoted from uint8_t into something else, I also tried:

   tmp1 = tmp1 && (((uint8_t)(*(p))) & 0x80);

with the same wrong result: it put ‘false’ into tmp1 as well (incorrect).

What is the difference between funcA and funcB (apart from the use of a temporary), and why is CUDA giving the wrong answer in funcA? (Note that both pieces of code give the correct, ‘true’, value for tmp1 when run in emulator mode.)

Any insights would be useful,

Thanks in advance,

-A

Compiler bug. Just compile with nvcc --ptx and you will see (tested with the CUDA 2.0beta2 nvcc):

The second variant (funcB) becomes (parts of the code left out):

ld.global.u8 %rh1, [%rd2+0];
cvt.s8.s32 %r2, %r1;
mov.s32 %r3, 0;
set.lt.u32.s32 %r4, %r2, %r3;

while the first one (funcA) becomes:

ld.global.u8 %rh1, [%rd1+0];
mov.s16 %rh2, %rh1;
mov.u16 %rh3, 0;
setp.ge.s16 %p1, %rh2, %rh3;

The problem is: in both cases the & 0x80 is converted into a sign check, but in funcA’s code (the second listing) the expansion from 8 to 16 bits is done unsigned - ld.global.u8 zero-extends the byte - so bit 7 no longer corresponds to the sign bit and the signed 16-bit comparison tests the wrong thing. In funcB’s code the cvt.s8.s32 sign-extends the byte, so the < 0 test is correct.
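To see the same effect on the host, here is a minimal sketch (mine, not from the thread; it assumes the usual two’s-complement conversion behaviour, which is what the correct PTX mimics):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t v = 0x80;          // bit 7 set, value 128
    int32_t sext = (int8_t)v;  // sign-extend: -128 on two's-complement targets
    int32_t zext = v;          // zero-extend: 128
    // (v & 0x80) != 0 is equivalent to a "< 0" test only after sign extension:
    printf("%d %d\n", sext < 0, zext < 0); // prints "1 0"
    return 0;
}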

Neither version looks anywhere near optimal though: a single shift and a single AND (which in this case could be optimized away) should be enough - though I do not know the GPU asm well enough to be sure. (See the sketch after the test code below.)

But either way, from looking at the generated code, my advice is: do not use the bool type if you want good performance, and avoid boolean operations as much as possible - do them via bit fiddling instead, even if the GPU and the compiler are quite bad at that, too.

Code to test (it does not matter if bool or int is used):

#include <stdint.h>

__global__ void a(int *d, const uint8_t *p) {
    int t = 1;
    t = t && (*p & 0x80);
    d[0] = t;
}
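And for completeness, a sketch of the shift-and-AND rewrite suggested above (the kernel name a_bits and the rewrite itself are mine; it assumes t only ever holds 0 or 1, as it does here):

#include <stdint.h>

// Equivalent to t = t && (*p & 0x80) when t is 0 or 1:
// shift bit 7 down to position 0 and AND - no boolean conversion involved.
__global__ void a_bits(int *d, const uint8_t *p) {
    int t = 1;
    t = t & ((*p >> 7) & 1);
    d[0] = t;
}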

Very useful, thanks!