In case you’re wondering whether *p was being cast from uint8_t to something else, I tried:
tmp1 = tmp1 && (((uint8_t)(*(p))) & 0x80);
with the same wrong result: it put ‘false’ into tmp1 as well (incorrect).
What is the difference between funcA and funcB (apart from the fact that funcB uses temporaries), and why does CUDA give the wrong answer in funcA? (Note that both pieces of code give the correct, ‘true’ value for tmp1 when run in emulation mode.)
The problem is that in both cases the & 0x80 is converted into a sign check, but in the second example the widening from 8 to 16 bits is done unsigned, so bit 7 no longer corresponds to the sign bit.
Neither version comes close to optimal, though: a single shift and a single and (which in this case could even be optimized away) should be enough, although I do not know the GPU assembly well enough to be sure.
But either way, from looking at the generated code, my advice is: do not use the bool type if you want good performance, and avoid boolean operations as much as possible; do them via bit fiddling instead, even though the GPU and the compiler are quite bad at that, too.
Code to test (it does not matter if bool or int is used):
#include <stdint.h>
__global__ void a(int *d, const uint8_t *p) {
    int t = 1;
    t = t && (*p & 0x80);
    d[0] = t;
}