I am trying to find a higher-throughput way to conditionally invert a value based on a single bit:
// int a, b, c, d; d = (a & (1 << b)) ? c : ~c;
Here’s what the compiler currently does:
bfe.u32 d, a, b, 1
add.s32 d, d, -1
xor.b32 d, d, c
Quite clever, I'll have to admit, but are there faster representations? (Inverting c and then using selp is slower when the latency is hidden to a degree.)
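For anyone following along, here is a minimal host-side C sketch of why the three-instruction sequence works, next to the invert-then-select alternative. The function names (cond_inv_ref, cond_inv_xor, cond_inv_select, check_all_bits) are made up for illustration, and it assumes unsigned 32-bit operands with b in the range 0..31 rather than the plain int of the original snippet, so that the shifts are well defined. It only demonstrates equivalence and says nothing about throughput on sm_52.

```c
#include <assert.h>
#include <stdint.h>

/* Reference form from the question: keep c if bit b of a is set, else ~c.
   Assumes 0 <= b <= 31. */
static uint32_t cond_inv_ref(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & (UINT32_C(1) << b)) ? c : ~c;
}

/* Mirrors the compiler's bfe / add / xor sequence step by step. */
static uint32_t cond_inv_xor(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t bit  = (a >> b) & 1u;  /* bfe.u32: extract bit b, giving 0 or 1 */
    uint32_t mask = bit - 1u;       /* add -1: 1 -> 0x00000000, 0 -> 0xFFFFFFFF */
    return mask ^ c;                /* xor.b32: c if the bit was set, ~c otherwise */
}

/* The alternative mentioned in the post: invert c first, then select. */
static uint32_t cond_inv_select(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t inv = ~c;
    uint32_t bit = (a >> b) & 1u;
    return bit ? c : inv;           /* would correspond to a selp-style select */
}

/* Hypothetical helper: asserts that all three forms agree for every bit index. */
static void check_all_bits(uint32_t a, uint32_t c)
{
    for (uint32_t b = 0; b < 32; ++b) {
        assert(cond_inv_xor(a, b, c) == cond_inv_ref(a, b, c));
        assert(cond_inv_select(a, b, c) == cond_inv_ref(a, b, c));
    }
}
```

Calling check_all_bits(a, c) with any pair of values should pass silently. The same expressions should carry over to a __device__ function, though what the compiler emits for them may differ.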
(I’m using CUDA 7.5RC and targeting sm_52)
EDIT: Please see post #10 below for more information.
EDIT2: I’ve open-sourced my implementation. See post #48.