From the "Shuffle: Tips and Tricks" document:
```
int swap(int x, int mask, int dir)
{
    int y = __shfl_xor(x, mask);
    return x < y == dir ? y : x;
}
x = swap(x, 0x01, bfe(laneid, 1) ^ bfe(laneid, 0)); // 2
```
…
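To make the quoted step easier to follow, here is a minimal host-side sketch in plain C of what that single line does. It assumes a toy "warp" of 4 lanes and models `__shfl_xor` as simply reading the value held by lane `lane ^ mask`. The names `LANES` and `first_step` are made up for illustration and do not come from the slides.

```c
#define LANES 4  /* assumption: a 4-lane toy "warp", not a real 32-lane warp */

/* Return bit k of i (the single-bit use of bfe in the quoted line). */
static int bfe(int i, int k) { return (i >> k) & 1; }

/* Host model of x = swap(x, 0x01, bfe(laneid, 1) ^ bfe(laneid, 0)).
   out[lane] is the value that lane would hold afterwards;
   __shfl_xor(x, mask) is modeled as in[lane ^ mask]. */
static void first_step(const int *in, int *out)
{
    const int mask = 0x01;
    for (int lane = 0; lane < LANES; ++lane) {
        int x = in[lane];
        int y = in[lane ^ mask];                /* partner lane's value */
        int dir = bfe(lane, 1) ^ bfe(lane, 0);  /* 0: keep min, 1: keep max */
        out[lane] = ((x < y) == dir) ? y : x;
    }
}
```

With the input `{3, 1, 2, 4}` this model yields `{1, 3, 4, 2}`: lanes 0 and 1 end up in ascending order and lanes 2 and 3 in descending order, which is the alternating pattern a bitonic sort builds on.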
Also, if the laneid is already available in a register, I prefer to use it directly from C (like threadIdx) instead of wasting another register.
I am not aware of any __bfe() or __bfi() intrinsics. You may want to file an RFE (“request for enhancement”) with NVIDIA to have them added in a future CUDA version. RFEs can be filed through the bug reporting form; simply prefix the synopsis with “RFE:” to mark it as an enhancement request.
Since the only thing bfe(int i, int k) does is return the k-th bit of int i, I wrote my own version:
```
__device__ __forceinline__ int bfe(int x, int y)
{
int result = !!(x&(1<<y));
return result;
}
```
I would welcome feedback if the implementation is wrong, but for my intended purpose it works fine.
The BFE and BFI instructions deal with bit fields (groups of multiple consecutive bits), of which handling just one bit is a special case. If you just want to extract bit ‘i’ from a 32-bit integer ‘a’, why not use:
```
// return i-th bit of a
int extract_bit (unsigned int a, int i)
{
    return (a >> i) & 1;
}
```
Obviously, that doesn’t include checks for an out-of-bounds value of ‘i’, which you may want to add if you don’t like the behavior that falls out of the above code.
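For the more general multi-bit case, together with the bounds checks mentioned above, a minimal sketch in plain C might look like the following. The name `extract_field` and the choice to return 0 for out-of-range requests are assumptions made for illustration. It is not meant to reproduce the exact edge-case or sign-extension behavior of the hardware BFE instruction.

```c
#include <stdint.h>

/* Hypothetical helper: extract `len` bits of `a`, starting at bit `pos`
   (bit 0 is the least significant bit). Out-of-range requests return 0
   rather than performing a shift by 32 or more, which C leaves undefined. */
static uint32_t extract_field(uint32_t a, unsigned pos, unsigned len)
{
    if (pos >= 32u || len == 0u)
        return 0u;
    if (len > 32u - pos)
        len = 32u - pos;                 /* clamp the field to the word */
    uint32_t mask = (len == 32u) ? 0xFFFFFFFFu : ((1u << len) - 1u);
    return (a >> pos) & mask;
}
```

With `len` set to 1 this reduces to the single-bit extraction shown above; for example, `extract_field(0xABCD1234u, 4, 8)` picks out the byte `0x23`.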
You cannot prevent subsequent posters from re-utilizing existing threads. And in all fairness, although you never explained the semantics of the bfe() function used in your original post, it does look like the extraction of single bits to me: