bfe, bfi, laneid

Vectorizer · November 16, 2014, 6:19pm

I was wondering how I can use these on sm_30; __bfe() does not work.
I cannot find intrinsics for them in the CUDA Math API section of the CUDA Toolkit Documentation.

From the "Shuffle: Tips and Tricks" document:
int swap(int x, int mask, int dir)
{
    int y = __shfl_xor(x, mask);
    return (x < y) == dir ? y : x;
}

x = swap(x, 0x01, bfe(laneid, 1) ^ bfe(laneid, 0)); // 2
…
Also, if the laneid is already available in a register, I prefer to use it directly from C (like threadIdx) instead of wasting another register.

Thanks

njuffa · November 16, 2014, 7:23pm

You could try the wrapper functions in this code:

https://github.com/NVlabs/moderngpu/blob/master/include/device/intrinsics.cuh

I am not aware of __bfe() and __bfi() intrinsics. You may want to file an RFE (“request for enhancement”) with NVIDIA to have them added in future CUDA versions. RFEs can be filed through the bug reporting form; simply prefix the synopsis with “RFE:” to mark it as an enhancement request.
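
For readers who want to see what such wrappers can look like, here is a minimal sketch using inline PTX. It is not the moderngpu code; the names `BF_HD`, `bfe_u32`, `bfi_b32` and `lane_id` are made up for this illustration. The host fallbacks are an assumption about how one might mirror the PTX behaviour (only the low 8 bits of the position and length operands are used, and bits past bit 31 are dropped) so that the same file also builds with a plain C++ compiler.

```cpp
#include <cstdint>

// BF_HD is a hypothetical helper macro: it lets the same source build with
// nvcc (host + device) and with a plain C++ compiler (host only).
#if defined(__CUDACC__)
#define BF_HD __host__ __device__ __forceinline__
#else
#define BF_HD inline
#endif

// Extract `len` bits of `x` starting at bit `pos`, zero-extended.
BF_HD std::uint32_t bfe_u32(std::uint32_t x, std::uint32_t pos, std::uint32_t len)
{
#if defined(__CUDA_ARCH__)
    std::uint32_t r;
    asm("bfe.u32 %0, %1, %2, %3;" : "=r"(r) : "r"(x), "r"(pos), "r"(len));
    return r;
#else
    // Host fallback (assumed equivalent): only the low 8 bits of pos/len count.
    pos &= 0xffu;
    len &= 0xffu;
    if (len == 0u || pos >= 32u) return 0u;
    const std::uint32_t shifted = x >> pos;
    return (len >= 32u) ? shifted : (shifted & ((1u << len) - 1u));
#endif
}

// Insert the low `len` bits of `field` into `base` starting at bit `pos`.
BF_HD std::uint32_t bfi_b32(std::uint32_t field, std::uint32_t base,
                            std::uint32_t pos, std::uint32_t len)
{
#if defined(__CUDA_ARCH__)
    std::uint32_t r;
    asm("bfi.b32 %0, %1, %2, %3, %4;"
        : "=r"(r) : "r"(field), "r"(base), "r"(pos), "r"(len));
    return r;
#else
    // Host fallback (assumed equivalent): bits that would land past bit 31 are dropped.
    pos &= 0xffu;
    len &= 0xffu;
    if (len == 0u || pos >= 32u) return base;
    std::uint32_t mask = (len >= 32u) ? 0xffffffffu : ((1u << len) - 1u);
    mask <<= pos;
    return (base & ~mask) | ((field << pos) & mask);
#endif
}

#if defined(__CUDACC__)
// Read the %laneid special register (device code only).
__device__ __forceinline__ unsigned int lane_id()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}
#endif
```

With wrappers like these, the expression from the question could be written as `bfe_u32(lane_id(), 1, 1) ^ bfe_u32(lane_id(), 0, 1)`. Whether the compiler keeps the lane ID in a single register across both calls is up to its optimizer.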

david.muramatsu · January 10, 2018, 6:17pm

So maybe this is a bit late to answer that question, but I ran into the same problem and here’s how I solved it:

The code where I ran into this is from http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf, p. 19. bfe refers to an integer arithmetic instruction from the Parallel Thread Execution (PTX) ISA, documented here: http://docs.nvidia.com/cuda/pdf/ptx_isa_3.1.pdf.
Since the only thing bfe(int i, int k) does is return the k-th bit of int i, I wrote my own version:

```
__device__ __forceinline__ int bfe(int x, int y)
{
    int result = !!(x & (1 << y));
    return result;
}
```

It would be great to hear feedback on this if the implementation is wrong, but for my intended purpose it works fine.

Best,

David

njuffa · January 10, 2018, 6:28pm

The BFE and BFI instructions deal with bit fields (groups of multiple consecutive bits), of which handling just one bit is a special case. If you just want to extract bit ‘i’ from a 32-bit integer ‘a’, why not use

// return i-th bit of a
int extract_bit (unsigned int a, int i)
{
    return (a >> i) & 1;
}

Obviously, that doesn’t include checks for an out-of-bounds value of ‘i’, which you may want to add if you don’t like the behavior that falls out of the above code.
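
As one possible way to add such a check, here is a small sketch. The name `extract_bit_checked` and the policy of returning 0 for an out-of-range index are assumptions made for illustration, not something prescribed in this thread. The check matters because shifting a 32-bit value by a negative amount or by 32 or more is undefined behavior in C++.

```cpp
#include <cstdint>

// Return bit `i` of `a`, or 0 if `i` is outside 0..31.
// (Hypothetical helper; for device code one would add __device__.)
inline int extract_bit_checked(std::uint32_t a, int i)
{
    if (i < 0 || i > 31) {
        return 0;  // assumed policy: out-of-range positions read as 0
    }
    return static_cast<int>((a >> i) & 1u);
}
```

Returning 0 matches what an unsigned bit-field extract yields for positions past the most significant bit; an assertion or an error code would be equally reasonable choices depending on the caller.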

Vectorizer · January 10, 2018, 9:18pm

The question was asking how to use the specific machine instruction bfe, not the trivial task of extracting bit(s) from an integer!!!

njuffa · January 10, 2018, 9:29pm

You cannot prevent subsequent posters from re-utilizing existing threads. And in all fairness, although you never explained the semantics of the bfe() function used in your original post, it does look like the extraction of single bits to me:

x = swap(x, 0x01, bfe(laneid, 1) ^ bfe(laneid, 0)); // 2
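
To make the role of those two single-bit extractions concrete, the following host-side sketch simulates one such step over an array that stands in for the lanes of a warp. The helper names `direction_bit` and `simulate_swap_step` are hypothetical, the lane-to-lane exchange is modelled by plain indexing with `lane ^ mask` instead of a real shuffle, and the input size is assumed to be a power of two larger than `mask`.

```cpp
#include <cstddef>
#include <vector>

// XOR of bits `hi` and `lo` of the lane index, i.e. the `dir` argument
// that bfe(laneid, hi) ^ bfe(laneid, lo) would produce for single bits.
inline int direction_bit(int lane, int hi, int lo)
{
    return ((lane >> hi) & 1) ^ ((lane >> lo) & 1);
}

// Apply one compare-and-swap step to every simulated lane.
// in[lane ^ mask] plays the role of the value received from the partner lane.
inline std::vector<int> simulate_swap_step(const std::vector<int>& in,
                                           int mask, int hi, int lo)
{
    std::vector<int> out(in.size());
    for (std::size_t lane = 0; lane < in.size(); ++lane) {
        const int x = in[lane];
        const int y = in[lane ^ static_cast<std::size_t>(mask)];
        const int dir = direction_bit(static_cast<int>(lane), hi, lo);
        // Same selection rule as swap(): take the partner's value when
        // the comparison result equals the direction flag.
        out[lane] = (static_cast<int>(x < y) == dir) ? y : x;
    }
    return out;
}
```

Calling `simulate_swap_step({3, 1, 2, 5}, 1, 1, 0)` orders lanes 0 and 1 ascending and lanes 2 and 3 descending, which is the alternating pattern the first stage of a bitonic sort needs.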

Worked example implementations for bfe() and bfi() functions with different signatures can be found here:

https://github.com/moderngpu/moderngpu/blob/master/src/moderngpu/intrinsics.hxx

david.muramatsu · January 10, 2018, 10:46pm

Ah, thanks njuffa, your solution is a lot more elegant. I’ll happily steal it. ;) Also thanks for the github link, that’s really interesting.

Sorry Vectorizer, I misread the intention of your post.