bfe, bfi, laneid

How can I use these on sm_30? __bfe() does not work, and I cannot find intrinsics for them at http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__INT.html#group__CUDA__MATH__INTRINSIC__INT

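From the "Shuffle: Tips and Tricks" document:
// Exchange x with the lane selected by mask, then keep one of the two values.
__device__ int swap(int x, int mask, int dir)
{
    int y = __shfl_xor(x, mask);      // value held by the partner lane
    return ((x < y) == dir) ? y : x;  // dir selects which of the pair to keep
}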

x = swap(x, 0x01, bfe(laneid, 1) ^ bfe(laneid, 0)); // 2

Also, if laneid is already available in a register, I would prefer to read it directly from C (like threadIdx) rather than waste another register.

Thanks

You could try the wrapper functions in this code:

https://github.com/NVlabs/moderngpu/blob/master/include/device/intrinsics.cuh
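
For readers who only want the gist, here is a minimal sketch of what such wrappers can look like. It is not copied from the linked file. The names bfe_u32, bfi_b32, lane_id and the HD macro are made up for this illustration. The device path uses inline PTX (bfe.u32, bfi.b32, %laneid); the host path is a plain C++ emulation written from my reading of the PTX semantics, so treat it as an assumption and check it against the PTX ISA documentation before relying on it.

```cpp
#include <cstdint>

// HD is a hypothetical helper macro: CUDA qualifiers under nvcc, nothing otherwise.
#if defined(__CUDACC__)
#define HD __host__ __device__
#else
#define HD
#endif

// Extract numBits bits of x starting at position bit, zero-extended.
HD inline uint32_t bfe_u32(uint32_t x, uint32_t bit, uint32_t numBits)
{
#if defined(__CUDA_ARCH__)
    uint32_t ret;
    asm("bfe.u32 %0, %1, %2, %3;" : "=r"(ret) : "r"(x), "r"(bit), "r"(numBits));
    return ret;
#else
    // Host emulation (assumed semantics): only the low 8 bits of pos/len count.
    uint32_t pos = bit & 0xffu, len = numBits & 0xffu;
    if (len == 0 || pos > 31) return 0;
    uint32_t shifted = x >> pos;
    return (len >= 32) ? shifted : (shifted & ((1u << len) - 1u));
#endif
}

// Insert the low numBits bits of field into base at position bit.
HD inline uint32_t bfi_b32(uint32_t field, uint32_t base, uint32_t bit, uint32_t numBits)
{
#if defined(__CUDA_ARCH__)
    uint32_t ret;
    asm("bfi.b32 %0, %1, %2, %3, %4;"
        : "=r"(ret) : "r"(field), "r"(base), "r"(bit), "r"(numBits));
    return ret;
#else
    uint32_t pos = bit & 0xffu, len = numBits & 0xffu;
    if (len == 0 || pos > 31) return base;
    uint32_t mask = ((len >= 32) ? 0xffffffffu : ((1u << len) - 1u)) << pos;
    return (base & ~mask) | ((field << pos) & mask);
#endif
}

#if defined(__CUDACC__)
// Read the PTX special register %laneid (device code only).
__device__ inline uint32_t lane_id()
{
    uint32_t ret;
    asm("mov.u32 %0, %%laneid;" : "=r"(ret));
    return ret;
}
#endif
```

With these, the call from the question would read swap(x, 0x01, bfe_u32(lane_id(), 1, 1) ^ bfe_u32(lane_id(), 0, 1)). For a one-dimensional thread block, threadIdx.x & 31 should give the same lane number without inline PTX.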

I am not aware of __bfe() and __bfi() intrinsics. You may want to file an RFE (“request for enhancement”) with NVIDIA to have them added in future CUDA versions. RFEs can be filed through the bug reporting form; simply prefix the synopsis with “RFE:” to mark it as an enhancement request.

So maybe this is a bit late to answer that question, but I ran into the same problem and here’s how I solved it:

I would be glad to hear feedback if the implementation is wrong, but it works fine for my intended purpose.

Best,

David

The BFE and BFI instructions deal with bit fields (groups of multiple consecutive bits), of which handling just one bit is a special case. If you just want to extract bit ‘i’ from a 32-bit integer ‘a’, why not use

// return i-th bit of a
int extract_bit (unsigned int a, int i)
{
    return (a >> i) & 1;
}

Obviously, that doesn’t include checks for an out-of-bounds value of ‘i’, which you may want to add if you don’t like the behavior that falls out of the above code.
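
If you do want such a check, one possible sketch follows. The name extract_bit_checked and the choice to return 0 for an out-of-range index are assumptions made for this example, not an established convention. It also assumes a 32-bit unsigned int. The check matters because shifting by a negative amount, or by the full width of the type or more, is undefined behavior in C and C++.

```cpp
// Hypothetical checked variant: return the i-th bit of a,
// or 0 when i lies outside 0..31 (assumes 32-bit unsigned int).
inline int extract_bit_checked(unsigned int a, int i)
{
    if (i < 0 || i > 31) {
        return 0;  // policy chosen for this sketch; an assert would also be reasonable
    }
    return static_cast<int>((a >> i) & 1u);
}
```

In device code the same function can be marked __device__; the body needs no changes.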

The question was asking how to use the specific machine instruction bfe, not the trivial task of extracting bit(s) from an integer!!!

You cannot prevent subsequent posters from reusing existing threads. And in all fairness, although you never explained the semantics of the bfe() function used in your original post, it does look like the extraction of single bits to me:

x = swap(x, 0x01, bfe(laneid, 1) ^ bfe(laneid, 0)); // 2
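
To make that reading concrete, here is a small host-side simulation of this one step, assuming bfe(laneid, k) means "bit k of laneid". The helper names bit and simulate_swap_step are invented for the sketch, an array index stands in for the lane number, and x[lane ^ mask] stands in for what __shfl_xor(x, mask) would return. It only models the data movement; it is not device code.

```cpp
#include <array>
#include <cstddef>

// Single-bit extraction, the assumed meaning of bfe(laneid, i) in the call above.
inline int bit(unsigned a, int i) { return static_cast<int>((a >> i) & 1u); }

// Simulate one swap step over N lanes (N a power of two, mask < N).
template <std::size_t N>
std::array<int, N> simulate_swap_step(const std::array<int, N>& x, unsigned mask)
{
    std::array<int, N> out{};
    for (unsigned lane = 0; lane < N; ++lane) {
        int y   = x[lane ^ mask];                 // partner lane's value
        int dir = bit(lane, 1) ^ bit(lane, 0);    // same expression as in the call
        out[lane] = ((x[lane] < y) == dir) ? y : x[lane];
    }
    return out;
}
```

For the four values {3, 1, 2, 4} and mask 1, the step produces {1, 3, 4, 2}: lanes 0 and 1 end up in ascending order and lanes 2 and 3 in descending order, which is the alternating pattern a bitonic sort needs at this stage.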

Worked examples of bfe() and bfi() implementations with different signatures can be found here:

https://github.com/moderngpu/moderngpu/blob/master/src/moderngpu/intrinsics.hxx

Ah, thanks njuffa, your solution is a lot more elegant. I’ll happily steal it. ;) Also thanks for the github link, that’s really interesting.

Sorry Vectorizer, I misread the intention of your post.