A decade ago the BMI2 instruction set extension to the x86-64 architecture added, among other instructions, two instructions for scattering (PDEP = parallel bit deposit) and gathering (PEXT = parallel bit extract) bits from a source operand `src` under the control of a mask `msk` that identifies into which bits data is deposited, or which bits are to be extracted.

There are numerous uses for these instructions, among them the kind of bit-interleaving typical of Morton codes, which were the topic of a recent forum thread. Emulating the full, general functionality of these two instructions in software is neither trivial nor particularly fast, so in general I would recommend finding software approaches tailored to a particular use case. I gave a demonstration of the efficient computation of a 30-bit 3D Morton code in CUDA in that recent thread.
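To illustrate the connection to Morton codes, here is a hypothetical host-side sketch (the names `pdep32` and `morton3d` are my own, not part of the code below): with a PDEP-style primitive, a 30-bit 3D Morton code is just three deposits into every-third-bit masks. The loop-based `pdep32` stand-in merely models the instruction's semantics and makes no claim to efficiency:

```c
#include <stdint.h>

/* loop-based stand-in for the PDEP semantics, for illustration only */
static uint32_t pdep32 (uint32_t src, uint32_t msk)
{
    uint32_t res = 0;
    for (uint32_t bit = 1; msk != 0; msk &= msk - 1, bit += bit) {
        if (src & bit) res |= msk & (0 - msk); /* next src bit -> next mask bit */
    }
    return res;
}

/* interleave the low 10 bits each of x, y, z into a 30-bit Morton code */
static uint32_t morton3d (uint32_t x, uint32_t y, uint32_t z)
{
    return pdep32 (x, 0x09249249u) |  /* x occupies bits 0, 3, 6, ... */
           pdep32 (y, 0x12492492u) |  /* y occupies bits 1, 4, 7, ... */
           pdep32 (z, 0x24924924u);   /* z occupies bits 2, 5, 8, ... */
}
```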

But I got curious how one might go about implementing the full functionality of PDEP and PEXT for situations where such functionality would be expedient (if only in the short term), for example when porting code from a platform that supports these instructions. Emulation approaches generally involve iterating over the 1-bits in the mask. One can scan the mask bitwise, either from lsb to msb or from msb to lsb, or one can scan one contiguous group of 1-bits at a time. Clearly there are trade-offs based on the data: each bit-group step is more expensive than a bitwise step, so the worst case for group-wise scanning is a mask in which every bit group is one bit long (e.g. a mask value of 0x55555555 or 0xAAAAAAAA). Conversely, if a mask comprises only a few fairly wide fields, scanning by bit group can be much faster than scanning bitwise.
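As a host-side illustration of the group-wise scan, the following hypothetical helper (name and packaging my own) peels the lowest contiguous group of 1-bits off a mask in a single step, using the same carry trick as the group-wise code further below: adding the group's lsb to the mask carries through the entire group of contiguous 1-bits and clears it.

```c
#include <stdint.h>

/* remove lowest contiguous group of 1-bits from *msk; return that group */
static uint32_t lowest_bit_group (uint32_t *msk)
{
    uint32_t m = *msk;
    uint32_t lsb = m & (0 - m);     /* isolate rightmost 1-bit */
    uint32_t rest = (m + lsb) & m;  /* adding lsb carries through the group */
    *msk = rest;                    /* mask with its lowest group cleared */
    return m ^ rest;                /* the group itself */
}
```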

I did some performance measurements on a Turing-family GPU and would expect the results to be reasonably representative for all compute capabilities >= 7.0. For *randomly distributed* data I have found that the bit-group approach is about 10% faster than the bitwise approach; but then, masks are hardly ever random in real use cases. In general, bitwise scanning from lsb to msb is at least as fast as, and often faster than, scanning from msb to lsb. The reason is that scanning from msb to lsb involves more expensive operations such as shifts and counting leading zeros.

The code should be largely understandable based on the variable names and comments. If someone is aware of more efficient approaches, I would love to see them, in particular any involving creative (ab)use of floating-point hardware or integer multiplies. I am aware of the parallel-suffix implementations in H. Warren's *Hacker's Delight*; fully unrolled, they would *seem* to land closer to the worst-case execution time of the loop-based code below, based on the sheer number of operations involved. I have not yet confirmed that initial assessment experimentally.
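For comparison, the parallel-suffix approach mentioned above can be sketched in plain C roughly as follows (adapted from the `compress` routine in *Hacker's Delight*; I am reproducing it from memory, so treat it as a sketch rather than a verified transcription). Each of the five rounds moves selected bits right by 1, 2, 4, 8, and 16 positions, which comes to on the order of fifty operations once fully unrolled:

```c
#include <stdint.h>

/* PEXT-equivalent via parallel suffix, after Hacker's Delight, sect. 7-4 */
static uint32_t compress32 (uint32_t x, uint32_t m)
{
    uint32_t mk, mp, mv, t;
    x = x & m;      /* clear irrelevant source bits */
    mk = ~m << 1;   /* we will count 0's to the right */
    for (int i = 0; i < 5; i++) {
        mp = mk ^ (mk << 1);  /* parallel suffix (xor-scan) of mk */
        mp = mp ^ (mp << 2);
        mp = mp ^ (mp << 4);
        mp = mp ^ (mp << 8);
        mp = mp ^ (mp << 16);
        mv = mp & m;                     /* bits to move this round */
        m = (m ^ mv) | (mv >> (1 << i)); /* compress mask */
        t = x & mv;
        x = (x ^ t) | (t >> (1 << i));   /* compress source */
        mk = mk & ~mp;
    }
    return x;
}
```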

```
/*
Copyright 2023, Norbert Juffa
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/
#include <stdint.h>  // uint32_t

#define PROCESS_BITWISE    (1)
#define PROCESS_LSB_TO_MSB (1)

__device__ uint32_t my_pdep_u32 (uint32_t src, uint32_t msk)
{
#if PROCESS_BITWISE
#if PROCESS_LSB_TO_MSB
    uint32_t msklsb, extbit, srcbitmsk, res;
    srcbitmsk = 1;
    res = 0;
    while (msk) {
        msklsb = msk & (0 - msk);       // isolate rightmost 1-bit in mask
        extbit = 0 - (src & srcbitmsk); // isolate bit in src and left-extend it
        res = res | (msklsb & extbit);  // transfer bit to position of mask lsb
        msk = msk ^ msklsb;             // turn off rightmost 1-bit in mask
        srcbitmsk += srcbitmsk;         // advance to next (more significant) bit
    }
    return res;
#else // process mask from msb to lsb
    uint32_t res, tz, shft;
    res = 0;
    if (msk) {
        src = src << (32 - __popc (msk)); // move bits to be deposited to top of src
        tz = __ffs (msk) - 1;             // record mask's trailing zeros for final shift
        while (msk) {
            shft = __clz (msk);                // gap up to next (most significant) mask bit
            msk = msk << shft;                 // normalize mask so its msb is set
            res = (res << shft) | (src >> 31); // widen result by gap; append src msb
            src = src + src;                   // consume most significant source bit
            msk = msk & ~(1u << 31);           // turn off mask bit just processed
        }
        res = res << tz; // align result with position of least significant mask bit
    }
    return res;
#endif // PROCESS_LSB_TO_MSB
#else // process mask by bit group
    uint32_t oldmsk, bitgrp, bitgrplsb, grplen, lsbpos, shift, depcnt, res;
    depcnt = 0;
    res = 0;
    while (msk) {
        oldmsk = msk;
        bitgrplsb = msk & (0 - msk);     // isolate rightmost 1-bit in mask
        msk = (bitgrplsb + msk) & msk;   // clear lowest group of contiguous 1-bits
        bitgrp = msk ^ oldmsk;           // isolate lowest group of contiguous 1-bits
        lsbpos = 31 - __clz (bitgrplsb); // position of group's least significant bit
        grplen = __popc (bitgrp);        // number of bits in group
        shift = lsbpos - depcnt;         // distance to move source bits into place
        depcnt = depcnt + grplen;        // total source bits deposited so far
        res = res | ((src << shift) & bitgrp); // deposit group of source bits
    }
    return res;
#endif // PROCESS_BITWISE
}
__device__ uint32_t my_pext_u32 (uint32_t src, uint32_t msk)
{
#if PROCESS_BITWISE
#if PROCESS_LSB_TO_MSB
    uint32_t msklsb, srcbit, extbit, resbitmsk, res;
    resbitmsk = 1;
    res = 0;
    while (msk) {
        msklsb = msk & (0 - msk);        // isolate rightmost 1-bit in mask
        srcbit = msklsb & src;           // isolate source bit selected by it
        extbit = srcbit ? resbitmsk : 0; // move source bit to next result position
        res = res | extbit;              // transfer bit into result
        msk = msk ^ msklsb;              // turn off rightmost 1-bit in mask
        resbitmsk += resbitmsk;          // advance to next (more significant) bit
    }
    return res;
#else // process mask from msb to lsb
    uint32_t shft, res;
    res = 0;
    while (msk) {
        shft = __clz (msk);             // gap up to next (most significant) mask bit
        src = src << shft;              // move selected source bit to top of src
        res = (res << 1) | (src >> 31); // append selected source bit to result
        msk = msk << shft;              // normalize mask so its msb is set
        msk = msk & ~(1u << 31);        // turn off mask bit just processed
    }
    return res;
#endif // PROCESS_LSB_TO_MSB
#else // process mask by bit group
    uint32_t oldmsk, bitgrp, bitgrplsb, lsbpos, grplen, shift, extcnt, res;
    extcnt = 0;
    res = 0;
    while (msk) {
        oldmsk = msk;
        bitgrplsb = msk & (0 - msk);     // isolate rightmost 1-bit in mask
        msk = (bitgrplsb + msk) & msk;   // clear lowest group of contiguous 1-bits
        bitgrp = msk ^ oldmsk;           // isolate lowest group of contiguous 1-bits
        lsbpos = 31 - __clz (bitgrplsb); // position of group's least significant bit
        grplen = __popc (bitgrp);        // number of bits in group
        shift = lsbpos - extcnt;         // distance to move extracted bits down
        extcnt = extcnt + grplen;        // total source bits extracted so far
        res = res | ((bitgrp & src) >> shift); // extract group of source bits
    }
    return res;
#endif // PROCESS_BITWISE
}
```