NVCC pattern matching for popc


I’ve been playing around with Numba, and its LLVM backend is able to generate popcnt for the x64 target from:

import numba as _nb

@_nb.njit  # compiles to popcntq https://bugs.llvm.org/show_bug.cgi?id=1488
def _popcnt64(x):
    c = 0
    while x:
        x &= x - _nb.u8(1)  # clear the lowest set bit
        c += 1
    return c

but @cuda.jit is unable to generate an equivalent popc instruction. Any plans to add equivalent IR pattern matching optimizations in NVCC?
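
For anyone unfamiliar with the idiom, the loop above clears the lowest set bit on each pass, so the number of iterations equals the number of set bits. Here is a minimal pure-Python sketch (no Numba or GPU needed) that checks the loop against a plain bit count. The names popcount_loop and popcount_reference are hypothetical helpers made up for this illustration, not part of Numba or CUDA:

```python
def popcount_loop(x: int) -> int:
    """Count set bits by clearing the lowest set bit on each iteration."""
    if x < 0:
        raise ValueError("expects a non-negative integer")
    c = 0
    while x:
        x &= x - 1  # clears the lowest set bit
        c += 1
    return c


def popcount_reference(x: int) -> int:
    """Plain bit count via the binary string representation."""
    return bin(x).count("1")


samples = [0, 1, 0b1011, 0xFF, 2**63, 2**64 - 1]
for v in samples:
    assert popcount_loop(v) == popcount_reference(v)
print([popcount_loop(v) for v in samples])  # [0, 1, 3, 8, 1, 64]
```

This is the equivalence a compiler idiom recognizer has to prove before it can replace the loop with a single population-count instruction.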


There are already popcnt intrinsics available in CUDA.

NVIDIA doesn’t maintain LLVM or Numba.

If you’re asking for a compiler idiom to automatically convert a code sequence in CUDA C++ to a popcnt, it’s probably best to file that as an RFE/bug as a registered developer at developer.nvidia.com.

If you’re asking for something specific to LLVM or Numba, it’s probably best to file an issue at the appropriate place for those projects.

The relevant intrinsics are listed in the CUDA documentation, here:


__device__ int __popc ( unsigned int x )
    Count the number of bits that are set to 1 in a 32-bit integer.
__device__ int __popcll ( unsigned long long int x )
    Count the number of bits that are set to 1 in a 64-bit integer.

Access via PTX is described in the PTX manual, here:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#integer-arithmetic-instructions-popc (Integer Arithmetic Instructions: popc)

Population count.

popc.type  d, a;

.type = { .b32, .b64 };


Count the number of one bits in a and place the resulting population count in 32-bit destination register d. Operand a has the instruction type and destination d has type .u32.
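
To make the operand-width rule concrete, here is a rough Python model of the semantics quoted above. It is only a sketch for reasoning about results on the host: popc here is a hypothetical Python function written for this post, not a real binding to the PTX instruction, and it assumes the operand is interpreted as a raw 32-bit or 64-bit pattern:

```python
def popc(a: int, bits: int) -> int:
    """Model of popc.b32 / popc.b64: count one bits in the low `bits` bits of a."""
    if bits not in (32, 64):
        raise ValueError("popc is defined for .b32 and .b64 only")
    mask = (1 << bits) - 1
    # The count is at most 64, so it always fits the 32-bit destination.
    return bin(a & mask).count("1")


assert popc(0xF0F0F0F0, 32) == 16
assert popc(-1, 32) == 32      # two's-complement -1 has all 32 bits set
assert popc(-1, 64) == 64
assert popc(1 << 40, 64) == 1
assert popc(1 << 40, 32) == 0  # bit 40 lies outside a 32-bit operand
```

The 32-bit and 64-bit cases match the operand widths of the __popc and __popcll intrinsics listed earlier, so the same model is a reasonable way to predict their results.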