NVCC pattern matching for popc

steve63x30 · August 14, 2018, 9:50pm

Hi,

I’ve been playing around with Numba, and its LLVM backend is able to generate popcnt for the x64 target from:

import numba as _nb

@_nb.njit  # compiles to popcntq https://bugs.llvm.org/show_bug.cgi?id=1488
def _popcnt64(x):
    c = 0
    while x:
        x &= x - _nb.u8(1)  # clear the lowest set bit
        c += 1
    return c

but @cuda.jit is unable to generate an equivalent popc instruction. Any plans to add equivalent IR pattern matching optimizations in NVCC?

Thanks,
Steve
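
For readers unfamiliar with the idiom, the loop in the post above is the classic clear-lowest-set-bit (Kernighan) population count. The following is a minimal plain-Python sketch, not the poster's code: it drops Numba so it runs anywhere, and the names `MASK64`, `popcnt64_loop` and `popcnt64_direct` are illustrative assumptions. It only checks that the loop and a direct bit count agree on 64-bit values.

```python
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit wraparound in plain Python


def popcnt64_loop(x):
    """Kernighan's loop: each iteration clears the lowest set bit."""
    x &= MASK64
    c = 0
    while x:
        x &= x - 1
        c += 1
    return c


def popcnt64_direct(x):
    """Reference result: count the '1' digits of the binary representation."""
    return bin(x & MASK64).count("1")


samples = [0, 1, 0b1011, 2**63, MASK64]
counts = [popcnt64_loop(v) for v in samples]
```

The loop runs once per set bit, which is why a compiler that recognises it can replace the whole thing with a single population-count instruction.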

Robert_Crovella · August 14, 2018, 10:06pm

There are already popcnt intrinsics available in CUDA.

NVIDIA doesn’t maintain LLVM or Numba.

If you’re asking for a compiler idiom to automatically convert a code sequence in CUDA C++ to a popcnt, it’s probably best to file that as an RFE/bug as a registered developer at developer.nvidia.com.

If you’re asking for something specific to LLVM or Numba, it’s probably best to file an issue at the appropriate place for those.

njuffa · August 14, 2018, 11:22pm

The relevant intrinsics are listed in the CUDA documentation, here:

https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__INT.html

__device__  int __popc ( unsigned int  x )
    Count the number of bits that are set to 1 in a 32 bit integer. 
__device__  int __popcll ( unsigned long long int x )
    Count the number of bits that are set to 1 in a 64 bit integer.
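
From the Numba side, a population count can also be requested explicitly instead of relying on the compiler to recognise the loop. The sketch below assumes Numba's `cuda.popc` intrinsic, NumPy, and a CUDA-capable GPU; the helper names (`gpu_available`, `popcount_reference`, `popcount_gpu`) are hypothetical, and the GPU path is skipped when no device is found. Whether the generated PTX actually contains `popc` is worth confirming by inspecting the compiled kernel.

```python
def gpu_available():
    """Return True only if Numba is installed and a CUDA device is usable."""
    try:
        from numba import cuda
    except ImportError:
        return False
    return cuda.is_available()


def popcount_reference(values):
    """CPU reference: number of set bits in each non-negative integer."""
    return [bin(v).count("1") for v in values]


def popcount_gpu(values):
    """Count set bits on the GPU with Numba's cuda.popc intrinsic."""
    import numpy as np
    from numba import cuda

    @cuda.jit
    def kernel(x, out):
        i = cuda.grid(1)
        if i < x.size:  # guard threads past the end of the array
            out[i] = cuda.popc(x[i])

    x = np.array(values, dtype=np.uint64)
    out = np.zeros(x.size, dtype=np.uint32)
    threads = 128
    blocks = (x.size + threads - 1) // threads
    kernel[blocks, threads](x, out)
    return out.tolist()


data = [0, 1, 0xFF, 0xFFFFFFFFFFFFFFFF]
expected = popcount_reference(data)

if gpu_available():
    assert popcount_gpu(data) == expected
```

Calling the intrinsic directly sidesteps the question of idiom recognition entirely, at the cost of writing device-specific code.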

Access via PTX is described in the PTX manual, here:

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#integer-arithmetic-instructions-popc

9.7.1.14. Integer Arithmetic Instructions: popc
popc

Population count.
Syntax

popc.type  d, a;

.type = { .b32, .b64 };

Description

Count the number of one bits in a and place the resulting population count in 32-bit destination register d. Operand a has the instruction type and destination d has type .u32.
