Low utilization of warp per scheduler for bit pack operation (Morton Code)

njuffa · October 23, 2023, 10:45am

To illustrate what I had in mind, here is some code I quickly hacked together on an x86 platform, where it works well. The basic building block is a function that spreads the least significant 10 bits of an integer a into a result r as follows:

for i in [0,9]:
r_3i+0 = a_i
r_3i+1 = 0
r_3i+2 = 0

A reference function would look like this:

unsigned int spread_ten_bits_by_3_ref (unsigned int a)
{
    unsigned int r = 0;
    r |= (((a >> 0) & 1) <<  0);
    r |= (((a >> 1) & 1) <<  3);
    r |= (((a >> 2) & 1) <<  6);
    r |= (((a >> 3) & 1) <<  9);
    r |= (((a >> 4) & 1) << 12);
    r |= (((a >> 5) & 1) << 15);
    r |= (((a >> 6) & 1) << 18);
    r |= (((a >> 7) & 1) << 21);
    r |= (((a >> 8) & 1) << 24);
    r |= (((a >> 9) & 1) << 27);
    return r;
}

A version optimized for x86 (11 instructions, 2-way instruction level parallelism) might look like this:

unsigned int spread_ten_bits_by_3 (unsigned int a)
{
    return (((((a & 0x333) * 0x50005) | ((a & 0x333) * 0x00500)) & 0x09009009) |
            (((a & 0x0cc) * 0x05050) & 0x00240240));

}

We can use this as a building block for a 30-bit 3D Morton code, based on the ten least significant bits of each input (with other bits assumed 0):

unsigned int morton3d_ten_bits (unsigned int x, unsigned int y, unsigned int z)
{
    return ((spread_ten_bits_by_3 (x) << 0) |
            (spread_ten_bits_by_3 (y) << 1) |
            (spread_ten_bits_by_3 (z) << 2));    
}

We can also produce a 60-bit 3D Morton code based on the 20 least significant bits of each input. If we expect many small inputs, we can provide an early-out:

unsigned long long int morton3d_twenty_bits (unsigned int x, 
                                             unsigned int y, 
                                             unsigned int z)
{
    unsigned int lo, hi = 0;
    int bits_needed = 32 - clz (x | y | z);
    lo = morton3d_ten_bits (x & 0x3ff, y & 0x3ff, z & 0x3ff);
    if (bits_needed > 10) {
        hi = morton3d_ten_bits ((x >> 10) & 0x3ff, (y >> 10) & 0x3ff, 
                                (z >> 10) & 0x3ff); 
    }
    return ((unsigned long long int)hi << 30) | lo;
}

This gives us some idea how many instructions we might expect for a 60-bit Morton code: around a hundred depending on processor architecture. This explains why you found that your code is mostly bound by compute throughput. The optimal decomposition of a 3D Morton code into building blocks likely depends on the GPU architecture. There are trade-offs between the integer operations used, potential use of small tables, etc.

Topic		Replies	Views
About 64 bit number CUDA Programming and Performance	8	52	April 25, 2025
What does LOP3.LUT mean? How is it executed? CUDA Programming and Performance	22	4372	February 8, 2024
32-bit number multiplication CUDA Programming and Performance	23	20501	July 1, 2012
error in modulo operation CUDA Programming and Performance	12	16101	September 20, 2009
long-integer multiplication: mul.wide.u64 and mul.wide.u128 CUDA Programming and Performance	31	7707	January 2, 2018
matrix multiply reduction CUDA Programming and Performance	41	35555	January 15, 2011
Yet another performance question CUDA Programming and Performance	16	4229	February 12, 2009
Why my ldmatrix PTX instruction is wrong? CUDA Programming and Performance cuda	10	766	March 30, 2024
Reduction CUDA Programming and Performance	19	3469	May 16, 2012
High ALU utilization, how to improve CUDA Programming and Performance	6	1803	November 27, 2014

Low utilization of warp per scheduler for bit pack operation (Morton Code)

Related topics