Low utilization of warp per scheduler for bit pack operation (Morton Code)

To illustrate what I had in mind, here is some code I quickly hacked together on an x86 platform, where it works well. The basic building block is a function that spreads the least significant 10 bits of an integer a into a result r as follows:

for i in [0,9]:
r3i+0 = ai
r3i+1 = 0
r3i+2 = 0

A reference function would look like this:

unsigned int spread_ten_bits_by_3_ref (unsigned int a)
{
    unsigned int r = 0;
    r |= (((a >> 0) & 1) <<  0);
    r |= (((a >> 1) & 1) <<  3);
    r |= (((a >> 2) & 1) <<  6);
    r |= (((a >> 3) & 1) <<  9);
    r |= (((a >> 4) & 1) << 12);
    r |= (((a >> 5) & 1) << 15);
    r |= (((a >> 6) & 1) << 18);
    r |= (((a >> 7) & 1) << 21);
    r |= (((a >> 8) & 1) << 24);
    r |= (((a >> 9) & 1) << 27);
    return r;
}

A version optimized for x86 (11 instructions, 2-way instruction level parallelism) might look like this:

unsigned int spread_ten_bits_by_3 (unsigned int a)
{
    return (((((a & 0x333) * 0x50005) | ((a & 0x333) * 0x00500)) & 0x09009009) |
            (((a & 0x0cc) * 0x05050) & 0x00240240));

}

We can use this as a building block for a 30-bit 3D Morton code, based on the ten least significant bits of each input (with other bits assumed 0):

unsigned int morton3d_ten_bits (unsigned int x, unsigned int y, unsigned int z)
{
    return ((spread_ten_bits_by_3 (x) << 0) |
            (spread_ten_bits_by_3 (y) << 1) |
            (spread_ten_bits_by_3 (z) << 2));    
}

We can also produce a 60-bit 3D Morton code based on the 20 least significant bits of each input. If we expect many small inputs, we can provide an early-out:

unsigned long long int morton3d_twenty_bits (unsigned int x, 
                                             unsigned int y, 
                                             unsigned int z)
{
    unsigned int lo, hi = 0;
    int bits_needed = 32 - clz (x | y | z);
    lo = morton3d_ten_bits (x & 0x3ff, y & 0x3ff, z & 0x3ff);
    if (bits_needed > 10) {
        hi = morton3d_ten_bits ((x >> 10) & 0x3ff, (y >> 10) & 0x3ff, 
                                (z >> 10) & 0x3ff); 
    }
    return ((unsigned long long int)hi << 30) | lo;
}

This gives us some idea how many instructions we might expect for a 60-bit Morton code: around a hundred depending on processor architecture. This explains why you found that your code is mostly bound by compute throughput. The optimal decomposition of a 3D Morton code into building blocks likely depends on the GPU architecture. There are trade-offs between the integer operations used, potential use of small tables, etc.