To illustrate what I had in mind, here is some code I quickly hacked together on an x86 platform, where it works well. The basic building block is a function that spreads the least significant 10 bits of an integer a
into a result r
as follows:
for i in [0,9]:
r3i+0 = ai
r3i+1 = 0
r3i+2 = 0
A reference function would look like this:
unsigned int spread_ten_bits_by_3_ref (unsigned int a)
{
unsigned int r = 0;
r |= (((a >> 0) & 1) << 0);
r |= (((a >> 1) & 1) << 3);
r |= (((a >> 2) & 1) << 6);
r |= (((a >> 3) & 1) << 9);
r |= (((a >> 4) & 1) << 12);
r |= (((a >> 5) & 1) << 15);
r |= (((a >> 6) & 1) << 18);
r |= (((a >> 7) & 1) << 21);
r |= (((a >> 8) & 1) << 24);
r |= (((a >> 9) & 1) << 27);
return r;
}
A version optimized for x86 (11 instructions, 2-way instruction level parallelism) might look like this:
unsigned int spread_ten_bits_by_3 (unsigned int a)
{
return (((((a & 0x333) * 0x50005) | ((a & 0x333) * 0x00500)) & 0x09009009) |
(((a & 0x0cc) * 0x05050) & 0x00240240));
}
We can use this as a building block for a 30-bit 3D Morton code, based on the ten least significant bits of each input (with other bits assumed 0):
unsigned int morton3d_ten_bits (unsigned int x, unsigned int y, unsigned int z)
{
return ((spread_ten_bits_by_3 (x) << 0) |
(spread_ten_bits_by_3 (y) << 1) |
(spread_ten_bits_by_3 (z) << 2));
}
We can also produce a 60-bit 3D Morton code based on the 20 least significant bits of each input. If we expect many small inputs, we can provide an early-out:
unsigned long long int morton3d_twenty_bits (unsigned int x,
unsigned int y,
unsigned int z)
{
unsigned int lo, hi = 0;
int bits_needed = 32 - clz (x | y | z);
lo = morton3d_ten_bits (x & 0x3ff, y & 0x3ff, z & 0x3ff);
if (bits_needed > 10) {
hi = morton3d_ten_bits ((x >> 10) & 0x3ff, (y >> 10) & 0x3ff,
(z >> 10) & 0x3ff);
}
return ((unsigned long long int)hi << 30) | lo;
}
This gives us some idea how many instructions we might expect for a 60-bit Morton code: around a hundred depending on processor architecture. This explains why you found that your code is mostly bound by compute throughput. The optimal decomposition of a 3D Morton code into building blocks likely depends on the GPU architecture. There are trade-offs between the integer operations used, potential use of small tables, etc.