About 64-bit numbers

Will operations on 64-bit numbers cost significantly more time than on 32-bit numbers? For example, some addresses need a 64-bit register to store. Will this actually use 2 registers? Or if I need to use a hashmap on the GPU, will a 64-bit key cost a lot more time than a 32-bit key? (Maybe a 64-bit cmp costs 2 cycles while a 32-bit cmp costs 1 cycle?)

GPUs have a 32-bit architecture with 64-bit addressing capability. All general-purpose registers are 32 bits wide. Each operand of a 64-bit data type therefore requires two registers. Operations on 64-bit integers are emulated. Simple operations like AND, OR, XOR, ADD, SUB, CMP require two 32-bit integer operations, so their effective throughput is half that of the corresponding 32-bit integer operations.
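To make the emulation concrete, here is a minimal sketch (plain C, written as an illustration rather than taken from the compiler) of how a 64-bit addition decomposes into 32-bit halves with an explicit carry, mirroring the add-with-carry pair in the generated SASS:

uint64_t add64_emulated (uint64_t a, uint64_t b)
{
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);
    uint32_t lo = alo + blo;
    uint32_t carry = lo < alo;        // carry out of the low word
    uint32_t hi = ahi + bhi + carry;  // carry folded into the high word
    return ((uint64_t)hi << 32) | lo;
}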

For more complex integer operations you may need to inspect the generated SASS (machine code) to assess exactly how expensive the emulation of 64-bit operations is. As a first-order estimate, a 64-bit integer MUL may require four 32-bit integer IMUL / IMAD instructions, and 64-bit integer DIV or MOD may require five to six times as many 32-bit integer operations as the corresponding 32-bit operations, which are themselves emulated (as there is no division hardware of any kind in GPUs).
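As an illustration of where the roughly 4x estimate for 64-bit multiplies comes from, here is a hedged sketch (my own, not generated code) of the low 64 bits of a 64x64-bit product built from 32-bit partial products; the widening 32x32->64 multiply plus the two cross-term multiplies account for the four IMUL / IMAD:

uint64_t mul64_lo_emulated (uint64_t a, uint64_t b)
{
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);
    uint64_t lo  = (uint64_t)alo * blo;    // full 64-bit low partial product
    uint32_t mid = ahi * blo + alo * bhi;  // cross terms; only their low 32 bits matter
    return lo + ((uint64_t)mid << 32);
}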

If I recall correctly (check the Programming Guide), all floating-point operations involving 64-bit operands are handled by the double-precision units. This includes conversions between 64-bit integers and float. So for these operations a significant throughput reduction can be expected on non-datacenter GPUs compared to conversions between 32-bit integers and float.

But if I need to turn three-dimensional coordinates into a Z code, in most situations I need 64 bits to store the Z code, which will cost a lot of time. Is there any way to handle this situation?

Just storing it doubles the transfer time and the needed space. Whether that is significant or trivial depends on your allocation.

If you cannot shorten it to 32 bits, then there is probably no way around it.

You can in some cases distribute a 64-bit number across two threads. This would halve the number of registers needed per thread, but complicates handling a bit.
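A minimal sketch of what that could look like for key comparison (the lane pairing, the function name, and the assumption that the two 32-bit halves live in adjacent lanes are mine, not from this thread):

__device__ bool keys_equal_split (uint32_t my_half, uint32_t other_half)
{
    // Each lane holds one 32-bit half of the 64-bit key and compares it locally.
    int half_equal = (my_half == other_half);
    // XOR-shuffle with the partner lane: the pair agrees iff both halves matched.
    int partner_equal = __shfl_xor_sync (0xffffffffu, half_equal, 1);
    return half_equal && partner_equal;
}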

I assume by “Z code” you mean a Morton code.

My advice would be to just go ahead and implement that. If you benchmark before and after, you will know the exact cost (which will likely be minor, but then I do not know your code).

I always advise against trying to make design decisions based on thought experiments. GPUs are complex devices, and trying to predict the outcome of changes by paper analysis is brittle. So it is best to make assessments using actual experiments, especially where they are not particularly costly to conduct, such as the introduction of a 64-bit 3D Morton code.

As a rough estimate, computing a 64-bit 3D Morton code from three 21-bit components should take about 72 integer instructions, while computing a 32-bit 3D Morton code from three 10-bit components should take on the order of 23 integer instructions. So the cost of computing a 64-bit Morton code is about 3x that of the 32-bit Morton code.
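For reference, a conventional shift-and-mask formulation of the 32-bit (10-bit component) case is sketched below; this is a standard technique shown only to make the estimate concrete, and it is not the variant whose SASS is listed further down:

uint32_t part1by2 (uint32_t x)
{
    x &= 0x000003ff;                    // keep the 10 low-order bits
    x = (x | (x << 16)) & 0xff0000ff;
    x = (x | (x <<  8)) & 0x0300f00f;
    x = (x | (x <<  4)) & 0x030c30c3;
    x = (x | (x <<  2)) & 0x09249249;   // bit i now sits at position 3*i
    return x;
}

uint32_t morton3d_shift_mask (uint32_t x, uint32_t y, uint32_t z)
{
    return part1by2 (x) | (part1by2 (y) << 1) | (part1by2 (z) << 2);
}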

If computation time (instead of storage) is the issue, and you have to compute lots of Z/Morton codes, you could consider using the Tensor Cores for the bit-interleaving variant. (I believe some steps of the problem could be expressed as 1-bit or 8-bit matrix operations, and the Tensor Cores could be used advantageously in this case, though I have not thought all steps fully through.)

In ordinary Morton code generation, bit-spreading for the purpose of interleaving often utilizes integer multiplies, which are fast on modern GPUs. Would tensor cores provide meaningful acceleration compared to that? I am doubtful but could easily be convinced by a worked example :-)
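/* Deposit the 10 low-order bits of x so that source bit i lands at result bit
   3*i+s. The multiplies replicate groups of source bits to several shifted
   positions at once; the result masks then keep the surviving copies.
*/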

uint32_t deposit_every_third_shifted (uint32_t x, uint32_t s)
{
    const uint32_t sm1 = 0x333ul;           // source masks
    const uint32_t sm2 = 0x0ccul;
    const uint32_t ml1 = 0x50005ul << s;    // multipliers
    const uint32_t ml2 = 0x00500ul << s;
    const uint32_t ml3 = 0x05050ul << s;
    const uint32_t rm1 = 0x09009009ul << s; // result masks
    const uint32_t rm2 = 0x00240240ul << s; 
    uint32_t t = x & sm1;
    return ((((t * ml1) | (t * ml2)) & rm1) | (((x & sm2) * ml3) & rm2));
}

/* For x, y, z in [0, 1023], compute 30-bit 3D Morton code by interleaving bits.
   For i in [0, 9]: r<3*i+0> = x<i>, r<3*i+1> = y<i>, r<3*i+2> = z<i>.
*/
uint32_t encode_morton3d_10_bits (uint32_t x, uint32_t y, uint32_t z)
{
    return (deposit_every_third_shifted (x, 0) +
            deposit_every_third_shifted (y, 1) +
            deposit_every_third_shifted (z, 2));
}

With CUDA 12.8.1, when compiled for sm_89, the following SASS results:

encode_morton3d_10_bits(unsigned int, unsigned int, unsigned int):
 LOP3.LUT R0, R4, 0x333, RZ, 0xc0, !PT 
 LOP3.LUT R7, R5.reuse, 0x333, RZ, 0xc0, !PT 
 LOP3.LUT R9, R6, 0x333, RZ, 0xc0, !PT 
 IMAD R3, R0, 0x50005, RZ 
 LOP3.LUT R4, R4, 0xcc, RZ, 0xc0, !PT 
 IMAD R0, R0, 0x500, RZ 
 LOP3.LUT R5, R5, 0xcc, RZ, 0xc0, !PT 
 IMAD R8, R7, 0xa00, RZ 
 LOP3.LUT R6, R6, 0xcc, RZ, 0xc0, !PT 
 IMAD R10, R9, 0x1400, RZ 
 LOP3.LUT R0, R3, 0x9009009, R0, 0xc8, !PT 
 IMAD R3, R7, 0xa000a, RZ 
 IMAD R7, R9, 0x140014, RZ 
 IMAD R5, R5, 0xa0a0, RZ 
 LOP3.LUT R8, R3, 0x12012012, R8, 0xc8, !PT 
 IMAD R3, R4, 0x5050, RZ 
 LOP3.LUT R7, R7, 0x24024024, R10, 0xc8, !PT 
 IMAD R6, R6, 0x14140, RZ 
 LOP3.LUT R4, R0, 0x240240, R3, 0xf8, !PT 
 LOP3.LUT R3, R8, 0x480480, R5, 0xf8, !PT 
 LOP3.LUT R0, R7, 0x900900, R6, 0xf8, !PT 
 LOP3.LUT R4, R0, R3, R4, 0xfe, !PT 
 RET.ABS.NODEC R20 0x0 

If we can make do with a 60-bit 3D Morton code, where each individual input is restricted to values less than 2**20, we can compute it in 50 instructions (for sm_70 <= CC <= sm_89) using the following:

/* For x, y, z in [0, 2**20-1], compute 60-bit 3D Morton code by interleaving 
   bits. For i in [0, 19]: r<3*i+0> = x<i>, r<3*i+1> = y<i>, r<3*i+2> = z<i>.
*/
uint64_t encode_morton3d_20_bits (uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t lo = encode_morton3d_10_bits (x & 0x3ff, y & 0x3ff, z & 0x3ff);
    uint32_t hi = encode_morton3d_10_bits (x >> 10, y >> 10, z >> 10);
    return ((uint64_t)hi << 30) | lo;
}

The most efficient way to compute a full 63-bit 3D Morton code is to just handle bit 20 of the inputs separately. The resulting SASS comprises 58 instructions for CC >= sm_70.

/* For x, y, z in [0, 2**21-1], compute 63-bit 3D Morton code by interleaving 
   bits. For i in [0, 20]: r<3*i+0> = x<i>, r<3*i+1> = y<i>, r<3*i+2> = z<i>.
*/
uint64_t encode_morton3d_21_bits (uint32_t x, uint32_t y, uint32_t z)
{
    uint64_t r;
    uint32_t hi, lo;
    lo = encode_morton3d_10_bits (x & 0x3ff, y & 0x3ff, z & 0x3ff);
    hi = encode_morton3d_10_bits (x >> 10, y >> 10, z >> 10);
    r = ((uint64_t)hi << 30) | lo;
    r |= (uint64_t)(((x >> 20) & 1) | ((y >> 19) & 2) | ((z >> 18) & 4)) << 60;
    return r;
}

Hi Norbert,
using Tensor Cores for it (and being faster) really does not seem too promising after some thought.

But how about look-up tables?

We can create 32 copies of the LUTs to prevent bank conflicts.

The best addressable part of the input numbers would be 8 bits, so we can access them directly without masks.

We can have tables that spread the 8 bits into zero-padded 32 bits, perhaps 3 tables for an additional shift by 1 or 2 positions.

32 copies for lanes * 3 shifts * 256 input values * 4 bytes = 96 KiB. That would work on new architectures. Otherwise, without the shift variants: 32 KiB.

We save the logic operations and multiplications, but gain indexing operations. We can convert 8 bits at once and then use a 3-way add.

Shared memory bandwidth would probably be the limiting factor in the end.

For combining 3*16 bits into 48 bits, we need 6 accesses, or probably as many cycles, shared between the 4 SM partitions.

I would guess it is faster (even for 3*24 bits) than 50 cycles.

[Perhaps 10-bit indices for longer numbers, but no stored shifts => 128 KiB LUT size. Too large except for datacenter GPUs since CC 8.0.]
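A minimal sketch of the no-shift 32 KiB variant (kernel name, 16-bit components, and the lane-last LUT layout are my assumptions, not a measured implementation); the shifts by 1 and 2 are applied after the lookup, and placing the lane index last keeps each lane in its own shared-memory bank:

#include <stdint.h>

__global__ void morton3d_lut_kernel (const uint3 *in, uint64_t *out, int n)
{
    // 256 entries x 32 lanes x 4 bytes = 32 KiB; lane index last, so lane i
    // always reads bank i regardless of the 8-bit chunk being looked up.
    __shared__ uint32_t lut[256][32];

    // Build the LUT cooperatively: spread the 8 bits of v to positions 0, 3, ..., 21.
    for (int v = threadIdx.x; v < 256; v += blockDim.x) {
        uint32_t spread = 0;
        for (int i = 0; i < 8; i++) {
            spread |= ((v >> i) & 1u) << (3 * i);
        }
        for (int copy = 0; copy < 32; copy++) {
            lut[v][copy] = spread;
        }
    }
    __syncthreads();

    int lane = threadIdx.x & 31;
    for (int k = blockIdx.x * blockDim.x + threadIdx.x; k < n; k += gridDim.x * blockDim.x) {
        uint32_t x = in[k].x, y = in[k].y, z = in[k].z;  // 16-bit components assumed
        uint64_t r = 0;
        for (int b = 0; b < 2; b++) {  // two 8-bit chunks per component -> 48-bit code
            // No bit positions overlap, so the 3-way add behaves like an OR
            // and can map to a single IADD3.
            uint32_t part = lut[(x >> (8 * b)) & 0xff][lane] +
                            (lut[(y >> (8 * b)) & 0xff][lane] << 1) +
                            (lut[(z >> (8 * b)) & 0xff][lane] << 2);
            r |= (uint64_t)part << (24 * b);
        }
        out[k] = r;
    }
}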

In my experience, LUTs are rarely the right answer on GPUs, where FLOPS and IOPS are “too cheap to meter”. But I have been proven wrong on that basic guidance at least once that I recall (some compression scheme prescribed by a video standard or some such), so by all means give it a try.