Where is the 32 bit int and 64 bit int address calculation done on CUDA hardware?

Hi,

I would like to know where exactly the 32-bit and 64-bit integer address calculation is done on CUDA hardware. Is it in the LD/ST units or in the CUDA ALU cores? I’m trying to understand how address calculation is handled efficiently.

Thanks

The short answer is “both”. The details differ slightly based on GPU architecture, but the LD/ST instructions basically support a “register + offset” addressing mode. Everything beyond that is accomplished by ALU computation. The compiler applies strength reduction where possible (in loops in particular) and where it makes sense from a performance perspective.
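
To make the strength-reduction point concrete, here is a rough sketch in plain host-side C++ (not actual compiler output, and not GPU code). The function names `sum_indexed` and `sum_strength_reduced` are made up for illustration. The first version recomputes the element index with a multiply on every iteration. The second is roughly what strength reduction turns it into: a running offset that is advanced with a single add, which leaves only a simple base-plus-offset access for the load itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive form: each iteration recomputes the element index as i * stride,
// so the address arithmetic costs a multiply (plus the implicit scaling
// by sizeof(float)) before every load.
float sum_indexed(const float* base, std::size_t n, std::size_t stride) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += base[i * stride];
    }
    return sum;
}

// Strength-reduced form: the multiply is replaced by a running offset
// that is advanced with one add per iteration. The load then only needs
// "base + offset", which is the kind of addressing a load instruction
// can typically handle on its own.
float sum_strength_reduced(const float* base, std::size_t n, std::size_t stride) {
    float sum = 0.0f;
    std::size_t offset = 0;  // equals i * stride at the top of each iteration
    for (std::size_t i = 0; i < n; ++i) {
        sum += base[offset];
        offset += stride;
    }
    return sum;
}
```

Both functions read the same elements and return the same result. Only the amount of integer arithmetic spent on addressing differs, and that arithmetic is the part that would run on the ALUs rather than in the LD/ST path.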

The approach to address computation on the GPU is quite similar to that on many CPU architectures. x86 is a bit “special” in that it supports (limited) scaling as part of the address computation of load and store instructions.

What motivated the question? You can see the full details of the address computations by inspecting the machine code (SASS) for any binary, by disassembling it with cuobjdump --dump-sass.