Why are 64 bit integer operations broken into 2 32 bit ops?

Hi All,

I have been running a small micro benchmark to see the 64 bit integer arithmetic throughput on Fermi. (GTX480)

// Dummy code

typdef unsigned long long int uint64;

#define PERIOD 1317624576693539401L

__global__ myKernel() {

uint64 n = 0;

n += PERIOD; // Dis-assembly of this reported below

}

On Dis-assembly this increment code becomes

add b32 $c0 $r20 $r20 0x49249249 

add b32 $r21 $r21 0x12492492 $c0

// Hex of PERIOD is 0x1249249249249249

When I looked at the PTX ISA 2.2 (Section 8.7.1) the add instruction can operate on u64, s64 etc.

So, I don’t understand why the compiler does not use it. Do I need to use any special compiler flag for this?

Thanks

embeddeduser

PTX is not the native assembler language of the GPU (which is not documented), but gets translated again. As the native instruction set does not support 64 bit operations (apart from [font=“Courier New”]double[/font] arithmetic), it makes little difference whether the split into two 32 bit operations is done before or after the PTX stage.

Tera’s correct.

But a side question:

Are you sure that’s the correct dissassembly? The second add should be an addc to properly account for carry from the first word’s sum. The first add should also be an add.cc to properly generate the carry.

I assume this is what $c0 indicates? In the first instruction it’s on the left-hand (output) side, i.e. carry out. In the second instructions it’s on the right-hand (input) side, i.e. carry in. I am not familiar with this output format, but it seems like a reasonable assumption.

add b32 $c0 $r20 $r20 0x49249249
add b32 $r21 $r21 0x12492492 $c0

You are right, the $co indicates the carry out from the previous add. Also, this dis-assembly is not the PTX o/p but form the dis-assembly tool “nv50dis” → http://0x04.net/cgit/index.cgi/nv50dis/ (I hope the link works)

@terra: I was not aware that the native instruction set does not support 64bit arithmetic? Could you send me some links on this?

SPWorley is correct that you haven’t actually posted PTX, which explains why the instruction has been split already.

As I said, the native instruction set is not documented, and you already posted the only link about it I know (i.e., the source of nv50dis contains all publicly known information about the instruction set architecture).

Some random bits are on the according pscnv wiki page, but that’s about it.