I’m trying to optimize this kernel to get down to the magic 16:1 ALU:mem cycle ratio. I thought I was being clever by using prmt (__byte_perm in CUDA) to gather from registers rather than shared memory. However, I’ve noticed that in both the PTX and the ISA, nvcc generates an AND mask before each prmt whenever the selector argument (the 4th operand in PTX, the 3rd in the ISA) is a register rather than an immediate. The mask is 0x7777 (printed as 30583 decimal in the PTX):
and.b32 %r515, %r514, 30583;
prmt.b32 %r516, %r505, %r513, %r515;
/*06f8*/ /*0xdca01c036800c1dd*/ LOP.AND R0, R10, 0x7777;
/*0710*/ /*0x00c45c04241a0000*/ PRMT R17, R12, R0, R13;
I’m guessing prmt is a half-throughput instruction, so each of these unnecessary ANDs makes the permute roughly 50% more expensive. I also think this is a compiler bug: the PTX doc defines the most significant bit of each selector nibble as a sign-extend flag (when set, the sign bit of the selected byte is replicated across the result byte). I’m not using this feature, but the PTX manual clearly gives that bit a role, and nvcc is clearing it.
I’ve replaced all the __byte_perm calls in my kernel with a function that issues prmt through inline assembly, and the masks go away.
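For anyone who wants to try the same thing, here is a rough sketch of what such a wrapper could look like. The name prmt_raw is made up for illustration and this is not my exact kernel code. The device path just emits prmt.b32 with the selector passed through untouched. The host fallback is my reading of the default-mode semantics in the PTX doc (low 3 bits of each nibble pick a byte from {b, a}, high bit replicates that byte's sign), so treat it as an assumption rather than a reference implementation.

```cpp
#include <cstdint>

// Compiles as plain C++ on the host and as CUDA under nvcc.
#if defined(__CUDACC__)
#define PRMT_HD __host__ __device__
#else
#define PRMT_HD
#endif

// Hypothetical wrapper: passes the selector straight to prmt, no masking.
PRMT_HD inline uint32_t prmt_raw(uint32_t a, uint32_t b, uint32_t sel)
{
#if defined(__CUDA_ARCH__)
    // Device path: one prmt.b32, selector supplied in a register.
    uint32_t d;
    asm("prmt.b32 %0, %1, %2, %3;" : "=r"(d) : "r"(a), "r"(b), "r"(sel));
    return d;
#else
    // Host path: software model of default-mode prmt (assumed semantics).
    // Bytes 0..3 come from a, bytes 4..7 from b.
    const uint64_t pool = (static_cast<uint64_t>(b) << 32) | a;
    uint32_t d = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t nib = (sel >> (4 * i)) & 0xFu;   // selector nibble i
        uint32_t byte = static_cast<uint32_t>(pool >> (8 * (nib & 7u))) & 0xFFu;
        if (nib & 8u) {
            // High bit set: replicate the sign bit of the selected byte.
            byte = (byte & 0x80u) ? 0xFFu : 0x00u;
        }
        d |= byte << (8 * i);
    }
    return d;
#endif
}
```

With a = 0x44332211 and b = 0x88776655, a selector of 0x321F under this model gives 0x443322FF, while the masked selector 0x321F & 0x7777 = 0x3217 gives 0x44332288. So the AND is only harmless if you never set the high bit of a nibble, which is the case in my kernel.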