Coaxing a [register+immediate] offset store

I was a little surprised that this code didn’t generate a register+immediate store:

shared.keys[row * WARP_SIZE + warp_lane_idx()].b64 = key->b64;

Where ‘row’ is a compile-time constant.

A simple rearrangement produces the desired code and saves one register and at least one instruction per store:

(shared.keys + row * WARP_SIZE)[warp_lane_idx()].b64 = key->b64;

Here is what the SASS looks like for each case; there are 7 back-to-back stores to shared memory.

If you were expecting the address operand to be in register+immediate form and you’re not seeing it in the SASS then you might try tweaking your array access expression. Again, I think the first expression should’ve given me what I wanted.

I haven’t seen this before but this is new code. I’ll double-check ld/st.global and ld.shared when I get the chance.

Platform: CUDA 7.5 RC.

What kind of performance difference are you seeing between the two variants?

Generally speaking, my experience with CUDA compiler code generation is that there are so many layers of transformation between the source code and the final internal representation from which SASS is generated that most attempts to steer SASS generation in a particular direction amount to an accumulation of implementation artifacts: a brittle construct often destroyed by the next release.

One particular battle I used to wage was the generation of predicated instructions vs. select-type instructions, but by this time I have pretty much given up. I also noticed that the unexpected combinations of the two idioms the CUDA compiler produces often do deliver better performance; apparently it knows a few things I do not. There are many instances of “weird” compiler idioms (such as the use of MULHI x, 2 for sign extension) that seem to make little sense to me, but it’s not clear whether these are bugs or features. Just the other day I noticed (x >> s) + y translated into MADHI x, y, (1<<(32-s)), even on sm_5x where the MADHI needs to be emulated by a three-instruction sequence. One would think a funnel shift followed by an add would be faster?

I didn’t expect to see any performance difference since the kernel already runs in ~8-9 usecs and varies run-to-run by 1 usec. Short answer: I can’t see any difference in performance other than the extra register.

I was more concerned about register pressure and thought the codegen was curious since the kernel’s load/store idiom enables true constant offset accesses.

Note that on platforms with 64-bit pointers, the compiler cannot automatically transform code like

int *a;
unsigned i, j;

a[i+j]

into “(a+i)[j]”, “(a+j)[i]”, or “*(a+i+j)”, unless it can deduce that “i+j” does not overflow. If “i+j” overflows in the unsigned type, the result wraps around, so the transformation would change the resulting pointer. If “i+j” overflowed in a signed type, that would invoke undefined behavior, so the compiler can assume it does not happen and optimize safely. If “i+j” overflows in a 64-bit type, the addition already happens at pointer width, so the compiler can still safely make the transformation.

Thanks @monoid!

That explains it and I verified that it results in the expected optimization.

Some more on undefined behavior and why it’s useful for optimization:

http://blog.regehr.org/archives/213
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_14.html
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know_21.html
Undefined Behavior: What Happened to My Code?

▶▶▶ Caveman summary: try to avoid unsigned integers unless you’re actually relying on their defined wrap-around behavior. ◀◀◀

There are several good examples in the links.

While I am aware of the optimization impact of signed versus unsigned arithmetic, I am not sure how it applies here. The addresses used with STS should be memory-space specific, not generic 64-bit addresses, and thus should be represented as 32-bit quantities meaning the transformation would be valid even with 32-bit wrap-around. It is early in the morning, I have not had coffee yet, what am I missing?

A wise boss of mine with more than two decades of experience told me many years ago: “Every integer quantity should be represented as an ‘int’, unless there is an exceedingly good reason for it to be anything else”. Over the years, I have found that to be true.

@njuffa, I agree that non-generic shared pointers could be memory-space specific but the PTX shows the shared pointer is 64-bits in a 64-bit compile:

mov.u64 	%rd95, shared;
add.s64 	%rd96, %rd95, %rd94;
st.shared.u64 	[%rd96], %rd93;

Maybe this can be improved in the compiler… not sure what the long-term cost would be to the codebase.

It might be too much of a hassle to support generic addressing and memory-space specific pointers?

You could probably hand-code PTX that would intermingle 64-bit global pointers with 32-bit shared pointers, no?

I’m also curious if anyone has recommendations for C++ static analysis tools that won’t barf on CUDA kernels?

The nvvm portion of the CUDA compiler is pretty much architecture agnostic. Keeping it that way seems like a good thing because that means it can continue to absorb generic LLVM goodness. Memory spaces are something that run counter to the C/C++ philosophy of “a pointer is a pointer is a pointer”, so keeping this out of nvvm seems beneficial.

ptxas on the other hand is an optimizing compiler, not just an assembler, with knowledge of the target architecture. At some point in the compilation ptxas is clearly aware of the fact that addresses specific to the shared memory space are limited to 32 bits. At least I cannot recall ever seeing it emit machine code that carries around 64-bit addresses when accessing shared memory. It also emits memory space specific load and store instructions (LDC, LDL, LDS, STL, STS) which is another indication it has knowledge of these particular memory spaces.

It stands to reason that it should be able to transform compile-time constant offset computation into the [reg+imm] addressing mode when used with LDS and STS instructions. For code operating on shared memory both array indexes and addresses are 32-bit quantities, so there should be no issues with wrap-around even if the indexes are unsigned integers. It seems to me that similar analysis would apply to addresses specific to the local and constant memory space, which likewise are restricted to 32 bits.

What kind of C++ static analysis tool are you looking for? What do you use for C++ code on the host?

Ah, you’re right (no surprise).

Inspecting the SASS a little more verifies that the 64-bit pointer to shared mem in PTX is reduced to a 32-bit register for the STS address operand. Huzzah.

So the unsigned constant offset + %laneid result remains curious… while a signed constant offset does generate the expected register+immediate form.

Re: static analysis – I’m using VS2013 and would like to hear if others have integrated either a linter or a static analysis tool into their build. There are a lot of new tools out there… probably because of the success of clang/LLVM.

Compile this to PTX code:

sharedMemory is a uint32 *

y0 = sharedMemory[x0 & 0xff]

In the PTX code you need four instructions just to calculate the address:

load a pointer
x0 = x0 & 0xff
left-shift x0 by 2 (*4)
add x0 to the pointer register
load the value from shared mem

ld.shared.u32 	%r10282, [%r10281+1024];
and.b32 	%r10283, %r28, 0xff;
shl.b32 	%r10283, %r28, 2;
add.s32 	%r10284, %r10283, %r10280;
ld.shared.u32 	%r10285, [%r10284+2048];

you can remove the and, the shift, and the add by using a bitfield insert directly into the pointer:

https://github.com/sp-hash/ccminer/commit/6313154da0fb3dd154896779f6c584bb5cf9e626

Load an aligned pointer (the low bits in the register are 0)
bfi 8 bits at position 2 (this does the and, the shift, and the add in one instruction)
load from memory

Five instructions have shrunk to three. Small speedup on Maxwell.

resulting in this:

ld.shared.u32 	%r16175, [%r16174+3072];
bfi.b32 	%r6880, %r6924, %r10244, %r10245, %r10258;
ld.u32 	%r16176, [%r6880];