I was a little surprised that this code didn’t generate a register+immediate store:
shared.keys[row * WARP_SIZE + warp_lane_idx()].b64 = key->b64;
Here ‘row’ is a compile-time constant.
A simple rearrangement produces the desired code and saves one register and at least one instruction per store:
(shared.keys + row * WARP_SIZE)[warp_lane_idx()].b64 = key->b64;
Here is what the SASS looks like for each case. There are 7 back-to-back stores to shared memory:
If you were expecting the address operand to be in register+immediate form and you’re not seeing it in the SASS, try tweaking your array access expression. Again, I think the first expression should have given me what I wanted.
I haven’t seen this before, but this is new code. I’ll double-check ld/st.global and ld.shared when I get the chance.
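For anyone who wants to compare the two forms on their own toolchain, here is a minimal sketch of a reproduction. It is not the original code: the key64 union, the shared_t layout, the body of warp_lane_idx() (including its unsigned return type, which may itself affect the result), the store_rows kernel and the count_mismatches() helper are all assumptions made up for illustration. The unrolled loop is meant to turn ‘row’ into a compile-time constant, and there is no guarantee that this sketch shows the same difference in the SASS.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32
#define ROWS      7   // seven stores per lane, as in the post

// Hypothetical 64-bit key type; the real definition is not shown in the post.
union key64 { unsigned long long b64; unsigned int b32[2]; };

// Hypothetical shared-memory layout: ROWS rows of WARP_SIZE keys.
struct shared_t { key64 keys[ROWS * WARP_SIZE]; };

// Assumed implementation; the original helper is not shown in the post.
__device__ __forceinline__ unsigned int warp_lane_idx()
{
    return threadIdx.x & (WARP_SIZE - 1);
}

// kRebased == false: index computed inside the subscript (first expression).
// kRebased == true : constant row offset folded into the base pointer first.
template <bool kRebased>
__global__ void store_rows(const key64* in, key64* out)
{
    __shared__ shared_t shared;

    #pragma unroll
    for (int row = 0; row < ROWS; ++row) {   // row is constant once unrolled
        const key64* key = &in[row * WARP_SIZE + warp_lane_idx()];
        if (kRebased)
            (shared.keys + row * WARP_SIZE)[warp_lane_idx()].b64 = key->b64;
        else
            shared.keys[row * WARP_SIZE + warp_lane_idx()].b64 = key->b64;
    }
    __syncthreads();

    // Read back from the mirrored lane so the shared stores are really used.
    const unsigned int src = WARP_SIZE - 1 - warp_lane_idx();
    #pragma unroll
    for (int row = 0; row < ROWS; ++row)
        out[row * WARP_SIZE + warp_lane_idx()].b64 =
            shared.keys[row * WARP_SIZE + src].b64;
}

// Runs both variants on one warp and counts elements that differ from the
// expected mirrored input. Both forms should behave identically.
int count_mismatches()
{
    const int n = ROWS * WARP_SIZE;
    key64 h_in[n], h_a[n] = {}, h_b[n] = {};
    for (int i = 0; i < n; ++i)
        h_in[i].b64 = 0x100000000ull * i + i;

    key64 *d_in, *d_a, *d_b;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_a, sizeof(h_a));
    cudaMalloc(&d_b, sizeof(h_b));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    store_rows<false><<<1, WARP_SIZE>>>(d_in, d_a);
    store_rows<true><<<1, WARP_SIZE>>>(d_in, d_b);

    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, sizeof(h_b), cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int row = 0; row < ROWS; ++row)
        for (int lane = 0; lane < WARP_SIZE; ++lane) {
            const int i = row * WARP_SIZE + lane;
            const unsigned long long expected =
                h_in[row * WARP_SIZE + (WARP_SIZE - 1 - lane)].b64;
            if (h_a[i].b64 != expected || h_b[i].b64 != expected) ++bad;
        }

    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
    return bad;
}
```

Building this with nvcc -cubin for your target architecture and dumping the result with cuobjdump -sass should let you look at the shared-memory stores in the two store_rows instantiations side by side.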
Platform: CUDA 7.5 RC.