I have been running into a weird PTX code-generation issue when storing a uint2 to global memory. Consider the following two snippets:
A) uint2 pair; uint2 *dpairs_out; extern __shared__ unsigned lsum[]; ... dpairs_out[threadIdx.x+lsum[key]] = pair;
B) unsigned index; uint2 pair; uint2 *dpairs_out; extern __shared__ unsigned lsum[]; ... index = threadIdx.x+lsum[key]; dpairs_out[index] = pair;
(The only difference in B is that the offset is first stored in the variable index.)
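For context, here is a minimal self-contained sketch of the two variants as full kernels. The kernel names, the keys input, and the dynamically sized lsum array are my own placeholders, and the prefix-sum code that fills lsum is elided; only the store at the end matters:

```cuda
#include <cuda_runtime.h>

// Variant A: the offset expression appears directly in the subscript.
__global__ void scatter_a(uint2 *dpairs_out, const unsigned *keys)
{
    extern __shared__ unsigned lsum[];   // placeholder: filled by elided prefix-sum code
    unsigned key = keys[blockIdx.x * blockDim.x + threadIdx.x];
    uint2 pair = make_uint2(threadIdx.x, key);
    // ... fill lsum[] and __syncthreads() ...
    dpairs_out[threadIdx.x + lsum[key]] = pair;   // observed: two st.global.u32
}

// Variant B: identical, except the offset goes through a named variable first.
__global__ void scatter_b(uint2 *dpairs_out, const unsigned *keys)
{
    extern __shared__ unsigned lsum[];
    unsigned key = keys[blockIdx.x * blockDim.x + threadIdx.x];
    uint2 pair = make_uint2(threadIdx.x, key);
    // ... fill lsum[] and __syncthreads() ...
    unsigned index = threadIdx.x + lsum[key];
    dpairs_out[index] = pair;                     // observed: one st.global.v2.u32
}
```

The emitted PTX for each can be inspected with nvcc -ptx (or cuobjdump --dump-ptx on the compiled binary).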
Amazingly, code B runs noticeably faster than A. Looking at the generated PTX, I see that in A the uint2 assignment compiles to two separate st.global.u32 instructions, while in B a single vectorized st.global.v2.u32 is used.
That explains why B is faster. What worries me, however, is that it is very easy to mix up A and B. How can I make sure the vectorized store is generated without inspecting the PTX every time? Do you have any suggestions?
This doesn't happen if I replace uint2 with unsigned long long (a 64-bit unsigned integer); in that case a single st.global.u64 is emitted.
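For comparison, the 64-bit case looked roughly like this (again with my placeholder names and elided prefix-sum code; the point is only that the inline-offset form still yields one store):

```cuda
__global__ void scatter_u64(unsigned long long *dout, const unsigned *keys)
{
    extern __shared__ unsigned lsum[];   // placeholder, as above
    unsigned key = keys[blockIdx.x * blockDim.x + threadIdx.x];
    // pack the two 32-bit values into one 64-bit word
    unsigned long long packed = ((unsigned long long)key << 32) | threadIdx.x;
    // ... fill lsum[] and __syncthreads() ...
    dout[threadIdx.x + lsum[key]] = packed;   // single st.global.u64 even with the inline offset
}
```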