Hi all,
I have been running into some odd PTX code generation when storing a uint2 to global memory. Consider the following two code snippets:
Code A:
uint2 pair;
uint2 *dpairs_out;
__shared__ unsigned lsum[16];
...
dpairs_out[threadIdx.x+lsum[key]] = pair;
Code B:
unsigned index;
uint2 pair;
uint2 *dpairs_out;
__shared__ unsigned lsum[16];
...
index = threadIdx.x+lsum[key];
dpairs_out[index] = pair;
(The only difference in B is that I first store the offset in the variable index.)
Amazingly, code B runs noticeably faster than A. Looking at the PTX code generation, I see that in A, there are two st.global.u32 instructions generated for the uint2 assignment, while in B, a single st.global.v2.u32 is used.
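In case anyone wants to try this, here is a minimal sketch of a self-contained repro. The kernel names, the way lsum and key are filled in, and the sizes are placeholders I made up for illustration, not my real code, and I can't promise that this reduced version shows the same split stores on every toolkit version. Compiling with nvcc -ptx and searching the output for st.global should show which store instructions each kernel gets.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

#define NUM_KEYS 16
#define BLOCK 64

// Placeholder setup shared by both kernels: lsum[k] = k * BLOCK, so every
// thread ends up with a distinct output slot in [0, NUM_KEYS * BLOCK).
__device__ void fill_lsum(unsigned *lsum)
{
    if (threadIdx.x < NUM_KEYS) lsum[threadIdx.x] = threadIdx.x * BLOCK;
    __syncthreads();
}

// Variant A: the index expression is written directly in the subscript.
__global__ void store_inline(uint2 *dpairs_out)
{
    __shared__ unsigned lsum[NUM_KEYS];
    fill_lsum(lsum);
    unsigned key = threadIdx.x % NUM_KEYS;      // illustrative key
    uint2 pair = make_uint2(threadIdx.x, key);  // illustrative payload
    dpairs_out[threadIdx.x + lsum[key]] = pair;
}

// Variant B: the index is first stored in an unsigned variable.
__global__ void store_via_index(uint2 *dpairs_out)
{
    __shared__ unsigned lsum[NUM_KEYS];
    fill_lsum(lsum);
    unsigned key = threadIdx.x % NUM_KEYS;
    uint2 pair = make_uint2(threadIdx.x, key);
    unsigned index = threadIdx.x + lsum[key];
    dpairs_out[index] = pair;
}

// Runs both kernels and checks that they write identical results.
bool outputs_match()
{
    const size_t n = NUM_KEYS * BLOCK;
    const size_t bytes = n * sizeof(uint2);
    uint2 *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemset(d_a, 0, bytes);
    cudaMemset(d_b, 0, bytes);

    store_inline<<<1, BLOCK>>>(d_a);
    store_via_index<<<1, BLOCK>>>(d_b);
    cudaDeviceSynchronize();

    std::vector<uint2> h_a(n), h_b(n);
    cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return std::memcmp(h_a.data(), h_b.data(), bytes) == 0;
}

int main()
{
    printf("outputs %s\n", outputs_match() ? "match" : "differ");
    return 0;
}
```

Both kernels should produce the same data, so any difference is only in the generated instructions, e.g. nvcc -ptx repro.cu -o repro.ptx (repro.cu being whatever you name the file).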
This does explain why B is faster. What worries me, however, is that it is very easy to mix up A and B. How should I approach this without looking at the PTX code? Do you have any suggestions?
This doesn’t happen if I replace uint2 with unsigned long long (a 64-bit unsigned integer); in that case there is just a single st.global.u64.
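For comparison, the 64-bit variant can be sketched roughly as below. The names pack_pair and store_packed are illustrative, and packing the two halves by hand is just one way to write it; it assumes a little-endian layout so that x lands in the low 32 bits, matching how a uint2 sits in memory. I have only observed the single st.global.u64 in my own kernel, so treat this as a sketch rather than a guaranteed fix.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Packs a uint2 into one 64-bit value: x in the low half, y in the high half.
__host__ __device__ inline unsigned long long pack_pair(uint2 p)
{
    return (static_cast<unsigned long long>(p.y) << 32) | p.x;
}

// Stores each pair as a single 64-bit value instead of as a uint2.
__global__ void store_packed(unsigned long long *out)
{
    uint2 pair = make_uint2(threadIdx.x, 2u * threadIdx.x);  // illustrative payload
    unsigned index = threadIdx.x;
    out[index] = pack_pair(pair);
}

int main()
{
    const unsigned n = 32;
    unsigned long long *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(unsigned long long));
    store_packed<<<1, n>>>(d_out);

    // Read the buffer back as uint2 to check that the layout matches.
    uint2 h_out[n];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    bool ok = true;
    for (unsigned i = 0; i < n; ++i)
        ok = ok && h_out[i].x == i && h_out[i].y == 2u * i;
    printf("layout %s\n", ok ? "matches uint2" : "does not match uint2");
    return 0;
}
```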
Thanks!
Huy