Inconsistent PTX generation for uint2 stores?

Hi all,

I have been running into odd PTX code generation when storing a uint2 to global memory. Consider the following code snippets:

Code A:

uint2 pair;
uint2 *dpairs_out;
__shared__ unsigned lsum[16];
...
dpairs_out[threadIdx.x+lsum[key]] = pair;

Code B:

unsigned index;
uint2 pair;
uint2 *dpairs_out;
__shared__ unsigned lsum[16];
...
index = threadIdx.x+lsum[key];
dpairs_out[index] = pair;

(The only difference in B is that I first store the offset in the variable index.)

Surprisingly, code B runs noticeably faster than A. Looking at the generated PTX, I see that A emits two st.global.u32 instructions for the uint2 assignment, while B uses a single st.global.v2.u32.
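For anyone who wants to try this, here is a minimal sketch of the two variants as standalone kernels. It is not my real code: the kernel names (store_inline, store_via_index), the keys array, the placeholder contents of lsum, and the 16-thread launch are all assumptions made up for illustration. It only checks that both variants write the same data; which store instructions each one compiles to has to be read from the PTX.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NTHREADS 16   // assumed launch size, for illustration only
#define NOUT     256  // large enough for the maximum index 15 + 16*15 = 255

// Hypothetical stand-in for the real offsets: lsum[k] = 16 * k.
__device__ void fill_lsum(unsigned *lsum)
{
    if (threadIdx.x < 16) lsum[threadIdx.x] = 16u * threadIdx.x;
    __syncthreads();
}

// Variant A: the index expression is written directly in the subscript.
__global__ void store_inline(uint2 *dpairs_out, const unsigned *keys)
{
    __shared__ unsigned lsum[16];
    fill_lsum(lsum);
    unsigned key = keys[threadIdx.x] & 15u;
    uint2 pair = make_uint2(threadIdx.x, key);
    dpairs_out[threadIdx.x + lsum[key]] = pair;
}

// Variant B: the index is first stored in a local variable.
__global__ void store_via_index(uint2 *dpairs_out, const unsigned *keys)
{
    __shared__ unsigned lsum[16];
    fill_lsum(lsum);
    unsigned key = keys[threadIdx.x] & 15u;
    uint2 pair = make_uint2(threadIdx.x, key);
    unsigned index = threadIdx.x + lsum[key];
    dpairs_out[index] = pair;
}

// Runs both kernels on the same input and returns the number of
// mismatching output elements, or -1 if a CUDA call failed.
int compare_variants()
{
    unsigned h_keys[NTHREADS];
    for (int i = 0; i < NTHREADS; ++i) h_keys[i] = i;

    unsigned *d_keys = nullptr;
    uint2 *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_keys, sizeof(h_keys));
    cudaMalloc(&d_a, NOUT * sizeof(uint2));
    cudaMalloc(&d_b, NOUT * sizeof(uint2));
    cudaMemcpy(d_keys, h_keys, sizeof(h_keys), cudaMemcpyHostToDevice);
    cudaMemset(d_a, 0, NOUT * sizeof(uint2));
    cudaMemset(d_b, 0, NOUT * sizeof(uint2));

    store_inline<<<1, NTHREADS>>>(d_a, d_keys);
    store_via_index<<<1, NTHREADS>>>(d_b, d_keys);
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    static uint2 h_a[NOUT], h_b[NOUT];
    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, sizeof(h_b), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < NOUT; ++i)
        if (h_a[i].x != h_b[i].x || h_a[i].y != h_b[i].y) ++mismatches;

    cudaFree(d_keys);
    cudaFree(d_a);
    cudaFree(d_b);
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", compare_variants());
    return 0;
}
```

Compiling the file with nvcc -ptx and searching the output for st.global should show which store each kernel gets. I would not assume the result is the same across toolkit versions or target architectures, so this small repro may or may not show the split stores I see in my full kernel.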

This does explain why B is faster. What worries me, however, is that it is very easy to mix up A and B. How should I guard against this without inspecting the PTX every time? Do you have any suggestions?

This doesn’t happen if I replace uint2 with unsigned long long (a 64-bit unsigned integer); in that case there is just a single st.global.u64.

Thanks!

Huy