It looks like a compiler optimization, believe it or not. Your first code compiles to this:
$LDWbegin__Z6kernelP6float2S0_:
.loc 28 5 0
cvt.u32.u16 %r1, %tid.x;
mov.u16 %rh1, %ctaid.x;
mov.u16 %rh2, %ntid.x;
mul.wide.u16 %r2, %rh1, %rh2;
add.u32 %r3, %r1, %r2;
cvt.s64.s32 %rd1, %r3;
mul.wide.s32 %rd2, %r3, 8;
ld.param.u64 %rd3, [__cudaparm__Z6kernelP6float2S0__in];
add.u64 %rd4, %rd3, %rd2;
ld.global.f32 %f1, [%rd4+4];
.loc 28 8 0
ld.param.u64 %rd5, [__cudaparm__Z6kernelP6float2S0__out];
add.u64 %rd6, %rd5, %rd2;
mov.f32 %f2, 0f42c80000; // 100
st.global.v2.f32 [%rd6+0], {%f2,%f1};
.loc 28 9 0
exit;
You can see that the compiler has elected not to load both halves of the float2, because only one of them is actually used: the load at [%rd4+4] fetches just the .y word with a scalar ld.global.f32. The store is still a vectorized st.global.v2.f32 and is coalesced, but the scalar loads from adjacent threads end up 8 bytes apart, which breaks coalescing on the load side.
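For context, since the original source isn't quoted here, the PTX above is consistent with a first kernel along these lines (a reconstruction, not necessarily the exact code):

__global__ void kernel(float2 *in, float2 *out) {
    int idx=blockIdx.x*blockDim.x+threadIdx.x;
    float2 d=in[idx];   // only d.y is actually needed below
    d.x = 100.f;        // .x is overwritten, so its load is dead
    out[idx] = d;       // stored as {100.f, in[idx].y}
}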
Changing the kernel to this:
__global__ void kernel(float2 *in, float2 *out) {
    int idx=blockIdx.x*blockDim.x+threadIdx.x;
    float2 d=in[idx];
    d.x += 100.f;
    out[idx] = d;
}
produces this:
$LDWbegin__Z6kernelP6float2S0_:
.loc 28 5 0
cvt.u32.u16 %r1, %tid.x;
mov.u16 %rh1, %ctaid.x;
mov.u16 %rh2, %ntid.x;
mul.wide.u16 %r2, %rh1, %rh2;
add.u32 %r3, %r1, %r2;
cvt.s64.s32 %rd1, %r3;
mul.wide.s32 %rd2, %r3, 8;
ld.param.u64 %rd3, [__cudaparm__Z6kernelP6float2S0__in];
add.u64 %rd4, %rd3, %rd2;
ld.global.v2.f32 {%f1,%f2}, [%rd4+0];
.loc 28 8 0
ld.param.u64 %rd5, [__cudaparm__Z6kernelP6float2S0__out];
add.u64 %rd6, %rd5, %rd2;
mov.f32 %f3, 0f42c80000; // 100
add.f32 %f4, %f1, %f3;
st.global.v2.f32 [%rd6+0], {%f4,%f2};
.loc 28 9 0
exit;
This version loads both halves of the float2 with a single ld.global.v2.f32, so both the load and the store should be coalesced. So the trick seems to be that the compiler has to see both parts of the float2 being used; otherwise it optimizes the unused half of the load away, turning the vector load into a strided scalar load that breaks the coalescing rules.
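If you want to reproduce this, a minimal harness along these lines should work (the array size, block size, and names here are just placeholders); the PTX itself can be dumped with nvcc -ptx on the .cu file:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(float2 *in, float2 *out) {
    int idx=blockIdx.x*blockDim.x+threadIdx.x;
    float2 d=in[idx];   // ld.global.v2.f32 (coalesced)
    d.x += 100.f;
    out[idx] = d;       // st.global.v2.f32 (coalesced)
}

int main() {
    const int n = 256, threads = 64;
    float2 *d_in, *d_out;
    cudaMalloc((void**)&d_in,  n*sizeof(float2));
    cudaMalloc((void**)&d_out, n*sizeof(float2));
    cudaMemset(d_in, 0, n*sizeof(float2));
    kernel<<<n/threads, threads>>>(d_in, d_out);
    cudaDeviceSynchronize();
    float2 h;
    cudaMemcpy(&h, d_out, sizeof(float2), cudaMemcpyDeviceToHost);
    printf("out[0] = (%f, %f)\n", h.x, h.y);   // expect (100.0, 0.0)
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}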