It seems that storing aligned structures in constant memory doesn't help performance. This also shows in the PTX code generated by nvcc: no ld.const.v2.s16, ld.const.v4.s16, or similar vector load is ever generated.
For example, the following kernel:
struct __align__((8)) _MotionVector {
    int16_t x1, x2, y1, y2;
};

__constant__ _MotionVector motion_vectors[100];

static __global__ void s_transform_h( int16_t* data )
{
    _MotionVector x = motion_vectors[0];
    data[0] = x.x1;
    data[1] = x.x2;
    data[2] = x.y1;
    data[3] = x.y2;
}
is converted to this PTX code:
ld.param.u64 $rd1, [__cudaparm_data]; // id:23 __cudaparm_data+0x0
ld.const.s16 $rh1, [motion_vectors+0]; // id:24 motion_vectors+0x0
st.global.s16 [$rd1+0], $rh1; // id:25
.loc 2 18 0
ld.const.s16 $rh2, [motion_vectors+2]; // id:26 motion_vectors+0x2
st.global.s16 [$rd1+2], $rh2; // id:27
.loc 2 19 0
ld.const.s16 $rh3, [motion_vectors+4]; // id:28 motion_vectors+0x4
st.global.s16 [$rd1+4], $rh3; // id:29
.loc 2 20 0
ld.const.s16 $rh4, [motion_vectors+6]; // id:30 motion_vectors+0x6
st.global.s16 [$rd1+6], $rh4; // id:31
I know very little about compiling PTX manually. Could someone explain how to do this, or how to test whether instructions for 64-bit reads from constant memory exist at all?
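In case it helps anyone reproduce this, here is a minimal sketch of a variant that could be compared against the kernel above. It assumes that CUDA's built-in short4 vector type (which is 8-byte aligned) is an acceptable stand-in for _MotionVector. The names motion_vectors_v4 and s_transform_h_v4 are made up for this example. Whether nvcc emits a vector load for it is exactly the open question, so this makes no claim about the resulting PTX.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant: the built-in short4 type replaces the custom struct.
__constant__ short4 motion_vectors_v4[100];

__global__ void s_transform_h_v4(short *data)
{
    // Copy one whole element, then store its four components one by one.
    short4 v = motion_vectors_v4[0];
    data[0] = v.x;
    data[1] = v.y;
    data[2] = v.z;
    data[3] = v.w;
}

int main()
{
    // Fill the constant array with known values (assumed test data).
    short4 host_vectors[100] = {};
    host_vectors[0] = make_short4(1, 2, 3, 4);
    cudaMemcpyToSymbol(motion_vectors_v4, host_vectors, sizeof(host_vectors));

    short *d_data = nullptr;
    cudaMalloc(&d_data, 4 * sizeof(short));
    s_transform_h_v4<<<1, 1>>>(d_data);

    // cudaMemcpy waits for the kernel to finish before copying back.
    short h_data[4] = {0, 0, 0, 0};
    cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    printf("%d %d %d %d\n", h_data[0], h_data[1], h_data[2], h_data[3]);
    return 0;
}
```

Running the file through nvcc -ptx should produce the PTX for this variant, so the loads from motion_vectors_v4 can be compared with the four ld.const.s16 instructions listed above.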