Alignment doesn't help for constant memory?

It seems that storing aligned structures in constant memory doesn’t help performance. This also shows in the PTX code generated by nvcc: no ld.const.v2.s16 or ld.const.v4.s16 instruction is ever generated.

For example, the following kernel:

struct __align__((8)) _MotionVector {
    int16_t x1,x2,y1,y2;
};

__constant__ _MotionVector motion_vectors[100];

static __global__ void s_transform_h( int16_t* data )
{
    _MotionVector x = motion_vectors[0];
    data[0] = x.x1;
    data[1] = x.x2;
    data[2] = x.y1;
    data[3] = x.y2;
}


is converted to this PTX code:

        ld.param.u64    $rd1, [__cudaparm_data];        //  id:23 __cudaparm_data+0x0
        ld.const.s16    $rh1, [motion_vectors+0];       //  id:24 motion_vectors+0x0
        st.global.s16   [$rd1+0], $rh1;                 //  id:25
        .loc    2       18      0
        ld.const.s16    $rh2, [motion_vectors+2];       //  id:26 motion_vectors+0x2
        st.global.s16   [$rd1+2], $rh2;                 //  id:27
        .loc    2       19      0
        ld.const.s16    $rh3, [motion_vectors+4];       //  id:28 motion_vectors+0x4
        st.global.s16   [$rd1+4], $rh3;                 //  id:29
        .loc    2       20      0
        ld.const.s16    $rh4, [motion_vectors+6];       //  id:30 motion_vectors+0x6
        st.global.s16   [$rd1+6], $rh4;                 //  id:31

I know very little about compiling PTX manually. Could someone explain how to do this, or test whether instructions for 64-bit reads from constant memory exist at all?

I just tested: yes, ptxas does support ld.const.v4.s16.
However, I doubt there would be any noticeable performance gain.
Testing performance is tricky because constant memory access time is mostly insignificant compared to global memory access time, and I haven’t done that yet.

I did notice the number of accesses to constant memory is very important, especially if different threads can potentially access different offsets.
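To make the point about access patterns concrete, here is a minimal sketch, assuming the same struct layout as in the first post. The kernel names, the time_kernel helper, and the launch sizes are hypothetical, chosen only for illustration. It contrasts a read where all threads of a warp use the same constant-memory index with one where neighbouring threads use different indices. No timings are claimed here; actual numbers depend on the hardware and compiler.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

struct __align__(8) MotionVector { int16_t x1, x2, y1, y2; };

__constant__ MotionVector motion_vectors[100];

// All threads of a block read the same element, so every thread in a
// warp hits the same constant-memory address (served as a broadcast).
__global__ void uniform_read(int* out)
{
    MotionVector v = motion_vectors[blockIdx.x % 100];
    out[blockIdx.x * blockDim.x + threadIdx.x] = v.x1 + v.x2 + v.y1 + v.y2;
}

// Neighbouring threads read different elements, so the threads of a
// warp hit different constant-memory addresses.
__global__ void divergent_read(int* out)
{
    MotionVector v = motion_vectors[threadIdx.x % 100];
    out[blockIdx.x * blockDim.x + threadIdx.x] = v.x1 + v.x2 + v.y1 + v.y2;
}

// Hypothetical helper: returns the elapsed time of one launch in ms.
float time_kernel(void (*kernel)(int*), int* out, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    kernel<<<blocks, threads>>>(out);   // warm-up launch, not timed
    cudaEventRecord(start);
    kernel<<<blocks, threads>>>(out);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int blocks = 1024, threads = 256;   // arbitrary sizes
    MotionVector host[100];
    for (int i = 0; i < 100; ++i)
        host[i] = { int16_t(i), int16_t(i + 1), int16_t(i + 2), int16_t(i + 3) };
    cudaMemcpyToSymbol(motion_vectors, host, sizeof(host));

    int* out = nullptr;
    cudaMalloc(&out, blocks * threads * sizeof(int));
    printf("uniform:   %f ms\n", time_kernel(uniform_read, out, blocks, threads));
    printf("divergent: %f ms\n", time_kernel(divergent_read, out, blocks, threads));
    cudaFree(out);
    return 0;
}
```

Both kernels write the same amount of global memory, so any timing difference should come mainly from the constant-memory access pattern, although the global stores may still dominate the total.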

I already submitted a bug report to NVIDIA about a similar case involving global memory accesses; I hope the fix will cover this one too.

It seems you are probably right. I tried loading a 32-bit or 64-bit value and then splitting up the fields manually using bit shifts, and it became slower instead of faster. Of course, the ld.const.v4.s16 instruction would be faster than bit shifts, but I doubt it would make much difference.
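For anyone who wants to repeat the bit-shift experiment, this is roughly what it looks like. It is a sketch under assumptions, not the exact code that was timed: the names unpack and load_packed are made up for illustration, and whether the compiler really emits a single 64-bit constant load for the cast is not guaranteed, so check the generated PTX.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

struct __align__(8) MotionVector { int16_t x1, x2, y1, y2; };

__constant__ MotionVector motion_vectors[100];

// Split one 64-bit word into four 16-bit fields. Assumes a
// little-endian layout, so x1 occupies the lowest 16 bits.
__host__ __device__ inline MotionVector unpack(unsigned long long bits)
{
    MotionVector v;
    v.x1 = static_cast<int16_t>(bits & 0xFFFF);
    v.x2 = static_cast<int16_t>((bits >> 16) & 0xFFFF);
    v.y1 = static_cast<int16_t>((bits >> 32) & 0xFFFF);
    v.y2 = static_cast<int16_t>((bits >> 48) & 0xFFFF);
    return v;
}

__global__ void load_packed(int16_t* data)
{
    // Read the 8-byte aligned struct as a single 64-bit word,
    // then split it with shifts instead of four 16-bit loads.
    const unsigned long long bits =
        *reinterpret_cast<const unsigned long long*>(&motion_vectors[0]);
    MotionVector v = unpack(bits);
    data[0] = v.x1;
    data[1] = v.x2;
    data[2] = v.y1;
    data[3] = v.y2;
}

int main()
{
    MotionVector host[100] = {};
    host[0] = { int16_t(1), int16_t(-2), int16_t(3), int16_t(-4) };
    cudaMemcpyToSymbol(motion_vectors, host, sizeof(host));

    int16_t* d_data = nullptr;
    cudaMalloc(&d_data, 4 * sizeof(int16_t));
    load_packed<<<1, 1>>>(d_data);

    int16_t result[4];
    cudaMemcpy(result, d_data, sizeof(result), cudaMemcpyDeviceToHost);
    printf("%d %d %d %d\n", result[0], result[1], result[2], result[3]);
    cudaFree(d_data);
    return 0;
}
```

The shifts and masks are extra ALU work on top of the load, which is consistent with this version coming out slower than four plain 16-bit constant loads.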