inefficient code when using vector types

I’m loading my data using the short4 type to achieve bigger memory bandwidth. But the compiler seems to generate very inefficient code. Here’s a simple case:

__global__ void BandwidthTest(int16_t *in)
  __shared__ short4 mydata;
  for (int i = ...)
    mydata = *(short4 *)in;

This is the resulting Kepler assembly code from cuobjdump

/*0078*/                LD.E.64 R6, [R8];                           /* 0x8400000000819ca5 */
                                                                            /* 0x2282023232323047 */
        /*0088*/                IADD R0, R3, R0;                            /* 0x4800000000301c03 */
        /*0090*/                I2I.U32.U16 R4, R6;                         /* 0x1c00000018a11c04 */
        /*0098*/                I2I.U32.U16 R2, R6.H1;                      /* 0x1d00000018a09c04 */
        /*00a0*/                I2I.U16.U16 R6, R7;                         /* 0x1c0000001c919c04 */
        /*00a8*/                I2I.U32.U16 R5, R7.H1;                      /* 0x1d0000001ca15c04 */
        /*00b0*/                BFI R4, R2, 0x1010, R4;                     /* 0x2808c04040211c03 */
        /*00b8*/                IADD R2, R0, 0x4;                           /* 0x4800c00010009c03 */
                                                                            /* 0x200002e2e042c047 */
        /*00c8*/                BFI R5, R5, 0x1010, R6;                     /* 0x280cc04040515c03 */
        /*00d0*/                ISETP.LT.AND P0, PT, R2, c[0x0][0x150], PT; /* 0x188e40054021dc23 */
        /*00d8*/                STS.64 [RZ], R4;                            /* 0xc900000003f11ca5 */
        /*00e0*/            @P0 BRA 0x78;                                   /* 0x4003fffe400001e7 */

The inefficiency is that it’s unpacking (I2I) the 16bit data to 32bits and repacking them again (BFI). Is there some arcane semantic in CUDA that insists it has to be done this way?

I have found a work around by using uint2 instead and manually writing inline PTX code that explicitly use either the low or high half of the 32 bit register instead of extracting them. Should I file this issue to NVIDIA?

Yeah, that looks bad.

I just tried a number of tests with the short4 jammed into a union with larger integral types. It’s even worse in that the short4 always forces the unnecessary I2I’s. If you explicitly access the larger integral type then proper code is generated.

It also happens with a short2 type.

Seems like a bug to me unless we’re missing some subtle requirement.

Thanks for confirming. I checked the PTX too, and there are no redundant instructions there. {%rs1, %rs2, %rs3, %rs4}, [%rd1];
st.shared.v4.u16 [_Z18WriteBandwidthTestPsS_i$__cuda_local_var_460590_40_non_const_temp], {%rs1, %rs2, %rs3, %rs4};

It seems the Kepler compiler is just doing a literal translation of the PTX, which would explain the extra instructions.

What really needs to be implemented is scalarization of aggregates:

Can someone from NVIDIA comment on when we can see this?

Please file a bug, attaching a self-contained repro case. The bug reporting form is linked from the registered developer website. Thanks.