I’m loading my data as short4 to get higher memory bandwidth, but the compiler seems to generate very inefficient code. Here’s a simple case:
__global__ void BandwidthTest(int16_t *in)
{
    __shared__ short4 mydata;
    for (int i = ...)
    {
        mydata = *(short4 *)in;
    }
}
This is the resulting Kepler assembly code from cuobjdump:
/*0078*/ LD.E.64 R6, [R8]; /* 0x8400000000819ca5 */
/* 0x2282023232323047 */
/*0088*/ IADD R0, R3, R0; /* 0x4800000000301c03 */
/*0090*/ I2I.U32.U16 R4, R6; /* 0x1c00000018a11c04 */
/*0098*/ I2I.U32.U16 R2, R6.H1; /* 0x1d00000018a09c04 */
/*00a0*/ I2I.U16.U16 R6, R7; /* 0x1c0000001c919c04 */
/*00a8*/ I2I.U32.U16 R5, R7.H1; /* 0x1d0000001ca15c04 */
/*00b0*/ BFI R4, R2, 0x1010, R4; /* 0x2808c04040211c03 */
/*00b8*/ IADD R2, R0, 0x4; /* 0x4800c00010009c03 */
/* 0x200002e2e042c047 */
/*00c8*/ BFI R5, R5, 0x1010, R6; /* 0x280cc04040515c03 */
/*00d0*/ ISETP.LT.AND P0, PT, R2, c[0x0][0x150], PT; /* 0x188e40054021dc23 */
/*00d8*/ STS.64 [RZ], R4; /* 0xc900000003f11ca5 */
/*00e0*/ @P0 BRA 0x78; /* 0x4003fffe400001e7 */
The inefficiency is that it unpacks (I2I) the 16-bit data to 32 bits and then repacks it (BFI). Is there some arcane semantic in CUDA that insists it be done this way?
I have found a workaround: using uint2 instead, plus hand-written inline PTX that explicitly uses either the low or high half of the 32-bit register rather than extracting them. Should I file this issue with NVIDIA?
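For reference, here is a minimal sketch of the uint2 part of the workaround (kernel name and loop bounds are illustrative, and the inline-PTX half-register tricks are omitted):

```cuda
#include <cstdint>

// Sketch: treat the data as packed 32-bit words (uint2) so the compiler
// copies the words verbatim instead of unpacking each 16-bit lane to
// 32 bits (I2I) and repacking it (BFI).
__global__ void BandwidthTestPacked(const int16_t *in, int n)
{
    __shared__ uint2 mydata;
    const uint2 *p = reinterpret_cast<const uint2 *>(in);
    for (int i = 0; i < n / 4; ++i)   // four 16-bit values per uint2
    {
        mydata = p[i];                // 64-bit load, 64-bit shared store
    }
    // When a 16-bit lane is actually needed, it can be extracted with
    // shifts/masks (or inline PTX on the register halves), e.g.:
    //   uint16_t lo = (uint16_t)(mydata.x & 0xffff);
    //   uint16_t hi = (uint16_t)(mydata.x >> 16);
}
```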