So I tried updating to Cuda 7.0 on ubuntu 14.04 running on some 980GTXs. I had massive performance degradation (at least 4 times slower), I think I’ve narrowed to cause down to the PTX to SASS compiler being silly.
So I’m unsure If I can publicly post the code so I’ll put the parts I think may be relevant.
typedef struct {
int vals[QRY_LEN];
} query_t;
__global__ void
lookup_kernel(const int * const __restrict__ index, gph_kern_result * __restrict__ candidates, query_t qry);
Now the code basic looks at subsets of the integers of index and qry and looks for minimum bit differences. So the main loop consists of things functionally equivalent to.
biterr = __popc(qry[x] ^ index[y]) + __popc(qry[x + 32] ^ index[y + 1]) + __popc(qry[x + 64] ^ index[y + 2]);
best_biter = min(best_biter, bitter);
I’ll look at sm_50 SASS since I know how to extract that from 6.5 code, though the ptx is probably getting recompiled for the 980’s sm_52.
Now in 6.5 The __popc(qry ^ index[y]) get’s transformed directly into a few instruction. NOTE the IADD is for a previous __popc. R22 is from a previous global load.
126 /*0328*/ LOP.XOR R28, R22, c[0x0][0x2ac]; /* 0 x4c4704000ab7161c */
127 /*0330*/ IADD R29, R29, R24; /* 0 x5c10000001871d1d */
128 /*0338*/ POPC R28, R28;
So the core loop is 3 instructions per __popc (xor, pop, add / min)
But in 7.0 the SASS compiler seems to think the constant value from qry needs to be loaded a single byte at a time and SHLed, ORed and LOP3.LUTed back together into the original int. Note I just copied and pasted various pieces that show the relevant instructions
76 /*01a8*/ LDC.U8 R4, c[0x0][R18+0x82]; /* 0xef90000008271204 */
77 /*01b0*/ LDC.U8 R7, c[0x0][R18+0xd7]; /* 0xef9000000d771207 */
78 /*01b8*/ LDC.U8 R3, c[0x0][R18+0x102];
…
100 /*0268*/ I2I.U32.U8 R11, R11; /* 0x5ce0000000b7020b */
101 /*0270*/ LDC.U8 R19, c[0x0][R18+0x157]; /* 0xef90000015771213 */
102 /*0278*/ I2I.U32.U8 R4, R4;
…
120 /*0308*/ { SHL R11, R11, 0x8; /* 0x3848000000870b0b */
121 /*0310*/ I2I.U32.U8 R8, R8; } /* 0x5ce0000000870208 */
122 /*0318*/ { LOP.OR R9, R5, R4; /* 0x5c47020000470509 */
123 /*0328*/ I2I.U32.U8 R4, R21; }
…
340 /*09e8*/ I2I.U32.U8 R16, R31; /* 0x5ce0000001f70210 */
341 /*09f0*/ { SHL R31, R10, 0x10; /* 0x3848000001070a1f */
342 /*09f8*/ LDC.U8 R30, c[0x0][R18+0x155]; } /* 0xef9000001557121e */
343 /* 0x041fc000822007f0 */
344 /*0a08*/ { LOP3.LUT R22, R31, R28, R22, 0xfe; /* 0x5be70b0fe1c71f16 */
345 /*0a10*/ I2I.U32.U8 R27, R21; }
So now 3 instructs is like 10 and register usage explodes, which with my 32 register limit (optimal for 6.5) leads to lots of locals, so performance is decimated.
I could almost understand this behavior if qry could be unaligned but it’s passed by value, so I have no idea why it would decide to load int constants a byte at a time.
I obviously sticking with 6.5 until a find a work around somewhere, or 7.0 gets a bug fix. Though I was hoping to use a couple features from 7.0 they’re not critical.