I have been working with CUDA for several years. Generally I code and test my programs in C on the CPU, then hand-compile them into PTX and retest the kernel on the GPU, first single-threaded and then multi-threaded, using the CUDA driver API. For compatibility reasons I am still on CUDA toolkit version 6.5.
Occasionally I find strange things happening: the kernel fails (error return from cuMemcpyDtoH) or produces an incorrect result. This might happen on the 15th sub-iteration of the 154307th iteration of the main loop, so it is time-consuming to isolate the cause. Then, when I add a couple of PTX statements to get more information back to the host, the problem disappears.
Recently, though, I have encountered an instance of abnormal behaviour which is reproducible in a small kernel, which I present below. I have left in context from the original problem rather than reducing it to meaningless variable names a, b, c etc.
I am exhaustively iterating combinations of four playing cards. I filter these combinations down to a subset based on their suit patterns. The motivation: suppose I have processed the combination 2c 5c 8c Ac interacting with other playing cards. Later on when I meet the combination 2h 5h 8h Ah I can transpose the suits so that I arrive at the first combination (i.e. swap clubs for hearts), transpose similarly the other cards, and avoid repeating the expensive computation. This is valid as long as the suits are equivalent to one another, as in no-trump play at Contract Bridge (but not bidding).
The sub-case that the toy kernel deals with occurs when it has already been established that, of the four cards, two are clubs and the other two are of different suits. Then for the card combination to be in “canonical form” the first non-club encountered when traversing the array of four cards must be a diamond and the second a heart. The test pattern is club-diamond-club-heart, which is in canonical form. In the PTX the suits are encoded clubs=0, diamonds=1, hearts=2, spades=3. The algorithm is to set the variable other_suit to diamonds, then index through the array. Each card must be either a club or equal to other_suit. In the latter case other_suit is incremented (diamonds to hearts, and so on); a card that is neither makes the combination non-canonical.
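For reference, the check described above can be sketched in C roughly as follows. This is a minimal illustration of the algorithm as stated, not the original CPU program; the function name is_canonical and the enum names are made up for the example.

```c
/* Suit encoding as used in the PTX kernel. */
enum { CLUBS = 0, DIAMONDS = 1, HEARTS = 2, SPADES = 3 };

/* Hypothetical helper: returns 1 if the four suits are in canonical form
 * for the "two clubs plus two different other suits" sub-case, else 0. */
static int is_canonical(const unsigned card_suit[4])
{
    unsigned other_suit = DIAMONDS;      /* next non-club suit we expect */

    for (unsigned i = 0; i < 4; i++) {
        unsigned suit = card_suit[i];
        if (suit == other_suit)
            other_suit++;                /* matched: expect the next suit */
        else if (suit != CLUBS)
            return 0;                    /* neither a club nor the expected suit */
    }
    return 1;
}
```

With the test pattern club-diamond-club-heart, i.e. {0, 1, 0, 2}, this returns 1, which corresponds to the kernel storing 999.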
Examining the SASS output from the cuobjdump utility reveals that the assembler has moved the conditional increment of other_suit (offset 0xa8) to after the second part of the test (offset 0x98), but that second part overwrites the controlling predicate P0. As a result, other_suit is erroneously incremented when the card is a club, and the kernel returns the result “non-canonical”.
.version 4.1
.target sm_50
.entry BadPredicate(
.param.b32 result){
.local.u32 card_suit[4];
.reg.u32 i,base,ptr,offset,suit,other_suit,output;
.reg.pred p;
st.local.u32 [card_suit+0*4],0;
st.local.u32 [card_suit+1*4],1;
st.local.u32 [card_suit+2*4],0;
st.local.u32 [card_suit+3*4],2;
mov.u32 other_suit,1;
mov.u32 base,card_suit;
mov.u32 i,0;
next_i:
shl.b32 offset,i,2;
add.u32 ptr,base,offset;
ld.local.u32 suit,[ptr];
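setp.ne.u32 p,suit,other_suit; //p = card is not the next expected non-club suit
@!p add.u32 other_suit,other_suit,1; //matched: expect the following suit next time
@p setp.ne.u32 p,suit,0; //otherwise p = card is not a club
@p bra non_canonical; //neither the expected suit nor a club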
add.u32 i,i,1;
setp.lo.u32 p,i,4;
@p bra next_i;
ld.param.u32 output,[result];
st.global.u32 [output],999; //signifying canonical
ret;
non_canonical:
ld.param.u32 output,[result];
st.global.u32 [output],888; //signifying non-canonical
ret;
}
code for sm_50
Function : BadPredicate
.headerflags @"EF_CUDA_SM50 EF_CUDA_PTX_SM(EF_CUDA_SM50)"
/* 0x001fc800fe0007e6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ IADD R1, R1, 0x10.NEG; /* 0x3811000001070101 */
/*0018*/ PBK 0x130; /* 0xe2a0000011000000 */
/* 0x001fc4001c2007f0 */
/*0028*/ MOV32I R4, 0x1; /* 0x010000000017f004 */
/*0030*/ STL [R1], RZ; /* 0xef540000000701ff */
/*0038*/ MOV32I R3, 0x2; /* 0x010000000027f003 */
/* 0x001fc0005e2001f1 */
/*0048*/ STL [R1+0x8], RZ; /* 0xef540000008701ff */
/*0050*/ STL [R1+0x4], R4; /* 0xef54000000470104 */
/*0058*/ MOV32I R0, 0x1; /* 0x010000000017f000 */
/* 0x001f880ffcc003fd */
/*0068*/ STL [R1+0xc], R3; /* 0xef54000000c70103 */
/*0070*/ MOV R2, RZ; /* 0x5c9807800ff70002 */
/*0078*/ ISCADD R3, R2, R1, 0x2; /* 0x5c18010000170203 */
/* 0x001fb420fda007a2 */
/*0088*/ LDL R3, [R3]; /* 0xef44000000070303 */
/*0090*/ ISETP.NE.U32.AND P0, PT, R3, R0, PT; /* 0x5b6a038000070307 */
/*0098*/ @P0 ISETP.NE.U32.AND P0, PT, R3, RZ, PT; /* 0x5b6a03800ff00307 */
/* 0x001f9800ffa007f0 */
/*00a8*/ @!P0 IADD R0, R0, 0x1; /* 0x3810000000180000 */
/*00b0*/ @P0 BRA 0x108; /* 0xe24000000500000f */
/*00b8*/ IADD R2, R2, 0x1; /* 0x3810000000170202 */
/* 0x001fc400ffa007ed */
/*00c8*/ ISETP.LT.U32.AND P0, PT, R2, 0x4, PT; /* 0x3662038000470207 */
/*00d0*/ @P0 BRA 0x78; /* 0xe2400ffffa00000f */
/*00d8*/ MOV R0, c[0x0][0x140]; /* 0x4c98078005070000 */
/* 0x003ff4001fa007e2 */
/*00e8*/ MOV32I R2, 0x3e7; /* 0x010000003e77f002 */
/*00f0*/ STG [R0], R2; /* 0xeedc000000070002 */
/*00f8*/ BRK; /* 0xe34000000007000f */
/* 0x0003f400fc4007f1 */
/*0108*/ MOV R0, c[0x0][0x140]; /* 0x4c98078005070000 */
/*0110*/ MOV32I R2, 0x378; /* 0x010000003787f002 */
/*0118*/ STG [R0], R2; /* 0xeedc000000070002 */
/* 0x001ffc00ffe00ffd */
/*0128*/ BRK; /* 0xe34000000007000f */
/*0130*/ EXIT; /* 0xe30000000007000f */
/*0138*/ BRA 0x138; /* 0xe2400fffff87000f */
.............................
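To make the difference concrete, here is a small host-side C model of the loop body in both orders. It is only a sketch of how I read the two listings, executed strictly in sequence (it ignores the scheduling control words in the SASS); the function names and the enum are invented for the illustration.

```c
#include <stdbool.h>

enum { CANONICAL = 999, NON_CANONICAL = 888 };

/* Loop body in the order written in the PTX source. */
static int check_ptx_order(const unsigned card_suit[4])
{
    unsigned other_suit = 1;
    for (unsigned i = 0; i < 4; i++) {
        unsigned suit = card_suit[i];
        bool p = (suit != other_suit);   /* setp.ne.u32 p,suit,other_suit */
        if (!p) other_suit++;            /* @!p add.u32 other_suit,other_suit,1 */
        if (p) p = (suit != 0);          /* @p setp.ne.u32 p,suit,0 */
        if (p) return NON_CANONICAL;     /* @p bra non_canonical */
    }
    return CANONICAL;
}

/* Loop body in the order found in the SASS listing (0x90 to 0xb0). */
static int check_sass_order(const unsigned card_suit[4])
{
    unsigned r0 = 1;                     /* other_suit */
    for (unsigned r2 = 0; r2 < 4; r2++) {
        unsigned r3 = card_suit[r2];
        bool p0 = (r3 != r0);            /* 0x90: ISETP.NE P0, R3, R0 */
        if (p0) p0 = (r3 != 0);          /* 0x98: @P0 ISETP.NE P0, R3, RZ */
        if (!p0) r0++;                   /* 0xa8: @!P0 IADD R0, R0, 0x1 */
        if (p0) return NON_CANONICAL;    /* 0xb0: @P0 BRA 0x108 */
    }
    return CANONICAL;
}
```

For the test pattern {0, 1, 0, 2}, check_ptx_order returns 999, while check_sass_order returns 888: the leading club clears P0 at 0x98, so R0 is bumped to 2 at 0xa8 and the diamond that follows no longer matches.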
I have three questions:
a) Can this really be a bug? It seems so blatant I am beginning to doubt my own sanity.
b) If so, which PTX constructs should I avoid using?
c) Where can I find the list of bugs and bug fixes for each version of the CUDA toolkit?
TIA Dar