Is PTXAS mishandling predicates?

I have been working with CUDA for several years. Generally I code and test my programs in C on the CPU, then hand-compile them into PTX and retest the kernel on the GPU, first single-threaded and then multi-threaded, using the CUDA driver API. For compatibility reasons I am still on CUDA toolkit version 6.5.

Occasionally I find strange things happening: the kernel fails (error return from cuMemcpyDtoH) or produces an incorrect result. This might happen on the 15th sub-iteration of the 154307th iteration of the main loop, so it is time-consuming to isolate the cause. Then, when I add a couple of PTX statements to get more information back to the host, the problem disappears.

Recently, though, I have encountered an instance of abnormal behaviour which is reproducible in a small kernel, presented below. I have left in context from the original problem rather than reducing it to meaningless variable names such as a, b, c.

I am exhaustively iterating combinations of four playing cards. I filter these combinations down to a subset based on their suit patterns. The motivation: suppose I have processed the combination 2c 5c 8c Ac interacting with other playing cards. Later on when I meet the combination 2h 5h 8h Ah I can transpose the suits so that I arrive at the first combination (i.e. swap clubs for hearts), transpose similarly the other cards, and avoid repeating the expensive computation. This is valid as long as the suits are equivalent to one another, as in no-trump play at Contract Bridge (but not bidding).

The sub-case that the toy kernel deals with occurs when it has already been established that of the four cards two are clubs and the other two are of different suits. Then for the card combination to be in “canonical form” the first non-club encountered when traversing the array of four cards must be a diamond and the second a heart. The test pattern is club-diamond-club-heart which is in canonical form. In the PTX the suits are encoded clubs=0, diamonds=1, hearts=2, spades=3. The algorithm used is to set variable other_suit to diamonds, then index through the array. The first card must be a club or a diamond. If the latter, increment other_suit to hearts and so on.
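
For reference, the check described above can be sketched in C as follows. This is a minimal sketch of the intended logic only; the function name is_canonical and the enum names are made up for illustration and do not appear in the kernel.

```c
#include <stddef.h>

/* Suit encoding used in the PTX: clubs=0, diamonds=1, hearts=2, spades=3. */
enum { CLUBS = 0, DIAMONDS = 1, HEARTS = 2, SPADES = 3 };

/* Returns 1 if the suit sequence is in canonical form, 0 otherwise.
 * (Hypothetical helper name, for illustration only.) */
int is_canonical(const unsigned *card_suit, size_t n)
{
    unsigned other_suit = DIAMONDS;      /* next non-club suit we expect to meet */
    for (size_t i = 0; i < n; i++) {
        unsigned suit = card_suit[i];
        if (suit == other_suit)
            other_suit++;                /* expected suit seen: move on to the next one */
        else if (suit != CLUBS)
            return 0;                    /* neither a club nor the expected suit */
    }
    return 1;
}
```

With the test pattern club-diamond-club-heart this returns 1; swapping the diamond and the heart makes it return 0.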

Examining the SASS output from the cuobjdump utility reveals that the assembler has moved the conditional increment of other_suit (offset a8 hex) to after the second part of the test (offset 98 hex), but that second part overwrites the controlling predicate P0. As a result, other_suit is erroneously incremented when the card is a club, and the kernel returns the result “non-canonical”.
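
To make the effect of the reordering concrete, here is a small C model of the two instruction orders. It is only a sketch: the function names are made up, the predicate is modelled as an int, and the offsets in the comments refer to the toolkit 6.5 SASS listing in this post.

```c
/* Order as written in the PTX source. Returns 999 (canonical) or 888. */
int canonical_as_written(const unsigned card_suit[4])
{
    unsigned other_suit = 1;
    for (unsigned i = 0; i < 4; i++) {
        unsigned suit = card_suit[i];
        int p = (suit != other_suit);    /* setp.ne.u32 p,suit,other_suit */
        if (!p) other_suit += 1;         /* @!p add.u32 other_suit,other_suit,1 */
        if (p)  p = (suit != 0);         /* @p setp.ne.u32 p,suit,0 */
        if (p)  return 888;              /* @p bra non_canonical */
    }
    return 999;
}

/* Order as scheduled in the toolkit 6.5 SASS: the increment now sees
 * the predicate after the second test has overwritten it. */
int canonical_as_scheduled(const unsigned card_suit[4])
{
    unsigned other_suit = 1;
    for (unsigned i = 0; i < 4; i++) {
        unsigned suit = card_suit[i];
        int p0 = (suit != other_suit);   /* 0x90: ISETP.NE P0, R3, R0 */
        if (p0)  p0 = (suit != 0);       /* 0x98: @P0 ISETP.NE P0, R3, RZ */
        if (!p0) other_suit += 1;        /* 0xa8: @!P0 IADD R0, R0, 0x1 */
        if (p0)  return 888;             /* 0xb0: @P0 BRA */
    }
    return 999;
}
```

For the test pattern {0, 1, 0, 2} the first function returns 999. In the second, the leading club clears p0, other_suit is bumped to hearts, and the diamond that follows is then rejected, so it returns 888.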

.version 4.1
.target sm_50
.entry BadPredicate(
.param.b32 result){
.local.u32 card_suit[4];
.reg.u32 i,base,ptr,offset,suit,other_suit,output;
.reg.pred p;
st.local.u32 [card_suit+0*4],0;
st.local.u32 [card_suit+1*4],1;
st.local.u32 [card_suit+2*4],0;
st.local.u32 [card_suit+3*4],2;
mov.u32 other_suit,1;
mov.u32 base,card_suit;
mov.u32 i,0;
next_i:
shl.b32 offset,i,2;
add.u32 ptr,base,offset;
ld.local.u32 suit,[ptr];
setp.ne.u32 p,suit,other_suit;
@!p add.u32 other_suit,other_suit,1;
@p setp.ne.u32 p,suit,0;
@p bra non_canonical;
add.u32 i,i,1;
setp.lo.u32 p,i,4;
@p bra next_i;
ld.param.u32 output,[result]; 
st.global.u32 [output],999; //signifying canonical
ret;
non_canonical:
ld.param.u32 output,[result];
st.global.u32 [output],888; //signifying non-canonical
ret;
}
	code for sm_50
		Function : BadPredicate
	.headerflags    @"EF_CUDA_SM50 EF_CUDA_PTX_SM(EF_CUDA_SM50)"
                                                                                                 /* 0x001fc800fe0007e6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                     /* 0x4c98078000870001 */
        /*0010*/                   IADD R1, R1, 0x10.NEG;                    /* 0x3811000001070101 */
        /*0018*/                   PBK 0x130;                                /* 0xe2a0000011000000 */
                                                                             /* 0x001fc4001c2007f0 */
        /*0028*/                   MOV32I R4, 0x1;                           /* 0x010000000017f004 */
        /*0030*/                   STL [R1], RZ;                             /* 0xef540000000701ff */
        /*0038*/                   MOV32I R3, 0x2;                           /* 0x010000000027f003 */
                                                                             /* 0x001fc0005e2001f1 */
        /*0048*/                   STL [R1+0x8], RZ;                         /* 0xef540000008701ff */
        /*0050*/                   STL [R1+0x4], R4;                         /* 0xef54000000470104 */
        /*0058*/                   MOV32I R0, 0x1;                           /* 0x010000000017f000 */
                                                                             /* 0x001f880ffcc003fd */
        /*0068*/                   STL [R1+0xc], R3;                         /* 0xef54000000c70103 */
        /*0070*/                   MOV R2, RZ;                               /* 0x5c9807800ff70002 */
        /*0078*/                   ISCADD R3, R2, R1, 0x2;                   /* 0x5c18010000170203 */
                                                                             /* 0x001fb420fda007a2 */
        /*0088*/                   LDL R3, [R3];                             /* 0xef44000000070303 */
        /*0090*/                   ISETP.NE.U32.AND P0, PT, R3, R0, PT;      /* 0x5b6a038000070307 */
        /*0098*/               @P0 ISETP.NE.U32.AND P0, PT, R3, RZ, PT;      /* 0x5b6a03800ff00307 */
                                                                             /* 0x001f9800ffa007f0 */
        /*00a8*/              @!P0 IADD R0, R0, 0x1;                         /* 0x3810000000180000 */
        /*00b0*/               @P0 BRA 0x108;                                /* 0xe24000000500000f */
        /*00b8*/                   IADD R2, R2, 0x1;                         /* 0x3810000000170202 */
                                                                             /* 0x001fc400ffa007ed */
        /*00c8*/                   ISETP.LT.U32.AND P0, PT, R2, 0x4, PT;     /* 0x3662038000470207 */
        /*00d0*/               @P0 BRA 0x78;                                 /* 0xe2400ffffa00000f */
        /*00d8*/                   MOV R0, c[0x0][0x140];                    /* 0x4c98078005070000 */
                                                                             /* 0x003ff4001fa007e2 */
        /*00e8*/                   MOV32I R2, 0x3e7;                         /* 0x010000003e77f002 */
        /*00f0*/                   STG [R0], R2;                             /* 0xeedc000000070002 */
        /*00f8*/                   BRK;                                      /* 0xe34000000007000f */
                                                                             /* 0x0003f400fc4007f1 */
        /*0108*/                   MOV R0, c[0x0][0x140];                    /* 0x4c98078005070000 */
        /*0110*/                   MOV32I R2, 0x378;                         /* 0x010000003787f002 */
        /*0118*/                   STG [R0], R2;                             /* 0xeedc000000070002 */
                                                                             /* 0x001ffc00ffe00ffd */
        /*0128*/                   BRK;                                      /* 0xe34000000007000f */
        /*0130*/                   EXIT;                                     /* 0xe30000000007000f */
        /*0138*/                   BRA 0x138;                                /* 0xe2400fffff87000f */
		.............................

I have three questions:

a) Can this really be a bug? It seems so blatant I am beginning to doubt my own sanity.

b) If so, which PTX constructs should I avoid using?

c) Where can I find the list of bugs and bug fixes for each version of the CUDA toolkit?

TIA Dar

a) I don’t know that it is a bug, because I haven’t studied your code for long enough, and I would generally prefer to have a complete test case. But bugs are always possible.

b) I have no idea. That isn’t normally the way I try to address such a problem.

c) There is no such animal that I am aware of. However, each CUDA toolkit has release notes which contain some info, but the archived notes only go back to CUDA 8.0.

When I compile your PTX code with CUDA 12.0, I get the following code:

$ cuobjdump -sass t27.cubin

        code for sm_50
                Function : BadPredicate
        .headerflags    @"EF_CUDA_TEXMODE_UNIFIED EF_CUDA_SM50 EF_CUDA_VIRTUAL_SM(EF_CUDA_SM50)"
                                                                                /* 0x001fd800fe2007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                       /* 0x4c98078000870001 */
        /*0010*/                   MOV32I R0, 0x1 ;                             /* 0x010000000017f000 */
        /*0018*/                   MOV R3, RZ ;                                 /* 0x5c9807800ff70003 */
                                                                                /* 0x081fc440fe2007f6 */
        /*0028*/                   SHL R4, R3, 0x2 ;                            /* 0x3848000000270304 */
        /*0030*/                   ISETP.EQ.AND P2, PT, R4.reuse, RZ, PT ;      /* 0x5b6503800ff70417 */
        /*0038*/                   ISETP.EQ.AND P1, PT, R4.reuse, 0x4, PT ;     /* 0x366503800047040f */
                                                                                /* 0x001f8400fe2207fb */
        /*0048*/                   ISETP.EQ.AND P0, PT, R4.reuse, 0x8, PT ;     /* 0x3665038000870407 */
        /*0050*/               @P2 MOV R2, RZ ;                                 /* 0x5c9807800ff20002 */
        /*0058*/                   ISETP.EQ.AND P2, PT, R4, 0xc, PT ;           /* 0x3665038000c70417 */
                                                                                /* 0x001fd800ff6007f1 */
        /*0068*/               @P1 MOV32I R2, 0x1 ;                             /* 0x010000000011f002 */
        /*0070*/               @P0 MOV R2, RZ ;                                 /* 0x5c9807800ff00002 */
        /*0078*/               @P2 MOV32I R2, 0x2 ;                             /* 0x010000000022f002 */
                                                                                /* 0x001fc000fda207f1 */
        /*0088*/                   ISETP.NE.U32.AND P0, PT, R2.reuse, R0, PT ;  /* 0x5b6a038000070207 */
        /*0090*/                   ISET.NE.U32.AND RZ.CC, R2, RZ, PT ;          /* 0x5b5a83800ff702ff */
        /*0098*/              @!P0 IADD32I R0, R0, 0x1 ;                        /* 0x1c00000000180000 */
                                                                                /* 0x001fb400fec007fd */
        /*00a8*/               @P0 BRA CC.NEU, 0xf8 ;                           /* 0xe24000000480000d */
        /*00b0*/                   IADD32I R3, R3, 0x1 ;                        /* 0x1c00000000170303 */
        /*00b8*/                   ISETP.LT.U32.AND P0, PT, R3, 0x4, PT ;       /* 0x3662038000470307 */
                                                                                /* 0x001fc800fe2007fd */
        /*00c8*/               @P0 BRA 0x20 ;                                   /* 0xe2400ffff500000f */
        /*00d0*/                   MOV R0, c[0x0][0x140] ;                      /* 0x4c98078005070000 */
        /*00d8*/                   MOV32I R2, 0x3e7 ;                           /* 0x010000003e77f002 */
                                                                                /* 0x001fc400ffe007f1 */
        /*00e8*/                   STG [R0], R2 ;                               /* 0xeedc000000070002 */
        /*00f0*/                   EXIT ;                                       /* 0xe30000000007000f */
        /*00f8*/                   MOV R2, c[0x0][0x140] ;                      /* 0x4c98078005070002 */
                                                                                /* 0x001fbc00fe2007f2 */
        /*0108*/                   MOV32I R0, 0x378 ;                           /* 0x010000003787f000 */
        /*0110*/                   STG [R2], R0 ;                               /* 0xeedc000000070200 */
        /*0118*/                   NOP ;                                        /* 0x50b0000000070f00 */
                                                                                /* 0x001ffc00fde007ef */
        /*0128*/                   NOP ;                                        /* 0x50b0000000070f00 */
        /*0130*/                   NOP ;                                        /* 0x50b0000000070f00 */
        /*0138*/                   EXIT ;                                       /* 0xe30000000007000f */
                                                                                /* 0x001f8000fc0007ff */
        /*0148*/                   BRA 0x140 ;                                  /* 0xe2400fffff07000f */
        /*0150*/                   NOP;                                         /* 0x50b0000000070f00 */
        /*0158*/                   NOP;                                         /* 0x50b0000000070f00 */
                                                                                /* 0x001f8000fc0007e0 */
        /*0168*/                   NOP;                                         /* 0x50b0000000070f00 */
        /*0170*/                   NOP;                                         /* 0x50b0000000070f00 */
        /*0178*/                   NOP;                                         /* 0x50b0000000070f00 */
                ..........

It looks to me like the generated code is different in the area of concern. The P0 predicate is set equivalently at 0x88, and is not modified before the conditional increment of other_suit at 0x98; the second comparison, at 0x90, writes the condition code (RZ.CC) instead of P0.

If you think that would fix things, then my suggestion would be to advance to a newer CUDA toolkit. Testing on the latest CUDA toolkit is always a good idea when a bug is suspected. Bugs are always possible, and bugs get fixed frequently.

Thank you Mr Crovella for spending time on my problem.

As far as I can see the object code that you presented from the latest CUDA toolkit will work (and the optimiser uses the known values in the card_suit array to avoid local memory accesses).

I am awaiting a hardware upgrade before I move to a later CUDA version. Until then I will avoid setting a predicate in a statement that is guarded by that same predicate.
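
As a rough illustration of that rule, the C model below keeps the first test in p and puts the combined test in a second predicate q, so that nothing guarded by p ever writes p. The names and the test_first switch are invented for the sketch. In PTX the combined test might be written as setp.ne.and.u32 q,suit,0,p; but I have not verified what toolkit 6.5 generates for that form.

```c
/* Two-predicate variant: p is never overwritten, q carries the
 * "not the expected suit and not a club" result.
 * test_first selects whether the second test or the increment runs
 * first, to show that the order no longer matters.
 * Returns 999 (canonical) or 888 (non-canonical). */
int canonical_two_predicates(const unsigned card_suit[4], int test_first)
{
    unsigned other_suit = 1;
    for (unsigned i = 0; i < 4; i++) {
        unsigned suit = card_suit[i];
        int p = (suit != other_suit);    /* first test, left intact afterwards */
        int q;
        if (test_first) {
            q = p && (suit != 0);        /* second test into its own predicate */
            if (!p) other_suit += 1;     /* increment still guarded by p */
        } else {
            if (!p) other_suit += 1;
            q = p && (suit != 0);
        }
        if (q) return 888;
    }
    return 999;
}
```

Either order gives 999 for club-diamond-club-heart and 888 for club-heart-club-diamond, because the increment and the second test no longer share a destination.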

Actually there is a separate problem that I encountered when using complex predicate expressions, one that I cannot pin down. Accordingly I will keep the PTX code as simple as might be emitted from the front end of NVCC.