CUDA 7: Nsight Warning: "Stepper Failed: Trying to step invalid warp"


I am trying to port my application from CUDA 6.5 to CUDA 7, but unfortunately the resulting behaviour changes in a way I cannot explain. First, some info about my setup:

  • GTX750ti (Maxwell GM107)
  • Nsight with MSVC 2012
  • CUDA 7
  • x64 (the behaviour does not change if i386 is chosen)
  • sm50

The concrete phenomenon I observe is the following: in debug mode the program runs fine; in release mode the kernel does not terminate (i.e. it stays in an endless loop). After some analysis I stripped it down to the following SASS lines:

        /*1188*/              @!P0 LDS.U.32 R0, [R6];
        /*1190*/              @!P0 IADD32I R0, R0, 0x1;
        /*1198*/              @!P0 STS [R6], R0;
        /*11a8*/                   LDS.U.32 R2, [R6];
        /*11b0*/                   ISETP.GT.U32.AND P1, PT, R2, 0x65, PT;
        /*11b8*/               @P1 EXIT;
        /*11c8*/                   ISETP.GT.U32.AND P1, PT, R14, 0x2, PT;
        /*11d0*/               @P1 BRA (.L_79);
        /*11d8*/                   SYNC 
        /*11e8*/         {         MOV32I R2, 32@lo(_ZN4CUDA20NumericalIntegration16_QuadraturModuleE);
        /*11f0*/                   LDS.U.32 R0, [R6];        }
        /*11f8*/                   MOV32I R3, 32@hi(_ZN4CUDA20NumericalIntegration16_QuadraturModuleE);

What you can see is a compiled loop. So what happens: I start with a single 32-thread warp, all threads active. R0 is loaded with a value from shared memory (which is 0 at the beginning). This value is incremented by the first thread and afterwards checked against 0x65 by all threads, which is the end-of-loop condition. The second ISETP is used to determine whether the thread ID is higher than 2 – in this case the threads should be set to inactive (branch to loop start).
P1 is true for the first 3 threads (0, 1, 2), but after executing line 9 of the SASS listing I get the following warning from Nsight: "Stepper Failed: Trying to step invalid warp". After that, threads 0…2 are set to inactive and never resume. The result is an endless loop (as the counter is not incremented anymore).
When using a GTX650 (sm30) the code works as expected. I have already tried updating the driver to the current version, but this did not make any difference.

Any help is kindly appreciated.

The code appears to be calling __syncthreads() at a point of thread divergence, which is not allowed in the current programming model. Can you post the CUDA C source code?
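To illustrate the rule being referred to, here is a minimal hedged sketch (hypothetical kernels, not the poster's code): __syncthreads() is a block-wide barrier, so placing it inside a branch that only some threads of the block take is undefined behaviour; moving the barrier outside the divergent branch is the usual fix.

    // Hypothetical kernels illustrating the divergent-barrier rule.

    __global__ void broken_barrier()
    {
        if (threadIdx.x < 3) {
            // Only threads 0..2 reach this barrier; the rest of the
            // block never arrives -> undefined behaviour (often a hang).
            __syncthreads();
        }
    }

    __global__ void uniform_barrier()
    {
        if (threadIdx.x < 3) {
            // divergent work is fine, as long as no barrier is inside
        }
        // Every thread of the block reaches this barrier.
        __syncthreads();
    }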

Please find the corresponding code below.

    #define LEADING_THREAD  (threadIdx.x == 0)
    #define BI_WARP_ID      threadIdx.y

    __shared__ volatile unsigned int jobID[INTEGRALS_PER_BLOCK];

    jobID[BI_WARP_ID] += 1;

    if (jobID[BI_WARP_ID] >= 101)
        if (threadIdx.x < 3)
            _QuadraturModule.WriteIntegralValues(jobID[BI_WARP_ID], 1);

This is pretty much it – please note this is a totally stripped debug version without any reasonable functionality, just there to isolate the error. In the called function WriteIntegralValues(), values (i.e. 1) are written to global memory. If you need any further info, please let me know.
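Since the surrounding loop is not shown, here is a hedged sketch of one way the fragment could be restructured so that any barrier is executed uniformly by the whole block; the while-loop shape and the LEADING_THREAD guard on the increment are assumptions, only the identifiers come from the stripped fragment above.

    // Hypothetical reconstruction of the surrounding loop (structure assumed).
    __shared__ volatile unsigned int jobID[INTEGRALS_PER_BLOCK];

    while (true) {
        if (LEADING_THREAD)
            jobID[BI_WARP_ID] += 1;   // single writer avoids the shared-memory race

        __syncthreads();              // reached by ALL threads, outside any branch

        if (jobID[BI_WARP_ID] >= 101)
            break;                    // uniform exit: every thread leaves together

        if (threadIdx.x < 3)
            _QuadraturModule.WriteIntegralValues(jobID[BI_WARP_ID], 1);

        __syncthreads();              // again uniform before the next increment
    }

With this shape no thread sits in a divergent branch when a barrier executes, which is the condition the reply above says the original code violates.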

Thanks for your help!