$R2 value changed to lane ID after __threadfence_system()

$R2 is used for high address(most significant 32 bits) of local memory pointer.
I’ve check whole function, that $R2 is not modified in whole function.

Encountered this issue by hitting random “CUDA Exception: Warp Illegal Address”
Narrowed it down to following code that can stably trigger the issue.

$R2 value being modified to lane-id after __threadfence_system(), causing first access to local memory variable hit illegal address.
As showed on following code, the exception always triggered on myfunc+0xadd8, which is the first access to local memory, whose address stores in R2/R36 and R2 changed to invalid value. NOTHING in the code modifies R2, and it was just accessed before calling __threadfence_system() with correct value(without error).

My question:

  • Is R2 anything SPECIAL ?
  • Is it a known issue, and is there a workaround to avoid this issue?

Please don’t hesitate to ask for more information if required.
Thank you for anyone who can help.

=================================C code================================

__device__ __noinline__
void myfunc(T1 *v1, T2 *v2) {
  volatile T3 * volatile ptr = NULL;
  ..... 
  while (long_running_condition) {
    do something...
    //My debug code:
    ++ptr;
    --ptr;
    __threadfence_system();
    ++ptr;
    --ptr;
  }
}

==================================PTX code=============================

.func __threadfence_system(    )
{
  membar.sys;
  ret;
}
///PATH/file.cu:556              ++ptr;
        .loc    1 556 3
        ld.volatile.u64         %rd199, [%SP+40];
        add.s64         %rd200, %rd199, 32;
        st.volatile.u64         [%SP+40], %rd200;

///PATH/file.cu:557              --ptr;
        .loc    1 557 3
        ld.volatile.u64         %rd201, [%SP+40];
        add.s64         %rd202, %rd201, -32;
        st.volatile.u64         [%SP+40], %rd202;

///PATH/file.cu:563              __threadfence_system();
        .loc    1 563 3
        // Callseq Start 31
        {
        .reg .b32 temp_param_reg;
        // <end>}
        call.uni 
        __threadfence_system, 
        (
        );
        
        //{
        }// Callseq End 31

///PATH/file.cu:566              ++ptr;
        .loc    1 566 3
        ld.volatile.u64         %rd203, [%SP+40];
        add.s64         %rd204, %rd203, 32;
        st.volatile.u64         [%SP+40], %rd204;

///PATH/file.cu:567              --ptr;
        .loc    1 567 3
        ld.volatile.u64         %rd205, [%SP+40];
        add.s64         %rd206, %rd205, -32;
        st.volatile.u64         [%SP+40], %rd206;

============================ Disassembled Code ==========================
// Generated by: nvdisasm -ndf -c -g -sf PATH/file.sm_61.cubin > PATH/file.sm_61.cubin.dump

//--------------------- .text.__threadfence_system --------------------------
        .section        .text.__threadfence_system,"ax",@progbits
        .sectioninfo    @"SHI_REGISTERS=2"
        .align  32

.text.__threadfence_system:
        .type           __threadfence_system,@function
        .size           __threadfence_system,(.L_1216 - __threadfence_system)
__threadfence_system:
        /*0008*/                   MEMBAR.SYS.IVALLD ;
        /*0010*/                   RET ;
.L_202: 
        /*0018*/                   BRA `(.L_202) ;
        /*0028*/                   NOP;
        /*0030*/                   NOP;
        /*0038*/                   NOP;
.L_1216:

//--------------------- .text.__MASKED_OUT_myfunc --------------------------
.section .text.__MASKED_OUT_myfunc,“ax”,@progbits
.sectioninfo @“SHI_REGISTERS=199”
.align 32

    .global         __MASKED_OUT_myfunc 
    .type           __MASKED_OUT_myfunc,@function
    .size           __MASKED_OUT_myfunc,(.L_425 - __MASKED_OUT_myfunc)

__MASKED_OUT_myfunc:
.text.__MASKED_OUT_myfunc:
//## File “/PATH/file.cu”, line 293
/0008/ IADD32I R1, R1, -0x200 ;
/0010/ S2R R0, SR_LMEMHIOFF ;
/0018/ ISETP.GE.U32.AND P0, PT, R1, R0, PT ;
/0028/ @ P0 BRA '(.L_43) ;
/0030/ BPT.TRAP 0x1 ;
.L_43:
/0038/ STL [R1+0x1fc], R198 ;
/0048/ STL [R1+0x1f8], R197 ;
/0050/ STL [R1+0x1f4], R196 ;
/0058/ STL [R1+0x1f0], R191 ;
/0068/ STL [R1+0x1ec], R190 ;
/0070/ STL [R1+0x1e8], R189 ;
/0078/ STL [R1+0x1e4], R188 ;

/0458/ STL [R1+0x70], R2 ;
/0468/ MOV R7, R7 ;
/0470/ MOV R6, R6 ;
/0478/ MOV R5, R5 ;
/0488/ MOV R4, R4 ;
/0490/ IADD R0, R1, RZ ;
/0498/ I2I.U32.U32 R0, R0 ;
/04a8/ MOV R8, R0 ;
/04b0/ MOV R9, RZ ;
/04b8/ MOV R0, R8 ;
/04c8/ MOV R9, R9 ;
/04d0/ MOV R8, R0 ;
/04d8/ MOV R9, R9 ;
/04e8/ MOV R0, c[0x0][0x4] ;
/04f0/ MOV R3, c[0x0][0x104] ;
/04f8/ IADD R0.CC, R8, R0 ;
/0508/ IADD.X R3, R9, R3 ;
/0510/ MOV R4, R4 ;
/0518/ MOV R5, R5 ;
/0528/ MOV R6, R6 ;
/0530/ MOV R7, R7 ;
/0538/ MOV R36, R0 ;
/0548/ MOV R2, R3 ;

//## File “/PATH/file.cu”, line 556
/ab78/ IADD32I R5.CC, R36, 0x28 ;
/ab88/ IADD.X R0, R2, RZ ;
/ab90/ MOV R5, R5 ;
/ab98/ MOV R0, R0 ;
/aba8/ LEA R4.CC, R5, RZ ;
/abb0/ LEA.HI.X P0, R5, R5, RZ, R0 ;
/abb8/ MOV R4, R4 ;
/abc8/ MOV R5, R5 ;
/abd0/ LD.E.CV.64 R4, [R4], P0 ;
/abd8/ MOV R6, R4 ;
/abe8/ MOV R7, R5 ;
/abf0/ IADD32I R6.CC, R6, 0x20 ;
/abf8/ IADD.X R7, R7, RZ ;
/ac08/ IADD32I R5.CC, R36, 0x28 ;
/ac10/ IADD.X R0, R2, RZ ;
/ac18/ MOV R5, R5 ;
/ac28/ MOV R0, R0 ;
/ac30/ LEA R4.CC, R5, RZ ;
/ac38/ LEA.HI.X P0, R5, R5, RZ, R0 ;
/ac48/ MOV R6, R6 ;
/ac50/ MOV R7, R7 ;
/ac58/ MOV R4, R4 ;
/ac68/ MOV R5, R5 ;
/ac70/ ST.E.WT.64 [R4], R6, P0 ;
//## File “/PATH/file.cu”, line 557
/ac78/ IADD32I R5.CC, R36, 0x28 ;
/ac88/ IADD.X R0, R2, RZ ;
/ac90/ MOV R5, R5 ;
/ac98/ MOV R0, R0 ;
/aca8/ LEA R4.CC, R5, RZ ;
/acb0/ LEA.HI.X P0, R5, R5, RZ, R0 ;
/acb8/ MOV R4, R4 ;
/acc8/ MOV R5, R5 ;
/acd0/ LD.E.CV.64 R4, [R4], P0 ;
/acd8/ MOV R6, R4 ;
/ace8/ MOV R7, R5 ;
/acf0/ IADD32I R6.CC, R6, -0x20 ;
/acf8/ IADD32I.X R7, R7, -0x1 ;
/ad08/ IADD32I R5.CC, R36, 0x28 ;
/ad10/ IADD.X R0, R2, RZ ;
/ad18/ MOV R5, R5 ;
/ad28/ MOV R0, R0 ;
/ad30/ LEA R4.CC, R5, RZ ;
/ad38/ LEA.HI.X P0, R5, R5, RZ, R0 ;
/ad48/ MOV R6, R6 ;
/ad50/ MOV R7, R7 ;
/ad58/ MOV R4, R4 ;
/ad68/ MOV R5, R5 ;
/ad70/ ST.E.WT.64 [R4], R6, P0 ;
//## File “/PATH/file.cu”, line 563
/ad78/ JCAL '(__threadfence_system) ;
//## File “/PATH/file.cu”, line 566
/ad88/ IADD32I R5.CC, R36, 0x28 ;
/ad90/ IADD.X R0, R2, RZ ;
/ad98/ MOV R5, R5 ;
/ada8/ MOV R0, R0 ;
/adb0/ LEA R4.CC, R5, RZ ;
/adb8/ LEA.HI.X P0, R5, R5, RZ, R0 ;
/adc8/ MOV R4, R4 ;
/add0/ MOV R5, R5 ;
/add8/ LD.E.CV.64 R4, [R4], P0 ;
/ade8/ MOV R6, R4 ;
/adf0/ MOV R7, R5 ;
/adf8/ IADD32I R6.CC, R6, 0x20 ;
/ae08/ IADD.X R7, R7, RZ ;
/ae10/ IADD32I R5.CC, R36, 0x28 ;
/ae18/ IADD.X R0, R2, RZ ;
/ae28/ MOV R5, R5 ;
/ae30/ MOV R0, R0 ;
/ae38/ LEA R4.CC, R5, RZ ;
/ae48/ LEA.HI.X P0, R5, R5, RZ, R0 ;
/ae50/ MOV R6, R6 ;
/ae58/ MOV R7, R7 ;
/ae68/ MOV R4, R4 ;
/ae70/ MOV R5, R5 ;
/ae78/ ST.E.WT.64 [R4], R6, P0 ;
//## File “/PATH/file.cu”, line 567
/ae88/ IADD32I R5.CC, R36, 0x28 ;
/ae90/ IADD.X R0, R2, RZ ;
/ae98/ MOV R5, R5 ;
/aea8/ MOV R0, R0 ;
/aeb0/ LEA R4.CC, R5, RZ ;
/aeb8/ LEA.HI.X P0, R5, R5, RZ, R0 ;
/aec8/ MOV R4, R4 ;
/aed0/ MOV R5, R5 ;
/aed8/ LD.E.CV.64 R4, [R4], P0 ;
/aee8/ MOV R6, R4 ;
/aef0/ MOV R7, R5 ;
/aef8/ IADD32I R6.CC, R6, -0x20 ;
/af08/ IADD32I.X R7, R7, -0x1 ;
/af10/ IADD32I R5.CC, R36, 0x28 ;
/af18/ IADD.X R0, R2, RZ ;
/af28/ MOV R5, R5 ;
/af30/ MOV R0, R0 ;
/af38/ LEA R4.CC, R5, RZ ;
/af48/ LEA.HI.X P0, R5, R5, RZ, R0 ;
/af50/ MOV R6, R6 ;
/af58/ MOV R7, R7 ;
/af68/ MOV R4, R4 ;
/af70/ MOV R5, R5 ;
/af78/ ST.E.WT.64 [R4], R6, P0 ;

==================== cuda-gdb ==================

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x55555652fff8 (file.cu:566)

Thread 1 “PROGRAM” received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (0,0,0), device 0, sm 0, warp 3, lane 0]
0x0000555556530008 in myfunc (v1=0x7fff91c21000, v2=0x7fffce30a000) at /PATH/file.cu:566
566 ++ptr;
(cuda-gdb) output myfunc
{void (T1 * @generic, T2 * @generic)} 0x555556525220 <myfunc(T1*, T2*)>(cuda-gdb) p/x (0x55555652fff8-0x555556525220)
$1 = 0xadd8
(cuda-gdb) output &ptr
(volatile @generic T3 * volatile @local *) 0xfffb58
(cuda-gdb) p/x $R2
$2 = 0x0
(cuda-gdb) p/x $R36
$3 = 0xf3fffb30
(cuda-gdb) set $lane=0
(cuda-gdb) while ($lane < 32)

   cuda lane $lane
   p/x $R2
   set $lane=$lane+1

end
CUDA focus unchanged.
$68 = 0x0
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (1,0,0), device 0, sm 0, warp 3, lane 1]
0x0000555556530008 566 ++ptr;
$69 = 0x1
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (2,0,0), device 0, sm 0, warp 3, lane 2]
0x0000555556530008 566 ++ptr;
$70 = 0x2
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (3,0,0), device 0, sm 0, warp 3, lane 3]
0x0000555556530008 566 ++ptr;
$71 = 0x3
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (4,0,0), device 0, sm 0, warp 3, lane 4]
0x0000555556530008 566 ++ptr;
$72 = 0x4
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (5,0,0), device 0, sm 0, warp 3, lane 5]
0x0000555556530008 566 ++ptr;
$73 = 0x5
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (6,0,0), device 0, sm 0, warp 3, lane 6]
0x0000555556530008 566 ++ptr;
$74 = 0x6
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (7,0,0), device 0, sm 0, warp 3, lane 7]
0x0000555556530008 566 ++ptr;
$75 = 0x7
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (8,0,0), device 0, sm 0, warp 3, lane 8]
0x0000555556530008 566 ++ptr;
$76 = 0x8
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (9,0,0), device 0, sm 0, warp 3, lane 9]
0x0000555556530008 566 ++ptr;
$77 = 0x9
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (10,0,0), device 0, sm 0, warp 3, lane 10]
0x0000555556530008 566 ++ptr;
$78 = 0xa
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (11,0,0), device 0, sm 0, warp 3, lane 11]
0x0000555556530008 566 ++ptr;
$79 = 0xb
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (12,0,0), device 0, sm 0, warp 3, lane 12]
0x0000555556530008 566 ++ptr;
$80 = 0xc
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (13,0,0), device 0, sm 0, warp 3, lane 13]
0x0000555556530008 566 ++ptr;
$81 = 0xd
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (14,0,0), device 0, sm 0, warp 3, lane 14]
0x0000555556530008 566 ++ptr;
$82 = 0xe
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (15,0,0), device 0, sm 0, warp 3, lane 15]
0x0000555556530008 566 ++ptr;
$83 = 0xf
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (16,0,0), device 0, sm 0, warp 3, lane 16]
0x0000555556530008 566 ++ptr;
$84 = 0x10
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (17,0,0), device 0, sm 0, warp 3, lane 17]
0x0000555556530008 566 ++ptr;
$85 = 0x11
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (18,0,0), device 0, sm 0, warp 3, lane 18]
0x0000555556530008 566 ++ptr;
$86 = 0x12
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (19,0,0), device 0, sm 0, warp 3, lane 19]
0x0000555556530008 566 ++ptr;
$87 = 0x13
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (20,0,0), device 0, sm 0, warp 3, lane 20]
0x0000555556530008 566 ++ptr;
$88 = 0x14
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (21,0,0), device 0, sm 0, warp 3, lane 21]
0x0000555556530008 566 ++ptr;
$89 = 0x15
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (22,0,0), device 0, sm 0, warp 3, lane 22]
0x0000555556530008 566 ++ptr;
$90 = 0x16
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (23,0,0), device 0, sm 0, warp 3, lane 23]
0x0000555556530008 566 ++ptr;
$91 = 0x17
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (24,0,0), device 0, sm 0, warp 3, lane 24]
0x0000555556530008 566 ++ptr;
$92 = 0x18
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (25,0,0), device 0, sm 0, warp 3, lane 25]
0x0000555556530008 566 ++ptr;
$93 = 0x19
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (26,0,0), device 0, sm 0, warp 3, lane 26]
0x0000555556530008 566 ++ptr;
$94 = 0x1a
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (27,0,0), device 0, sm 0, warp 3, lane 27]
0x0000555556530008 566 ++ptr;
$95 = 0x1b
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (28,0,0), device 0, sm 0, warp 3, lane 28]
0x0000555556530008 566 ++ptr;
$96 = 0x1c
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (29,0,0), device 0, sm 0, warp 3, lane 29]
0x0000555556530008 566 ++ptr;
$97 = 0x1d
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (30,0,0), device 0, sm 0, warp 3, lane 30]
0x0000555556530008 566 ++ptr;
$98 = 0x1e
[Switching focus to CUDA kernel 0, grid 2, block (8,0,0), thread (31,0,0), device 0, sm 0, warp 3, lane 31]
0x0000555556530008 566 ++ptr;
$99 = 0x1f

/// CUDA Block that does not hit this issue yet:
(cuda-gdb) cuda block 14
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (31,0,0), device 0, sm 6, warp 2, lane 31]
0x000055555652fef8 557 --ptr;
(cuda-gdb) set $lane=0
(cuda-gdb) while ($lane < 32)

   cuda lane $lane 
   p/x $R2
   set $lane=$lane+1

end
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (0,0,0), device 0, sm 6, warp 2, lane 0]
0x000055555652fef8 557 --ptr;
$100 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (1,0,0), device 0, sm 6, warp 2, lane 1]
0x000055555652fef8 557 --ptr;
$101 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (2,0,0), device 0, sm 6, warp 2, lane 2]
0x000055555652fef8 557 --ptr;
$102 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (3,0,0), device 0, sm 6, warp 2, lane 3]
0x000055555652fef8 557 --ptr;
$103 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (4,0,0), device 0, sm 6, warp 2, lane 4]
0x000055555652fef8 557 --ptr;
$104 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (5,0,0), device 0, sm 6, warp 2, lane 5]
0x000055555652fef8 557 --ptr;
$105 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (6,0,0), device 0, sm 6, warp 2, lane 6]
0x000055555652fef8 557 --ptr;
$106 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (7,0,0), device 0, sm 6, warp 2, lane 7]
0x000055555652fef8 557 --ptr;
$107 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (8,0,0), device 0, sm 6, warp 2, lane 8]
0x000055555652fef8 557 --ptr;
$108 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (9,0,0), device 0, sm 6, warp 2, lane 9]
0x000055555652fef8 557 --ptr;
$109 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (10,0,0), device 0, sm 6, warp 2, lane 10]
0x000055555652fef8 557 --ptr;
$110 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (11,0,0), device 0, sm 6, warp 2, lane 11]
0x000055555652fef8 557 --ptr;
$111 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (12,0,0), device 0, sm 6, warp 2, lane 12]
0x000055555652fef8 557 --ptr;
$112 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (13,0,0), device 0, sm 6, warp 2, lane 13]
0x000055555652fef8 557 --ptr;
$113 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (14,0,0), device 0, sm 6, warp 2, lane 14]
0x000055555652fef8 557 --ptr;
$114 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (15,0,0), device 0, sm 6, warp 2, lane 15]
0x000055555652fef8 557 --ptr;
$115 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (16,0,0), device 0, sm 6, warp 2, lane 16]
0x000055555652fef8 557 --ptr;
$116 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (17,0,0), device 0, sm 6, warp 2, lane 17]
0x000055555652fef8 557 --ptr;
$117 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (18,0,0), device 0, sm 6, warp 2, lane 18]
0x000055555652fef8 557 --ptr;
$118 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (19,0,0), device 0, sm 6, warp 2, lane 19]
0x000055555652fef8 557 --ptr;
$119 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (20,0,0), device 0, sm 6, warp 2, lane 20]
0x000055555652fef8 557 --ptr;
$120 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (21,0,0), device 0, sm 6, warp 2, lane 21]
0x000055555652fef8 557 --ptr;
$121 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (22,0,0), device 0, sm 6, warp 2, lane 22]
0x000055555652fef8 557 --ptr;
$122 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (23,0,0), device 0, sm 6, warp 2, lane 23]
0x000055555652fef8 557 --ptr;
$123 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (24,0,0), device 0, sm 6, warp 2, lane 24]
0x000055555652fef8 557 --ptr;
$124 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (25,0,0), device 0, sm 6, warp 2, lane 25]
0x000055555652fef8 557 --ptr;
$125 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (26,0,0), device 0, sm 6, warp 2, lane 26]
0x000055555652fef8 557 --ptr;
$126 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (27,0,0), device 0, sm 6, warp 2, lane 27]
0x000055555652fef8 557 --ptr;
$127 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (28,0,0), device 0, sm 6, warp 2, lane 28]
0x000055555652fef8 557 --ptr;
$128 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (29,0,0), device 0, sm 6, warp 2, lane 29]
0x000055555652fef8 557 --ptr;
$129 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (30,0,0), device 0, sm 6, warp 2, lane 30]
0x000055555652fef8 557 --ptr;
$130 = 0x7fff
[Switching focus to CUDA kernel 0, grid 2, block (14,0,0), thread (31,0,0), device 0, sm 6, warp 2, lane 31]
0x000055555652fef8 557 --ptr;
$131 = 0x7fff

=================== Device Query ===============

$ sudo /opt/cuda/extras/demo_suite/deviceQuery
/opt/cuda/extras/demo_suite/deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “Quadro P2000”
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 5059 MBytes (5304745984 bytes)
( 8) Multiprocessors, (128) CUDA Cores/MP: 1024 CUDA Cores
GPU Max Clock rate: 1481 MHz (1.48 GHz)
Memory Clock rate: 3504 Mhz
Memory Bus Width: 160-bit
L2 Cache Size: 1310720 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 94 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1, Device0 = Quadro P2000
Result = PASS