Long latency operations

I ran some experiments with the following SASS code.

BFE.U32 R0, R0, 0x708;
ISETP.EQ.AND P0, PT, R0, 0x0, PT;
MOV R2, c[0x0][0x140];
MOV R3, c[0x0][0x144];
CS2R R10, SR_CLOCKLO; // clock_0
@P0 LDG.E RZ, [R2];
CS2R R11, SR_CLOCKLO; // clock_1
@P0 DEPBAR.LE SB0, 0x0;
CS2R R13, SR_CLOCKLO; // clock_2

On multiple warps (from 2 to 8), the first clock measurements follow one another across warps (3 cycles apart). The second measurements are also a few cycles apart between warps. But the third measurements look odd to me: the first one (warp 0) occurs much earlier than the others, as shown below (I computed the deltas between clock 0 and clock 1, and between clock 1 and clock 2).

warp 0 : 8, 38
warp 1 : 9, 149
warp 2 : 9, 149
warp 3 : 8, 152
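For reference, the deltas can be derived from the raw clock readings roughly as in the sketch below. It assumes each CS2R of SR_CLOCKLO yields a 32-bit value, so the subtraction is masked to tolerate a single counter wraparound. The names `clock_delta` and `warp_deltas` and the raw readings are hypothetical; the readings were chosen only so that they reproduce the deltas listed above.

```python
MASK32 = 0xFFFFFFFF  # assumption: SR_CLOCKLO is read as a 32-bit counter


def clock_delta(start, end):
    """Cycles elapsed from start to end, tolerating one 32-bit wraparound."""
    return (end - start) & MASK32


def warp_deltas(samples):
    """Map per-warp (clock_0, clock_1, clock_2) readings to (delta_01, delta_12)."""
    return [(clock_delta(c0, c1), clock_delta(c1, c2)) for c0, c1, c2 in samples]


# Hypothetical raw readings, one tuple per warp (not measured values).
samples = [
    (1000, 1008, 1046),  # warp 0
    (1003, 1012, 1161),  # warp 1
    (1006, 1015, 1164),  # warp 2
    (1009, 1017, 1169),  # warp 3
]

for warp, (d01, d12) in enumerate(warp_deltas(samples)):
    print(f"warp {warp} : {d01}, {d12}")
```

The mask matters only when the counter wraps between two readings; for example, `clock_delta(0xFFFFFFFC, 0x00000004)` gives 8 rather than a large negative number.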

My question is: why are the deltas for warps 1 to 3 greater than for the first warp, even though those warps do not execute the memory access?

I should clarify that I am using a Jetson TX2, so this is the Pascal microarchitecture.

On Maxwell/Pascal, fully predicated-off L1/TEX instructions are still dispatched to the L1/TEX unit, as all operations to L1/TEX must complete in order. A fully predicated-off instruction generates a bubble in the pipeline. This uses significantly fewer cycles than an instruction with at least 1 thread predicated on. The instruction will pay the penalty for any warp instruction prior to it that misses in L1TEX…

How are you generating the SASS? There is not enough information in the disassembly above to know if it is correct.

I use maxas (GitHub - NervanaSystems/maxas: Assembler for NVIDIA Maxwell architecture) to generate the SASS. If you want, I can provide the full SASS code. What kind of information do you need?

So, if a prior warp issued a non-predicated instruction to the L1/TEX and the other warps then wait for completion on another instruction (here DEPBAR.LE SB0, 0x0), are the other, predicated-off warps going to block on DEPBAR until the first, non-predicated warp has finished its access?

If that statement is correct, why is the second delta of the first warp shorter than the others?
Once all warps have sent their instructions to the L1/TEX unit, they block on DEPBAR. When the first load is done, the first warp can execute the clock instruction, and the others should follow as fast as possible, since the other warps are fully predicated off.