With multiple warps (from 2 to 8), the first clock measurements follow one another across warps (3 cycles apart). The second measurements are also a few cycles apart between warps. But the third measurements look a bit odd to me: the first one (warp 0) occurs much earlier than the others, as follows (I computed the deltas between clocks 0 and 1 and between clocks 1 and 2).
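A minimal sketch of the delta computation described above, assuming each warp stored three `clock()` samples as 32-bit unsigned values and copied them back to the host. The names `WarpSamples`, `WarpDeltas`, `compute_deltas` and `skew_vs_warp0` are hypothetical and only illustrate the arithmetic; this is not the original measurement code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: the three clock samples recorded by one warp.
struct WarpSamples {
    std::array<std::uint32_t, 3> clk;
};

// Deltas between consecutive samples of a single warp.
struct WarpDeltas {
    std::uint32_t d01;  // clock 1 - clock 0
    std::uint32_t d12;  // clock 2 - clock 1
};

// Unsigned 32-bit subtraction stays correct across a single counter wrap.
inline WarpDeltas compute_deltas(const WarpSamples& s) {
    return {static_cast<std::uint32_t>(s.clk[1] - s.clk[0]),
            static_cast<std::uint32_t>(s.clk[2] - s.clk[1])};
}

// Offset of every warp's n-th sample relative to warp 0's n-th sample.
// A negative value means the warp sampled before warp 0 did.
inline std::vector<std::int32_t> skew_vs_warp0(const std::vector<WarpSamples>& warps,
                                               std::size_t n) {
    assert(n < 3);
    std::vector<std::int32_t> skew;
    skew.reserve(warps.size());
    for (const WarpSamples& w : warps) {
        skew.push_back(static_cast<std::int32_t>(w.clk[n] - warps[0].clk[n]));
    }
    return skew;
}
```

Calling `compute_deltas` on each warp gives the two deltas per warp, and `skew_vs_warp0(warps, 2)` shows how far each warp's third sample is from warp 0's. Any sample values fed into it for illustration are placeholders, not measured results.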
On Maxwell/Pascal, fully predicated-off L1/TEX instructions are still dispatched to the L1/TEX unit, because all operations to L1/TEX must complete in order. A fully predicated-off instruction generates a bubble in the pipeline. This uses significantly fewer cycles than an instruction with at least one thread predicated on. The instruction will pay the penalty for any warp instruction that misses in L1/TEX prior to it…
How are you generating the SASS? There is not enough information in the disassembly above to know if it is correct.
So if a prior warp has issued a non-predicated instruction to L1/TEX, and the other warps then wait for completion on another instruction (here DEPBAR.LE SB0, 0x0), are the fully predicated-off warps going to block on DEPBAR until the first non-predicated warp has finished its access?
If that statement is correct, why is the second delta of the first warp shorter than the others?
Because once all warps have sent their instructions to the L1/TEX unit, they block on DEPBAR. When the first load is done, the first warp can execute the clock instruction, and the others should follow as fast as possible since the other warps are fully predicated off.