Unexpected shared memory bank conflict.

On CC 7.0 - 7.5 devices, shared memory loads with uniform addresses can achieve higher bandwidth if either of the following is true:

  • Thread pairs (Tn and Tn^1) have the same addresses for all active threads (i.e. T0==T1, T2==T3, T4==T5, T6==T7, etc.), or
  • Thread pairs (Tn and Tn^2) have the same addresses for all active threads (i.e. T0==T2, T1==T3, T4==T6, T5==T7, etc.)
  • (Note these encompass the case where all active threads have the same address)
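To make the two pairing rules concrete, here is a minimal sketch of a host-side check over one warp's addresses. It is only a model of the rules as stated above, not anything exposed by the hardware or the profiler. The names `pairs_uniform` and `meets_uniform_rule` are hypothetical, and treating a pair with an inactive thread as satisfying the rule is an assumption on my part.

```python
WARP_SIZE = 32


def pairs_uniform(addrs, xor_mask):
    """Return True if every active thread n has the same address as thread n ^ xor_mask.

    addrs holds one byte address per lane; None marks an inactive thread.
    Assumption: a pair in which either thread is inactive does not break the rule.
    """
    for n, addr in enumerate(addrs):
        partner = addrs[n ^ xor_mask]
        if addr is None or partner is None:
            continue
        if addr != partner:
            return False
    return True


def meets_uniform_rule(addrs):
    """Return True if the warp satisfies the Tn/Tn^1 rule or the Tn/Tn^2 rule."""
    if len(addrs) != WARP_SIZE:
        raise ValueError("expected one address (or None) per lane of a 32-thread warp")
    return pairs_uniform(addrs, 1) or pairs_uniform(addrs, 2)


if __name__ == "__main__":
    # Each thread loads its own 16-byte vector: no pair shares an address.
    distinct = [16 * n for n in range(WARP_SIZE)]
    # T0==T1, T2==T3, ...: adjacent threads share a 16-byte vector.
    adjacent_pairs = [16 * (n // 2) for n in range(WARP_SIZE)]
    print(meets_uniform_rule(distinct), meets_uniform_rule(adjacent_pairs))
```

A broadcast (all lanes reading one address) passes both checks, which matches the note that the rules encompass that case.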

The return bandwidth of L1 and shared memory is 128 bytes/cycle, which is one 32-bit register per thread per cycle for a full warp (32 threads x 4 bytes). A uniform access can achieve 2x the performance by packing the return registers. In the case above the rules are not met, so LDS.128 requires 4 requests per instruction because it returns 4 registers per thread. If the rules are met, the requests per instruction should drop from 4 to 2.
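The arithmetic behind the 4 and 2 figures can be sketched as follows. This is a simplified model built only from the numbers quoted above; the function names are hypothetical, and only the LDS.128 results are taken from the explanation. Behaviour for other load widths is an extrapolation and should be confirmed with the profiler.

```python
RETURN_BYTES_PER_CYCLE = 128  # L1 / shared memory return bandwidth quoted above
WARP_SIZE = 32
REGISTER_BYTES = 4            # one 32-bit register


def registers_per_cycle():
    """Registers returned per thread per cycle for a full warp."""
    return RETURN_BYTES_PER_CYCLE // (WARP_SIZE * REGISTER_BYTES)


def requests_per_instruction(load_bits, uniform_pairs):
    """Estimate requests for one shared memory load instruction.

    load_bits:     width of the load per thread (128 for LDS.128).
    uniform_pairs: True if one of the thread-pair rules is met, which
                   (per the explanation above) packs two registers per return.
    Assumption: the count is the register count divided by the packing
    factor, rounded up.
    """
    registers = load_bits // (REGISTER_BYTES * 8)
    packing = 2 if uniform_pairs else 1
    return -(-registers // packing)  # ceiling division


if __name__ == "__main__":
    print(registers_per_cycle())
    print(requests_per_instruction(128, uniform_pairs=False))
    print(requests_per_instruction(128, uniform_pairs=True))
```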