I assume there is no bank conflict in this case, because the data at location 0 will be broadcast to threads 0, 4, 8, 12, 16, 20, 24, and 28.
But the profiling results show that 4 requests are generated for each LDS.U.128 instruction. (I expected 2 requests.)
On CC 7.0 - 7.5 devices, shared memory loads with uniform addresses can increase bandwidth if either of the following is true:

- Thread pairs (Tn and Tn^1) have the same address for all active threads (i.e. T0==T1, T2==T3, T4==T5, T6==T7, etc.), or
- Thread pairs (Tn and Tn^2) have the same address for all active threads (i.e. T0==T2, T1==T3, T4==T6, T5==T7, etc.)

(Note that both rules encompass the case where all active threads have the same address.)
The return bandwidth of L1 and shared memory is 128 bytes/cycle == 1 register/cycle. A uniform access can achieve 2x the performance by packing two return registers into a single 128-byte return. In the case above, neither rule is met, so LDS.128 requires 4 requests per instruction, as it returns 4 registers (16 bytes x 32 threads = 512 bytes = 4 x 128-byte returns). If the guidelines are met, the requests per instruction should drop from 4 to 2.
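To make the rule concrete, here is a small Python sketch (not an NVIDIA tool, just an illustrative model under the assumptions stated in the comments) that takes the per-thread shared memory addresses of one warp and predicts whether an LDS.128 needs 4 requests or can pack its returns into 2:

```python
def lds128_requests(addrs):
    """Predict requests per LDS.128 for one warp of 32 threads.

    addrs: list of 32 per-thread shared memory byte addresses
    (assumed 16-byte aligned, all 32 threads active).

    Each thread returns 16 bytes, so a warp returns 512 bytes.
    At 128 bytes/cycle that is normally 4 requests. If thread
    pairs Tn/Tn^1 (or Tn/Tn^2) share an address, two return
    registers can be packed per 128-byte return, halving the
    request count to 2.
    """
    assert len(addrs) == 32
    pair_xor1 = all(addrs[t] == addrs[t ^ 1] for t in range(32))
    pair_xor2 = all(addrs[t] == addrs[t ^ 2] for t in range(32))
    return 2 if (pair_xor1 or pair_xor2) else 4


# Case from the question: threads 0, 4, 8, ... read location 0,
# i.e. address repeats with period 4 across the warp. T0 != T1
# and T0 != T2, so neither rule holds: 4 requests.
broadcast_mod4 = [(t % 4) * 16 for t in range(32)]
print(lds128_requests(broadcast_mod4))   # -> 4

# Fully uniform warp: every thread reads address 0. Both rules
# hold, so the packed path applies: 2 requests.
uniform = [0] * 32
print(lds128_requests(uniform))          # -> 2
```

This also shows why "no bank conflict" and "minimum requests" are different questions: the broadcast pattern above is conflict-free, but it still fails the pairwise-uniform packing rule, so the profiler reports 4 requests per instruction.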