When you store (or load) more than 4 bytes per thread, which is to say more than 128 bytes per warp, the GPU does not issue a single transaction. The largest transaction size is 128 bytes. If you request 16 bytes per thread, that is a total of 512 bytes per warp-wide request. The GPU will break that up into 4 transactions, each 128 bytes wide (in that case: T0-T7 make up one transaction, T8-T15 the next, and so on). Bank conflicts are determined per transaction, not per request, per warp, or per instruction.
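As a rough illustration of that splitting, here is a small host-side Python model (not GPU code). The function name `transaction_groups` is made up for this sketch, and it assumes the simple rule described above: a warp of 32 threads, and as many threads per transaction as fit in 128 bytes.

```python
WARP_SIZE = 32               # threads per warp
MAX_TRANSACTION_BYTES = 128  # largest transaction size, per the text above


def transaction_groups(bytes_per_thread):
    """Return one range of thread IDs per transaction (simplified model)."""
    # Assumption: threads are grouped in order, as many as fit in 128 bytes.
    threads_per_txn = min(WARP_SIZE, MAX_TRANSACTION_BYTES // bytes_per_thread)
    return [range(start, start + threads_per_txn)
            for start in range(0, WARP_SIZE, threads_per_txn)]


for width in (4, 8, 16):
    groups = transaction_groups(width)
    # Print the first and last thread ID of each transaction.
    print(width, [(g[0], g[-1]) for g in groups])
# 4 [(0, 31)]
# 8 [(0, 15), (16, 31)]
# 16 [(0, 7), (8, 15), (16, 23), (24, 31)]
```

With 16 bytes per thread the model gives the four groups mentioned above (T0-T7, T8-T15, and so on).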
The second case is identical to the first in this respect. Considering just threads 0-7, or just threads 8-15, and the transaction associated with each group, there is no bank conflict.
In the third case, the request across the warp is broken up the same way: threads 0-7 constitute one transaction. When we look at the activity for those threads, we see that, for example, threads 0-3 are writing to the same column(s), so we expect 4-way bank conflicts there.
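To make the per-transaction reasoning concrete, here is a minimal host-side Python sketch (a model, not GPU code). It assumes 32 banks that are each 4 bytes wide, splits the warp into 128-byte transactions as described above, and reports for each transaction the largest number of distinct 32-bit words that fall in the same bank. The function name and both access patterns are hypothetical, chosen only to mimic a conflict-free layout and a layout where threads 0-3 hit the same columns; they are not the exact indexing from the question.

```python
from collections import defaultdict

WARP_SIZE = 32
NUM_BANKS = 32               # assumption: 32 shared-memory banks
BANK_WIDTH = 4               # assumption: each bank is 4 bytes wide
MAX_TRANSACTION_BYTES = 128  # largest transaction size, per the text above


def conflict_degree_per_transaction(byte_addresses, bytes_per_thread):
    """Return the bank-conflict degree of each transaction (1 = no conflict)."""
    threads_per_txn = min(WARP_SIZE, MAX_TRANSACTION_BYTES // bytes_per_thread)
    degrees = []
    for start in range(0, WARP_SIZE, threads_per_txn):
        words_in_bank = defaultdict(set)  # bank index -> distinct words touched
        for tid in range(start, start + threads_per_txn):
            base = byte_addresses[tid]
            for offset in range(0, bytes_per_thread, BANK_WIDTH):
                word = (base + offset) // BANK_WIDTH
                words_in_bank[word % NUM_BANKS].add(word)
        # Distinct words in one bank must be serviced one after another.
        degrees.append(max(len(words) for words in words_in_bank.values()))
    return degrees


# Hypothetical pattern A: each thread accesses its own contiguous 16 bytes.
contiguous = [16 * t for t in range(WARP_SIZE)]

# Hypothetical pattern B: threads 0-3 access the same 16-byte column in four
# different 128-byte rows, threads 4-7 the next column, and so on.
same_column = [(t % 4) * 128 + (t // 4) * 16 for t in range(WARP_SIZE)]

print(conflict_degree_per_transaction(contiguous, 16))   # [1, 1, 1, 1]
print(conflict_degree_per_transaction(same_column, 16))  # [4, 4, 4, 4]
```

In pattern B, the transaction for threads 0-7 has four different words landing in each of banks 0-7, which is the 4-way conflict described above. Pattern A keeps every bank to a single word per transaction even though the warp-wide request is 512 bytes.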