Section G.4.3 of the CUDA Programming Guide says that Fermi's shared memory is designed to handle bank conflicts specifically for 64-bit and 128-bit accesses. However, on a compute capability 2.0 (Fermi) device, shared memory accesses are handled per warp (32 threads) and each bank is 32 bits wide, so I don't understand how a 64-bit access can be bank-conflict-free, or how a 128-bit access can cause only a 2-way conflict. From my naive point of view, each 64-bit word maps to 2 banks, so when a warp of 32 threads accesses shared memory it would cause at least a 2-way bank conflict on 64-bit accesses. Can anyone explain how this works? Also, how are 64-bit and 128-bit words (i.e. int2 and int4 vectors) stored across the banks?
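To make my reasoning concrete, here is a small host-side sketch in Python (not CUDA code) of the mapping I am assuming: 32 banks, each 32 bits wide, with successive 32-bit words assigned to successive banks, and a densely packed array in which thread `t` reads element `t`. The names `banks_touched` and `naive_conflict_degree` are my own, made up for illustration. It only models my naive view that the whole warp is serviced in one pass; it says nothing about what the hardware actually does.

```python
NUM_BANKS = 32   # number of shared memory banks assumed for compute capability 2.x
BANK_WIDTH = 4   # bytes per bank (32 bits)

def banks_touched(thread_id, element_bytes):
    """Banks covered by element `thread_id` of a densely packed shared array."""
    first_word = thread_id * element_bytes // BANK_WIDTH
    words = element_bytes // BANK_WIDTH
    return [(first_word + w) % NUM_BANKS for w in range(words)]

def naive_conflict_degree(element_bytes, threads=32):
    """Largest number of distinct 32-bit words landing in one bank,
    assuming (naively) that all `threads` are serviced in a single pass."""
    hits = [0] * NUM_BANKS
    for t in range(threads):
        for bank in banks_touched(t, element_bytes):
            hits[bank] += 1
    return max(hits)

print(banks_touched(0, 8))    # [0, 1]
print(banks_touched(16, 8))   # [0, 1]  same banks as thread 0, different words
print(naive_conflict_degree(4))    # 1  (32-bit: one word per bank)
print(naive_conflict_degree(8))    # 2  (64-bit, e.g. int2)
print(naive_conflict_degree(16))   # 4  (128-bit, e.g. int4)
```

Under this model a full warp gives a 2-way conflict for 64-bit accesses and a 4-way conflict for 128-bit accesses, which is exactly why the guide's statement confuses me. The count only drops to 1 if fewer threads are considered at once, for example `naive_conflict_degree(8, threads=16)`.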