How do Fermi devices handle bank conflicts in shared memory? How does the hardware work?

Hi all,

Section G.4.3 of the CUDA Programming Guide says that Fermi's shared memory is specifically designed to handle bank conflicts for 64-bit and 128-bit accesses. However, on a compute capability 2.0 (Fermi) device, shared memory accesses are handled per warp (32 threads) and each bank is 32 bits wide, so I don't see how 64-bit accesses can be bank-conflict free and 128-bit accesses can cause only 2-way conflicts. From my naive point of view, each 64-bit word is mapped to 2 banks, so when a warp of 32 threads accesses shared memory with 64-bit accesses, it should cause at least a 2-way bank conflict. Can anyone explain how this works? Also, how are 64-bit and 128-bit words (e.g. int2 and int4 vectors) stored in the banks?

Thanks,
Roto
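
On the storage part of the question, here is a minimal sketch of the usual mapping, assuming 32 banks that are each 32 bits wide, with successive 32-bit words assigned to successive banks and a 4-byte-aligned array. The names `NUM_BANKS`, `BANK_WIDTH_BYTES` and `banks_for_element` are made up for illustration. This is plain address arithmetic in Python, not a model of how the hardware resolves conflicts.

```python
NUM_BANKS = 32          # assumed: 32 shared memory banks
BANK_WIDTH_BYTES = 4    # assumed: each bank is 32 bits wide

def banks_for_element(index, elem_bytes, base_byte_offset=0):
    """Return the banks touched by element `index` of a contiguous array
    whose elements are `elem_bytes` bytes each (8 for int2, 16 for int4)."""
    first_word = (base_byte_offset + index * elem_bytes) // BANK_WIDTH_BYTES
    words = elem_bytes // BANK_WIDTH_BYTES
    # Successive 32-bit words go to successive banks, wrapping at NUM_BANKS.
    return [(first_word + w) % NUM_BANKS for w in range(words)]

print(banks_for_element(0, 8))    # int2 element 0  -> [0, 1]
print(banks_for_element(15, 8))   # int2 element 15 -> [30, 31]
print(banks_for_element(16, 8))   # int2 element 16 -> [0, 1] (wraps around)
print(banks_for_element(1, 16))   # int4 element 1  -> [4, 5, 6, 7]
```

Under these assumptions an int2 spans two adjacent banks and an int4 spans four, so int2 elements 0 and 16 land in the same pair of banks.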

Fermi has two warp schedulers: one for odd warps and the other for even warps (some call this dual-issue).

So a bank conflict occurs when two threads in a half-warp map to the same bank but to different locations.

Remember that the GPU is a vector machine of length 32 (the warp size), but memory access is based on the half-warp.
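
To put numbers on the definition above, the sketch below counts conflicts in a simple model. It assumes 32 banks of 32 bits, successive 32-bit words in successive banks, and thread t reading element t of a contiguous array. The conflict degree is taken to be the largest number of distinct 32-bit words that fall into one bank within a group of threads served together. The group size (32 for a warp, 16 for a half-warp) is a parameter, and `conflict_degree` is a hypothetical helper. This is counting under stated assumptions, not a description of what the hardware actually does.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumed: 32 banks, each 32 bits (4 bytes) wide

def conflict_degree(elem_bytes, group_size, warp_size=32):
    """Worst-case number of distinct 32-bit words per bank, over all
    groups of `group_size` consecutive threads in one warp."""
    worst = 1
    for start in range(0, warp_size, group_size):
        words_per_bank = defaultdict(set)
        for t in range(start, start + group_size):
            first_word = t * elem_bytes // 4
            for w in range(first_word, first_word + elem_bytes // 4):
                # Same word in the same bank is not a conflict, so use a set.
                words_per_bank[w % NUM_BANKS].add(w)
        worst = max(worst, max(len(s) for s in words_per_bank.values()))
    return worst

for elem_bytes in (4, 8, 16):
    print(elem_bytes * 8, "bit:",
          "per warp =", conflict_degree(elem_bytes, 32),
          "per half-warp =", conflict_degree(elem_bytes, 16))
```

In this model, 64-bit accesses give a degree of 2 when the whole warp is counted together and 1 when counted per half-warp, while 128-bit accesses give 4 and 2. So the figures quoted from the guide (conflict-free for 64-bit, 2-way for 128-bit) come out of the arithmetic only under the half-warp grouping.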

Hi Chien,

Thanks for the explanation. From Fermi's architecture I see there are 16 load/store units per multiprocessor (and 32 ALUs), so that may be why it processes memory requests per half-warp (16 threads). However, the documentation also says that shared memory requests are processed "per warp", which is sort of confusing to me. By the way, could you please give me more details on the relationship between the dual warp scheduler and bank conflicts?

Thx,

Rotor
