How does the hardware handle bank conflicts on shared memory of Fermi devices?

Rotor · November 15, 2010, 12:03am

Hi all,

Section G.4.3 of the CUDA Programming Guide says that Fermi’s shared memory is specifically designed to handle bank conflicts for 64-bit and 128-bit accesses. However, on a compute capability 2.0 Fermi device, shared memory accesses are handled per warp (32 threads) and each bank is 32 bits wide, so I don’t understand how a 64-bit access can be bank-conflict free and a 128-bit access only a 2-way conflict. From my naive point of view, each 64-bit word is mapped to 2 banks, so when a warp of 32 threads accesses shared memory, a 64-bit access would cause at least a 2-way bank conflict. Can anyone explain how this works? Also, how are 64-bit and 128-bit words (i.e. int2 and int4 vectors) stored in the banks?
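To make the naive mapping concrete, here is a minimal host-side sketch in Python. It assumes 32 banks of 32 bits each, with successive 32-bit words assigned to successive banks, and that thread `tid` reads element `tid` of the array. The name `banks_touched` is made up for illustration; this only models the address arithmetic, not what the hardware actually does with the request.

```python
NUM_BANKS = 32  # assumption: 32 banks, each 32 bits wide


def banks_touched(tid, words_per_elem):
    """Banks covered when thread `tid` reads element `tid` of an array whose
    elements span `words_per_elem` consecutive 32-bit words
    (2 for int2, 4 for int4)."""
    first_word = tid * words_per_elem
    return [(first_word + w) % NUM_BANKS for w in range(words_per_elem)]


# Threads 0 and 16 land on the same pair of banks for int2 elements,
# which is the overlap the question is asking about.
for tid in (0, 1, 15, 16, 31):
    print(f"thread {tid:2d}: int2 -> banks {banks_touched(tid, 2)}")
```

Under these assumptions, threads `t` and `t + 16` always cover the same two banks for int2 elements, and threads `t`, `t + 8`, `t + 16` and `t + 24` cover the same four banks for int4 elements.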

Thanks,
Rotor

LSChien · November 15, 2010, 2:21am

Fermi has two warp schedulers: one for odd warps, the other for even warps (some call this dual-issue).

So a bank conflict occurs when two threads in a half-warp map to the same bank but to different locations.

Remember that the GPU is a vector machine of length 32 (the warp size), but memory access is based on half-warps.
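The definition above (same bank, different locations, within one group of threads) can be sketched as a small counting function. This is a toy model in Python, not a description of the real hardware: it assumes 32 banks of 32-bit words, and `group_size` is a hypothetical parameter so the half-warp (16) and full-warp (32) views can be compared side by side. Threads that read the same word are counted once, since they do not hit different locations.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumption: 32 banks, each 32 bits wide


def conflict_degree(word_addrs, group_size):
    """Worst-case number of distinct 32-bit words that the threads of one
    group request from the same bank. `word_addrs[t]` is the 32-bit word
    index accessed by thread t; 1 means conflict free."""
    worst = 1
    for start in range(0, len(word_addrs), group_size):
        words_per_bank = defaultdict(set)
        for word in word_addrs[start:start + group_size]:
            words_per_bank[word % NUM_BANKS].add(word)
        worst = max(worst, max(len(s) for s in words_per_bank.values()))
    return worst


stride2 = [2 * t for t in range(32)]  # each thread reads every other word
print(conflict_degree(stride2, 16))   # evaluated per half-warp
print(conflict_degree(stride2, 32))   # evaluated per full warp
```

In this model a stride of two 32-bit words is conflict free when evaluated per half-warp, but becomes a 2-way conflict when evaluated per full warp, so the result depends on which group size the hardware actually uses.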

Rotor · November 15, 2010, 3:53am

Hi Chien,

Thanks for the explanation. Looking at Fermi’s architecture, I see there are 16 load/store units per multiprocessor (and 32 ALUs), so that may be why it processes memory requests per half-warp (16 threads). However, the documentation also says that shared memory requests are processed “per warp”, which is sort of confusing to me. By the way, could you please give me more details on the relationship between the dual warp scheduler and bank conflicts?

Thx,

Rotor
