Bank conflicts on same address

Section 5.1.2.5 of the CUDA 2.1 Programming Guide states, “any memory read or write request made of n addresses that fall in n distinct memory banks can be serviced simultaneously.” Does a bank conflict occur if multiple threads read from two different addresses? For example, does a conflict occur if threads 0 and 1 read from foo[0] and threads 2 and 3 read from foo[1]? This is only two unique addresses (foo[0] and foo[1]) and each address falls in a different bank (assuming foo is an array of ints).

Of course a bank conflict occurs, because threads 0, 1 and threads 2, 3 belong to the same half-warp.

Threads 0 and 1 access the same bank (the one that holds foo[0]), and threads 2 and 3 access the same bank (the one that holds foo[1]).

But the four threads read a total of two addresses from two banks (n=2). Threads 0 and 1 both read foo[0] (the same address) and threads 2 and 3 both read foo[1] (the same address). Why does it cause a bank conflict if the same exact address is being read? (I understand why a conflict would occur if multiple threads read from two or more addresses in the same 32-bit word. This is the reason for the broadcast mechanism.)

See figure 5.8 in the same manual for your answer - assuming the other threads in the half-warp don’t cause conflicts then there will be no conflicts.

I’m not sure that either diagram (left or right) in figure 5.8 depicts the scenario I am describing. In the left diagram, all threads in the half-warp read from the same 32-bit word (a single word). As a result, the broadcast mechanism can service all of the requests at once. The right diagram depicts a scenario in which there is one set of threads reading from the same word, while all other threads read from words in different banks. The guide states that which word is selected as the broadcast word is unspecified and can’t be predetermined. As a result, that scenario causes a bank conflict if bank 5 is not selected as the broadcast word.

Both of these scenarios are different from what I am describing. My scenario would look like:

thread 0 → foo[0] (bank 0)
thread 1 → foo[0] (bank 0)
thread 2 → foo[1] (bank 1)
thread 3 → foo[1] (bank 1)

To make it easier to understand:

Thread 0 accesses bank 0 (foo[0]), and at the same time thread 1 wants to access bank 0 (foo[0]) too.

In this situation, the hardware lets thread 0 access bank 0 while thread 1 must wait until thread 0 has finished → warp serialization occurs (a bank conflict).

The same applies to threads 2 and 3.