Bank conflicts in CUDA's parallel prefix scan

I have a quick question regarding the “Parallel Prefix Scan with CUDA” paper. In it, you state briefly that “when multiple threads in the same warp access the same bank, a bank conflict occurs, unless all threads access the same address within the 32 bit word”.

From the website: "Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and the bandwidth is 32 bits per bank per clock cycle."

I am a little confused by this statement. Isn't an address 32 bits on a 32-bit machine and 64 bits on a 64-bit machine? So, if the kernel used shared memory and each thread in the warp (a warp has 32 threads on my machine) accessed a different address, each thread would be accessing an address in its very own bank, and therefore there should be no conflict.

Am I missing something here? Unfortunately, I cannot make sense of the CUDA docs to figure this one out. Please advise.

Memory addressing is done by bytes: address 0 is byte 0, address 1 is byte 1, and so on.

For 32-bit quantities, the first quantity would normally be located at address 0 (for a naturally aligned word), and the second quantity would be located at address 4.

Therefore a 32-bit word “contains” 4 addresses. For the 32-bit word at address 0, it “contains” byte addresses of 0,1,2, and 3.

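This is the meaning of “the same address within a 32-bit word”: as I read it, threads whose byte addresses all fall inside the same 32-bit word do not conflict with each other, even though they hit the same bank.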

Assuming 32 banks, each one 32-bit word wide, the banks repeat every 128 bytes:

Bank 0 includes byte addresses 0, 1, 2, and 3, as well as 128, 129, 130, and 131, etc.
Bank 1 includes byte addresses 4, 5, 6, and 7, as well as 132, 133, 134, and 135, etc.
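
To make that mapping concrete, here is a minimal host-side C++ sketch of the arithmetic. It assumes 32 banks that are each 4 bytes wide, which is what the addresses above imply. The names `word_index` and `bank_of` are made up for illustration and are not part of any CUDA API.

```cpp
#include <cstdint>

// Assumed layout: 32 banks, each one 32-bit (4-byte) word wide.
constexpr std::uint32_t kNumBanks = 32;
constexpr std::uint32_t kBytesPerWord = 4;

// Hypothetical helper: index of the 32-bit word that contains a byte address.
constexpr std::uint32_t word_index(std::uint32_t byte_address) {
    return byte_address / kBytesPerWord;
}

// Hypothetical helper: successive words go to successive banks, wrapping
// around after kNumBanks words (that is, every 128 bytes).
constexpr std::uint32_t bank_of(std::uint32_t byte_address) {
    return word_index(byte_address) % kNumBanks;
}

// Compile-time checks against the addresses listed above.
static_assert(bank_of(0) == 0 && bank_of(3) == 0, "bytes 0..3 are in bank 0");
static_assert(bank_of(128) == 0 && bank_of(131) == 0, "bytes 128..131 are in bank 0");
static_assert(bank_of(4) == 1 && bank_of(7) == 1, "bytes 4..7 are in bank 1");
static_assert(bank_of(132) == 1 && bank_of(135) == 1, "bytes 132..135 are in bank 1");
```

The division picks out the word and the modulo picks out the bank, so two byte addresses share a bank whenever their word indices differ by a multiple of 32.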

If thread 0 is accessing address 0 and thread 1 is accessing address 4, then there will be no bank conflict between those two threads (banks 0 and 1).

If thread 0 is accessing address 0 and thread 1 is accessing address 132, then there will be no bank conflict either (address 132 is in bank 1).

If thread 0 is accessing address 0 and thread 1 is accessing address 128, there will be a bank conflict (both addresses are in bank 0, but in different 32-bit words).
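
The three cases above can be checked with a small host-side C++ sketch. It again assumes 32 banks of 4 bytes each, and it assumes that accesses to the same 32-bit word do not conflict. `conflict_degree` is a hypothetical helper written for this post, not a CUDA function.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Assumed layout: 32 banks, each one 32-bit (4-byte) word wide.
constexpr std::uint32_t kNumBanks = 32;
constexpr std::uint32_t kBytesPerWord = 4;

// Hypothetical helper: given the byte address each thread accesses, return
// the largest number of distinct 32-bit words that land in a single bank.
// A result of 1 means no bank conflict; N means an N-way conflict.
// (An empty list of addresses yields 0.)
std::size_t conflict_degree(const std::vector<std::uint32_t>& byte_addresses) {
    std::vector<std::set<std::uint32_t>> words_per_bank(kNumBanks);
    for (std::uint32_t addr : byte_addresses) {
        const std::uint32_t word = addr / kBytesPerWord;
        // Threads hitting the same word collapse into one set entry,
        // which models the "same 32-bit word" exception.
        words_per_bank[word % kNumBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank) {
        worst = std::max(worst, words.size());
    }
    return worst;
}
```

With this sketch, `conflict_degree({0, 4})` and `conflict_degree({0, 132})` both give 1 (no conflict), while `conflict_degree({0, 128})` gives 2, a two-way conflict in bank 0.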

I’m not sure what the “Parallel Prefix Scan with CUDA” paper is, but I would refer to the programming guide for shared memory details:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory-2-x