Hi all. Despite compulsively reading the manual, I’m not sure I understand
the subtleties of shared memory access…
I would like to know if the following scenario makes sense.
Assume that every thread of a block needs a working array of 100 integers.
(in practice I do not really “share” anything among threads…)
Since 100*4*32 = 12800 bytes, and 16384/2 < 12800 < 16384 (available shared memory),
I decide to use 32 threads per block.
If I understood the manual, the 32-bit words are assigned cyclically to
banks, so, if I have
__shared__ int Data[100*32];
Data[0], Data[16], Data[32], … belong to Bank 0,
Data[1], Data[17], Data[33], … belong to Bank 1,
and so on, up to Data[15], Data[31], …, which belong to Bank 15.
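To make my reading of the cyclic bank assignment concrete, here is a minimal host-side sketch. It assumes 16 banks of 32-bit words and an array that starts on a bank-0 boundary; `kNumBanks` and `bank_of` are names made up for illustration, not anything from the CUDA API.

```cpp
// Assumption taken from the post: 16 banks, each one 32-bit word wide,
// with consecutive words assigned to consecutive banks.
constexpr int kNumBanks = 16;

// Bank that holds the 32-bit word at position word_index of a shared
// int array, assuming the array starts on a bank-0 boundary.
constexpr int bank_of(int word_index) {
    return word_index % kNumBanks;
}
```

Under these assumptions, `bank_of(i)` is just the index modulo 16, so any two words whose indices differ by a multiple of 16 land in the same bank.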
Then, to let each thread work freely on its own array
without conflicts with other threads, each thread of a half-warp
should work on a different bank (?).
Would something like this work:
#define V(x) Data[threadIdx.x + (x)*32]
to use V(0)… V(99) in each thread as a local array?
In this way, if I’m right, thread 0 will work only on bank 0,
thread 1 will work only on bank 1 and so on, and the
same for the second half-warp, i.e., thread 16 will work on bank 0
and so on. Is this correct?
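For what it's worth, the index arithmetic can be checked on the host without a GPU. The sketch below assumes 16 banks of 32-bit words, 32 threads per block and 100 elements per thread, as in my scenario; `v_index`, `half_warp_conflict_free` and `layout_conflict_free` are hypothetical helper names, and the code only models the bank mapping, not the actual hardware behaviour.

```cpp
#include <bitset>

// Assumptions taken from the post, not queried from any device.
constexpr int kNumBanks = 16;   // banks of 32-bit words
constexpr int kThreads  = 32;   // threads per block
constexpr int kElems    = 100;  // working-array length per thread

// Index into Data that V(x) would produce for thread tid.
constexpr int v_index(int tid, int x) {
    return tid + x * kThreads;
}

// True if the 16 threads of the half-warp starting at first_tid all hit
// different banks when each of them accesses its own element x.
bool half_warp_conflict_free(int first_tid, int x) {
    std::bitset<kNumBanks> seen;
    for (int t = first_tid; t < first_tid + kNumBanks; ++t) {
        int bank = v_index(t, x) % kNumBanks;
        if (seen.test(bank)) return false;  // two threads share a bank
        seen.set(bank);
    }
    return true;
}

// Checks every element index and both half-warps of the block.
bool layout_conflict_free() {
    for (int x = 0; x < kElems; ++x)
        for (int first = 0; first < kThreads; first += kNumBanks)
            if (!half_warp_conflict_free(first, x)) return false;
    return true;
}
```

Since the stride 32 is a multiple of 16, `v_index(tid, x) % 16` reduces to `tid % 16` for every `x`, so in this model each thread always stays on one bank, whatever element it touches.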