I’m trying to write a simple kernel that uses a 2D array in shared memory, but I have probably misunderstood some basic principle.
The array is defined as
Assuming that the array is linearized in shared memory in such a way that the first element is  and the following is ,
my idea was that each thread in a half-warp accesses a different bank in shared memory regardless of the value of the first index.
For instance, if thread 2 wants to write 2 in bin=5
thread 2 => array = 2
this does not conflict with any of the other concurrent operations in the same half-warp
thread 0 => array = 4
thread 1 => array = 2
thread 5 => array = 1
thread 15 => array = 8
because, I suppose, the bank is determined by the fastest-varying index.
In any case, something is apparently wrong: if I set the bin index to some fixed value, then everything works, in the sense that the profiler reports no serialized warps and performance is very good,
but if I use the correct bin index according to the algorithm of my kernel, then the result is much worse (many serialized warps and poor performance).
Could you help me understand what is wrong with this approach?