Basic question on array in shared memory


I’m trying to write a simple kernel with a 2D array in the shared memory, but probably I didn’t have understood very well some basic principle.

The array is defined as
assuming that the array is linearized in the shared memory is such a way that the first element is [0][0] and the following is [0][1],
my idea was that each thread in an half-warp has access to different bank in the shared memory regardless to the value of the first index.
For instance if the thread 2 wants to write 2 in bin=5
thread 2 => array[5][2] = 2
this have no conflict with any other of these concurrent operations in the same half warp
thread 0 => array[5][0] = 4
thread 1 => array[2][1] = 2
thread 5 => array[4][5] = 1
thread 15 => array[13][15] = 8
because, I suppose, the position in the bank is defined by the faster index.

In any case, apparently, something is wrong because if I put the bin index equal to some fix value then everthings works , in the sense that I haven’t warp serialized in the profiler and very good
performance, but if I put for the bin index the correct value according to the algorithm of my kernel then the result is quite worst (many warp serialized and poor performance).
Could you hel pme to understand what is wrong in this approach?

Your array is of a 32 bit wide datatype? (like int or float)

yes, is an int array[16][128],

in my previous post I forgot to say that the first index (bin) depends on data, so it changes in each thread.

Anyway In my understanding, in this case, isn’t a problem due to the fact that the first index in my array runs always in the same shared memory bank

(for instance all the [bins] defined in the thread N correspond to memory location in bank N).

Just to be clear, because i’m not so sure that the scheme in my head is correct, let me make a sketch:

bank0 bank1 … bank15

[0][0] [0][1] …[0][15]

[0][16] … [0][31]

[0][112] … [0][127]

[1][0] [1][1] … [1][15]


in one single thread in an half-warp the bin index isn’t defined a priori, but in any case runs in the N column in the matrix above.

For Instance the first thread fill the element [0][16], the second [4][33], the third [0][18] and so on.

Is it correct?

Any idea?


As far as I can see your strategy is correct and should give you a conflict free access pattern.

Conclusion: Either your program is not reading your mind but rather reads the source - or the experimental version using non-data driven access was collapsed into something trivial and optimized away.

Mind you, moving stuff around isn’t free.


sure, my only doubt is that the profiler tell me that I have warp serialization and that this serialization go away if I put a fixed number as first index instead of the correct data dependent computed bin-index.

I’m little bit puzzled.

Thanks again for your answer.


Me too …

You are banking on the notion that any index having its 4 least significant bits equal to the thread should be conflict free, right? Very well, let’s say this is so:

int index = threadIdx.x & 0x0F;

… and then you have some input data that you calculate upon and sets the other bits. Clean up the result so that we know it really is sane by first taking away the 4 lower bits:

input &= ~0x0F;

… and then combining the two:

index |= input;

And for good measure while we are at it, restrict access to be within the 128 threads you actually have:

index &= 0x7F;

What else could go wrong? Yes you had a second index that should be restricted to the range 0 - 15 …

It is possible that your program will now produce the wrong results, but at least it should be conflict free :-D

Thank you very much for your nice suggestion, tomorrow I will try to implement it.
I have some doubt that the problem is in the first index indeed because I’m supposing that for the second index (depending on threadIdx.x) I have no problem.
Let me show to you my code, at least the critical parts:

if(threadIdx.x >= 32 && threadIdx.x < CALC_THREAD+32) {
int Idx_rel = (threadIdx.x - 32);

for (int idx=0;idx<nnevt;idx++) {
int idx_hit = Idx_evt32+idx;
float xhit = hits_d[idx_hit].x;
float yhit = hits_d[idx_hit].y;
float disx = sqrt((xhit-xcenterx)
int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS));
if (ndisx<16) {
int tmp_pre_histox = pre_histox[ndisx][Idx_rel];
pre_histox[ndisx][Idx_rel] = tmp_pre_histox + 1;

I’m supposing that Idx_rel is already conflict free depending linearly on threadIdx.x. With this code I have a lot of warp serialized and 8 ms of execution time
if I change int ndisx = ceilf() with int ndisx = 4 (for example), I have no warp serialized and 1 ms of execution time.
All the conflicts come from the writing in pre_histo and not in the reading. Indeed if I comment the writing row only I have no conflicts (in the sense that the tmp_pre_histox = pre_histox[ndisx][Idx_rel] appears to be conflicts free)
I’m making some strange stuff?

thx again,


I tried to implement your suggestion, changing 2D histograms with linerized 1D histogram, in the following way

[codebox] for (int idx=0;idx<nnevt;idx++) {

int idx_hit = Idx_evt*32+idx;

float xhit = hits_d[idx_hit].x;

float yhit = hits_d[idx_hit].y;

float disx = sqrt((xhit-xcenterx)*(xhit-xcenterx)+(yhit-ycenterx)*(yhit-ycenterx));

int ndisx = ceilf(disx/((float)(H_MAX-H_MIN)/NUM_BINS)); 

int index = threadIdx.x & 0x0F;

    int nndis = ndisx;

    nndis &= ~0x0F;

    index |= nndis;	

    if (ndisx<16) {

  int tmp_pre_histo1 = tmp_histo1[index];

            tmp_histo1[index] = tmp_pre_histo1 + 1;



but nothing change: the row withi the write in the shared memory is responsible of warp seriazlization while there is no problem with the read.

Thank you again for your patience,


Is it so that more than one thread could end up with the same index?

I think that element [0][0] is followed by [1][0] and later till [max-1][0] followed by [0][1]. Therefore to reduce back conflits I would try to access the array [lane][bin] rather than [bin][lane].

However to be sure on how memory is access I would use a 1D array with some multiplication/shifting…

Yes, in the 2D version, the bin index depending on ndisx could be the same between different thread in the same block, but the second index, that should determine the bank, is different among different threads.

In the 1D version, as you suggested, apparently it’s impossible because the last 4 bits are fixed by the thread index.


I dont know (according to some test seems to me that is [bin][lane], but at the moment I’m not completelly sure), in any case I tried to linerize (as in the code posted answering jma few posts above) but the problem is exactly the same:

bad performance if the bin index is computed from disx, very good performance if bin is fixed a any value (ndisx=4 for instance).

I’m going crazy!!! :)


I don’t have still solved my problem, but I wont to ask you a more general question:

If there is not conflict or divergence using shared memory, the reading time is equal (or similar) to the writing time? (assuming that there is not broadcasting in the read procedure)