Hi all,
I am processing a matrix of width “width” that is stored in a 1D array (data). Each thread accesses the element k = width * blockIdx.y + blockIdx.x * NUM_THREADS + threadIdx.x. I declare floats X(i) and fill them from the array as X(i) = data(k). The Y(i) are arrays whose size equals the number of threads.
__shared__ float X1;
__shared__ float X2;
__shared__ float X3;
__shared__ float X4;
__shared__ float X5;
__shared__ float X6;
__shared__ float X7;
__shared__ float X8;
__shared__ float X9;
__shared__ float Y1[NUM_THREADS];
__shared__ float Y2[NUM_THREADS];
__shared__ float Y3[NUM_THREADS];
__shared__ float Y4[NUM_THREADS];
__shared__ float Y5[NUM_THREADS];
__shared__ float Y6[NUM_THREADS];
__shared__ float Y7[NUM_THREADS];
__shared__ float Y8[NUM_THREADS];
__shared__ float Y9[NUM_THREADS];
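To make the indexing concrete, here is a minimal host-side sketch of the expression for k given above. The helper name flat_index and its parameter names are made up for illustration; they stand in for the CUDA built-ins blockIdx and threadIdx, and the sketch assumes threadIdx.x stays below NUM_THREADS.

```cpp
#include <cassert>

// Hypothetical helper mirroring the index expression from this post:
//   k = width * blockIdx.y + blockIdx.x * NUM_THREADS + threadIdx.x
// block_y, block_x and thread_x stand in for blockIdx.y, blockIdx.x
// and threadIdx.x; num_threads stands in for NUM_THREADS.
int flat_index(int width, int block_y, int block_x,
               int num_threads, int thread_x) {
    // Assumption: one block never holds more than num_threads threads in x.
    assert(thread_x >= 0 && thread_x < num_threads);
    return width * block_y + block_x * num_threads + thread_x;
}
```

For example, with width = 1024 and NUM_THREADS = 256, flat_index(1024, 2, 2, 256, 5) gives 2 * 1024 + 2 * 256 + 5 = 2565.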
After processing, I want to write each float X(i) into Y(i) at a position based on threadIdx.x, something like:
Y1[threadIdx.x + 1] = X1;
…
Yi[threadIdx.x - 2] = Xi;
(note that the +1 and -2 are simply shifts). I am running into bank conflicts on my hardware, which is compute capability 2.1 (L1/L2 cache, I guess). Could someone please point me to the right way to access Yi while avoiding bank conflicts? Thanks!
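As a way to check where the conflicts might come from, here is a small host-side model of how float indices map to shared memory banks. It is only a sketch under stated assumptions: 32 banks of 4-byte words (the layout documented for compute capability 2.x), a warp of 32 threads, each array starting at bank 0, and every thread touching a distinct element. The names bank_of and max_conflict_degree are hypothetical helpers, not CUDA API calls.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Assumption: 32 banks, each one 32-bit word wide (compute capability 2.x).
constexpr int kNumBanks = 32;
constexpr int kWarpSize = 32;

// Hypothetical helper: bank holding the float at word index idx,
// assuming the shared array starts at bank 0.
int bank_of(int idx) {
    assert(idx >= 0);  // a negative index would be out of bounds
    return idx % kNumBanks;
}

// Largest number of threads in one warp that land in the same bank when
// thread t accesses Y[stride * t + shift]. A result of 1 means the access
// is conflict-free under this model. first_thread is the threadIdx.x of
// the warp's first lane (0 for the first warp, 32 for the second, ...).
int max_conflict_degree(int stride, int shift, int first_thread) {
    assert(stride >= 1);  // distinct elements per thread, no broadcast case
    std::array<int, kNumBanks> hits{};  // zero-initialised hit counters
    for (int lane = 0; lane < kWarpSize; ++lane) {
        int t = first_thread + lane;
        ++hits[bank_of(stride * t + shift)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

In this model, max_conflict_degree(1, 1, 0) corresponds to Y1[threadIdx.x + 1] for the first warp and max_conflict_degree(1, -2, 32) to Yi[threadIdx.x - 2] for the second warp. Both return 1, because a constant shift only rotates which bank each thread uses. A stride of 2, as in max_conflict_degree(2, 0, 0), returns 2. The -2 case uses the second warp because threads 0 and 1 would otherwise index below the start of the array.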