Write/read shared memory on compute capability 2.1

Hi all,

I am processing data from a matrix of width "width", stored as a 1D array (data) and accessed with the pattern k = width * blockIdx.y + blockIdx.x * NUM_THREADS + threadIdx.x. I read the array into floats X(i) as X(i) = data(k). The Y(i) are arrays whose size corresponds to the number of threads.

__shared__ float X1;
__shared__ float X2;
__shared__ float X3;
__shared__ float X4;
__shared__ float X5;
__shared__ float X6;
__shared__ float X7;
__shared__ float X8;
__shared__ float X9;

__shared__ float Y1[NUM_THREADS];
__shared__ float Y2[NUM_THREADS];
__shared__ float Y3[NUM_THREADS];
__shared__ float Y4[NUM_THREADS];
__shared__ float Y5[NUM_THREADS];
__shared__ float Y6[NUM_THREADS];
__shared__ float Y7[NUM_THREADS];
__shared__ float Y8[NUM_THREADS];
__shared__ float Y9[NUM_THREADS];

After processing, I want to write each float X(i) into Y(i) at position threadIdx.x (plus a shift). Something like:

Y1[threadIdx.x + 1] = X1;

Yi[threadIdx.x - 2] = Xi;

(Note that the +1 or -2 are simply shifts.) I am running into bank conflicts; my hardware is compute capability 2.1 (L1/L2 cache, I guess). Could someone please point me to the right way to access Yi while avoiding bank conflicts? Thanks!
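To make the pattern concrete, here is a stripped-down sketch of what I mean (only one of the X/Y pairs is shown, the processing step is a dummy, X1 is kept in a register just to keep the sketch short, and the shift is wrapped so the index stays in bounds):

#define NUM_THREADS 64   // example value; the real block size is a multiple of 32

__global__ void process(const float *data, float *data_out, int width)
{
    __shared__ float Y1[NUM_THREADS];

    // Access pattern described above.
    int k = width * blockIdx.y + blockIdx.x * NUM_THREADS + threadIdx.x;

    float X1 = data[k];      // read from global memory
    X1 = X1 * 2.0f;          // dummy processing

    // Shifted write into shared memory (wrapped so the index stays in range).
    Y1[(threadIdx.x + 1) % NUM_THREADS] = X1;
    __syncthreads();

    // Write back to global memory.
    data_out[k] = Y1[threadIdx.x];
}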

If you want to avoid bank conflicts, one solution is to have each thread read/write the bank position consecutive to the previous thread's. In your case, if NUM_THREADS is a multiple of 32, you won't have problems with:

Yi[threadIdx.x] = Xi;

Moreover, reading Xi doesn't cause problems, because all the threads in the warp read the same value of X at the same time, so the broadcast doesn't cause any bank conflict.
Remember that you have 32 banks; to avoid conflicts, each thread of a warp needs to access a different bank.
There are examples in the CUDA Programming Guide, Appendix F.5.3 (Shared Memory).
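To make it concrete, here is a rough sketch contrasting a conflict-free access with a conflicting one (assuming 32 banks of 4-byte words, as on compute capability 2.x, a single block of 64 threads, and placeholder kernel bodies):

#define NUM_THREADS 64   // assumed block size, multiple of 32

// Conflict-free: thread t of each warp touches bank (t % 32), one bank per thread,
// and the read of the shared scalar X1 is a broadcast.
__global__ void no_conflict(float *out)
{
    __shared__ float X1;
    __shared__ float Y1[NUM_THREADS];

    if (threadIdx.x == 0) X1 = 1.0f;
    __syncthreads();

    Y1[threadIdx.x] = X1;              // consecutive banks across the warp
    __syncthreads();
    out[threadIdx.x] = Y1[threadIdx.x];
}

// Conflicting: a stride of two floats makes threads t and t+16 of the same warp
// hit the same bank (a 2-way conflict).
__global__ void strided_conflict(float *out)
{
    __shared__ float Y1[2 * NUM_THREADS];

    Y1[2 * threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = Y1[2 * threadIdx.x];
}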

Thanks! I have a further issue about memory alignment. With regard to the previous question, let me start from scratch to explain myself better: I have a square matrix "mat_2D" with size_x = size_y (say, 64). I flatten the matrix into a 1D array, mat_1D, and copy it to global memory as:

for (y = 0; y < size_y; y++) {
    for (x = 0; x < size_x; x++) {
        k = size_x * y + x;
        mat_1D[k] = mat_2D[y][x];
    }
}

CUDA_SAFE_CALL(cudaMemcpy(mat_1D_dev, mat_1D, size_x * size_y * sizeof(float), cudaMemcpyHostToDevice));
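For completeness, a minimal, self-contained version of this host-side step (size_x = size_y = 64 hard-coded, CUDA_SAFE_CALL replaced by a plain error check):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define SIZE_X 64
#define SIZE_Y 64

int main(void)
{
    static float mat_2D[SIZE_Y][SIZE_X];
    float *mat_1D = (float *)malloc(SIZE_X * SIZE_Y * sizeof(float));
    float *mat_1D_dev = NULL;

    // Fill the matrix with something recognisable.
    for (int y = 0; y < SIZE_Y; y++)
        for (int x = 0; x < SIZE_X; x++)
            mat_2D[y][x] = (float)(SIZE_X * y + x);

    // Flatten row-wise into the 1D array.
    for (int y = 0; y < SIZE_Y; y++)
        for (int x = 0; x < SIZE_X; x++)
            mat_1D[SIZE_X * y + x] = mat_2D[y][x];

    // Copy to global memory (plain error check instead of CUDA_SAFE_CALL).
    cudaMalloc((void **)&mat_1D_dev, SIZE_X * SIZE_Y * sizeof(float));
    cudaError_t err = cudaMemcpy(mat_1D_dev, mat_1D,
                                 SIZE_X * SIZE_Y * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
        printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));

    cudaFree(mat_1D_dev);
    free(mat_1D);
    return 0;
}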

So I have the 2D matrix stored as a 1D array (row-wise) in global memory. I now want to move it into shared memory for processing. I want to process the matrix by rows, so the thread block is (size_x, 1), i.e. (64, 1), with 64 blocks. I allocate shared variables as:

__shared__ float Y1[64];

__shared__ float X1;

X1 = mat_1D[idx];

// do stuff, update X1 and copy to Y1 (with some shifts, say for example + 1)

Y1[threadIdx.x + 1] = X1;

__syncthreads();

// now write back to global memory

data_out[idx] = Y1[threadIdx.x];

// -------------

My problem is how to define the global index idx, which I know should be of the form:

idx = blockIdx.y * size_x + threadIdx.x;

but I want to avoid bank conflicts (addressed above) and also keep corresponding row elements (i.e. mat_2D[i][j-1], mat_2D[i][j] and mat_2D[i][j+1]) aligned, say to 16*sizeof(float). So in summary it should be something like "BaseAddress + sizeof(float)*threadIdx.x". I tried providing an offset like:

int offset = 16;
int idx = blockIdx.y * size_x + offset + sizeof(float)*threadIdx.x;

But I am not able to achieve the alignment I want. I am looking at the shared memory arrangement shown below (for compute capability 1.x; for 2.1 it should be the same but with 32 banks). Is that correct? Could you help me define idx correctly?

          bank 0  ...  bank 15
row  0  [    0   ...     15   ]
row  1  [   16   ...     31   ]
row  2  [   32   ...     47   ]
row  3  [   48   ...     63   ]
row  4  [   64   ...     79   ]
row  5  [   80   ...     95   ]
row  6  [   96   ...    111   ]
row  7  [  112   ...    127   ]
row  8  [  128   ...    143   ]
row  9  [  144   ...    159   ]
row 10  [  160   ...    175   ]
row 11  [  176   ...    191   ]
row 12  [  192   ...    207   ]
row 13  [  208   ...    223   ]
row 14  [  224   ...    239   ]
row 15  [  240   ...    255   ]
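In other words, if I am reading the guide correctly, the bank holding element i of a __shared__ float array should simply be:

// Banks are 32 bits wide; successive 32-bit words go to successive banks.
// 16 banks on compute capability 1.x, 32 banks on 2.x.
__device__ int bank_of(int i, int num_banks)
{
    return i % num_banks;   // a float occupies exactly one 32-bit word
}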

Thanks in advance for any help!

The subsection titled "Strided Order to Thread Order" gives a very good overview of how to read and write global memory in a coalesced manner while working with the data in shared memory without bank conflicts.
http://www.moderngpu.com/intro/workflow.html

While that site can explain it much better than I can, the long story short is that you need to use different indexing schemes: you read from global memory with one indexing scheme, then write to shared memory with a different one, so that each thread in a row writes to a different bank (hence the shared-memory stride being declared relatively prime to the warp size, i.e. 33 elements).

i.e. to read data from global to shared

    const int Stride = WARP_SIZE + 1;
    const int SharedSize = VALUES_PER_THREAD * Stride;
    __shared__ volatile uint shared[SharedSize];
 
    // Index into shared memory with strided order.
    volatile uint* stridedPointer = shared + tid;

    ....

    #pragma unroll
    for(int i = 0; i < VALUES_PER_THREAD; ++i) {
        uint globalIndex = i * WARP_SIZE + tid;     // Global memory index
        uint sharedIndex = i * (WARP_SIZE + 1);     // Use relative prime stride
        stridedPointer[sharedIndex] = values_global[globalIndex];
    }

    ....

    // Index into shared memory with thread order.
    uint offset = VALUES_PER_THREAD * tid;
    offset += offset / WARP_SIZE;
    volatile uint* threadPointer = shared + offset;

    // Load from shared memory into thread order.
    uint values[VALUES_PER_THREAD];
    #pragma unroll
    for(int i = 0; i < VALUES_PER_THREAD; ++i)
        // Even with the relatively prime number adjustment, values for each
        // thread are still adjacent in shared memory, so do a direct copy out
        // of threadPointer.
        values[i] = threadPointer[i];

EDIT:
I should also point out that writing the data back from shared memory to global memory uses the same scheme, and that all the code posted above is taken directly from the linked URL. I'd highly recommend giving it a look.
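For the write-back direction, reversing the indices of the snippet above gives roughly this (my paraphrase under the same definitions of tid, values, stridedPointer and threadPointer as in the quoted code, not verbatim from the site):

    // Store from thread order back into the stride-33 shared layout.
    #pragma unroll
    for(int i = 0; i < VALUES_PER_THREAD; ++i)
        threadPointer[i] = values[i];

    // Then write out to global memory in strided (coalesced) order; with the
    // stride of WARP_SIZE + 1, each thread of the warp reads a different bank.
    #pragma unroll
    for(int i = 0; i < VALUES_PER_THREAD; ++i) {
        uint globalIndex = i * WARP_SIZE + tid;
        uint sharedIndex = i * (WARP_SIZE + 1);
        values_global[globalIndex] = stridedPointer[sharedIndex];
    }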