Bank conflicts with 2D shared mem array: resolving bank conflicts


I’m having trouble working out how to deal with the large number of bank conflicts in my code:

__shared__ float f1Patch[16][16];

f1Patch[threadIdx.x][threadIdx.y] = someValue;  // someValue: the element this thread loads from global memory

Each of my blocks has 16x16 threads, so each thread loads one element from global memory into shared memory.

I think I understand why I have bank conflicts: the first 16 writes (i.e. y = 0) should be OK, but each subsequent row will try to access the same banks as all of the other rows.

Is there anything I can do to optimize this? I’m trying to grab a square patch of pixels from an image for processing.

Thanks in advance!

Have a look at the Transpose example. They’re grabbing a square block, and there are bank conflicts when the rows are read. They declare the shared array like this:

shared[16][16 + 1]

I think this staggers the accesses to the rows: you still use 16 positions per row, but the row stride is 17, so consecutive rows start in different banks.
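To make the padding concrete, here is a minimal sketch, not taken from the Transpose example itself. It assumes 16 shared-memory banks of 32-bit words, with conflicts counted per half-warp of 16 threads that have consecutive threadIdx.x (the older hardware this 16x16 sizing suggests), and image dimensions that are multiples of 16. The kernel name loadPatch, the out buffer and the width parameter are made up for illustration. With f1Patch[16][16] the word offset is threadIdx.x * 16 + threadIdx.y, so the bank is threadIdx.y for every thread in the half-warp and all 16 writes land in one bank. With f1Patch[16][17] the offset is threadIdx.x * 17 + threadIdx.y, the bank is (threadIdx.x + threadIdx.y) % 16, and the 16 threads hit 16 different banks.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define PATCH 16  // block is PATCH x PATCH threads, as in the question

// Hypothetical kernel: stages one 16x16 patch per block in shared memory,
// then writes it back out unchanged so the result can be checked on the host.
__global__ void loadPatch(const float* image, float* out, int width)
{
    // One extra column of padding: the row stride is 17 floats instead of 16.
    __shared__ float f1Patch[PATCH][PATCH + 1];

    int x = blockIdx.x * PATCH + threadIdx.x;
    int y = blockIdx.y * PATCH + threadIdx.y;

    // Same indexing as in the question. Word offset = threadIdx.x * 17 + threadIdx.y,
    // so under the 16-bank assumption bank = (threadIdx.x + threadIdx.y) % 16,
    // which is different for each threadIdx.x in a half-warp.
    f1Patch[threadIdx.x][threadIdx.y] = image[y * width + x];
    __syncthreads();

    out[y * width + x] = f1Patch[threadIdx.x][threadIdx.y];
}

int main()
{
    const int width = 64, height = 64;  // assumed to be multiples of PATCH
    const int n = width * height;

    std::vector<float> hIn(n), hOut(n, 0.0f);
    for (int i = 0; i < n; ++i) hIn[i] = static_cast<float>(i);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(PATCH, PATCH);
    dim3 grid(width / PATCH, height / PATCH);
    loadPatch<<<grid, block>>>(dIn, dOut, width);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != hIn[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(dIn);
    cudaFree(dOut);
    return mismatches != 0;
}
```

If nothing later in the kernel depends on the [x][y] layout, indexing as f1Patch[threadIdx.y][threadIdx.x] without padding should also avoid the conflict on the store, since consecutive threadIdx.x then map to consecutive words. On devices with 32 banks the same idea applies, but the conflict pattern for a 16-wide tile is different, so it is worth checking with the profiler.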

Does that make sense, and is it applicable to your case?