Hi!
I’m having trouble figuring out what to do about the large number of bank conflicts in my code:
__shared__ float f1Patch[16][16];
f1Patch[threadIdx.x][threadIdx.y] = someValue; // one element loaded from global memory
Each of my blocks has 16x16 threads, and each thread loads one element from global memory into shared memory.
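I think I understand why I have bank conflicts. Assuming 16 banks, each row of f1Patch is 16 floats wide, so element [r][c] lands in bank c no matter which row it is in. Since threadIdx.x selects the row, threads with consecutive threadIdx.x (which run together in the same half-warp) differ only in the row and all end up in the same bank.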
Is there anything I can do to optimize this? I’m trying to grab a square patch of pixels from an image for processing.
Thanks in advance!