Copying a matrix window from global memory to a vector in shared memory while avoiding bank conflicts

I need to avoid bank conflicts when copying a window from a matrix in global memory to a vector in shared memory, something like this:

[codebox]// needed to zero all the elements of the vector
*(subvector + thidx) = 0;

*(subvector + (n * colmatrix) + thidx) = *(in + ((colstep * blockCol) + thidx) * rowimg + ((rowstep * blockRow) + n));[/codebox]

This code implements a window moving over a matrix; colstep and rowstep give how far the window moves in the x and y directions.


I assume that:

  • subvector is a pointer to an array of floats/ints which resides in shared memory
  • in is a pointer to an array of floats/ints which resides in global memory.

Your solution yields no bank conflicts: consecutive threads access consecutive 32-bit words of the subvector array, which is the most convenient way of accessing shared memory. However, I see some other problems:
a) if colwind is small compared to the size of your block, many threads and warps will stay idle and won’t help hide access latencies
b) if rowimg is big, your warp will access global memory in a very inefficient way: each thread accesses a piece of memory that is far away from the cells the other threads are accessing.

Regarding b: if you only ever read columns of your matrix and never rows, you could store the matrix in transposed form in global memory, so that loading a single vector amounts to moving one contiguous piece of memory from global to shared.
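
To make point b concrete, here is a minimal sketch (not the original poster's code) of a per-block window load in which consecutive threads read consecutive global addresses. It assumes the transposed layout suggested above, i.e. element (row r, column c) lives at in_t[r * colimg + c], that the window lies entirely inside the matrix, and that blockCol and blockRow correspond to blockIdx.x and blockIdx.y. The names in_t, out, windowIndex and loadWindow are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Offset of window element (row n, column t) in the transposed matrix
// (assumed layout: element (r, c) at in_t[r * colimg + c]).
// For a fixed n, consecutive t map to consecutive addresses, so the reads
// of a warp fall into one contiguous segment of global memory.
__host__ __device__ inline int windowIndex(int rowStart, int colStart,
                                           int n, int t, int colimg)
{
    return (rowStart + n) * colimg + (colStart + t);
}

// One block per window. Launch with blockDim.x >= colwind and
// rowwind * colwind * sizeof(float) bytes of dynamic shared memory.
__global__ void loadWindow(const float *in_t, float *out, int colimg,
                           int rowwind, int colwind,
                           int rowstep, int colstep)
{
    extern __shared__ float subvector[];

    const int t        = threadIdx.x;
    const int rowStart = rowstep * blockIdx.y;  // assumed meaning of blockRow
    const int colStart = colstep * blockIdx.x;  // assumed meaning of blockCol

    if (t < colwind) {
        for (int n = 0; n < rowwind; ++n) {
            // Shared write: thread t hits word n * colwind + t, so
            // consecutive threads hit consecutive words.
            subvector[n * colwind + t] =
                in_t[windowIndex(rowStart, colStart, n, t, colimg)];
        }
    }
    __syncthreads();

    // Placeholder consumer: copy the window back out so it can be inspected.
    const int block = blockIdx.y * gridDim.x + blockIdx.x;
    if (t < colwind) {
        for (int n = 0; n < rowwind; ++n) {
            out[block * rowwind * colwind + n * colwind + t] =
                subvector[n * colwind + t];
        }
    }
}
```

A launch could look like loadWindow<<<grid, 32, rowwind * colwind * sizeof(float)>>>(in_t, out, colimg, rowwind, colwind, rowstep, colstep). Threads with t >= colwind still idle here, so this sketch addresses only point b, not point a.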

I need to copy a window with sides rowwind and colwind from a matrix of size rowimg x colimg.
The window moves over the matrix along the x and y directions in steps given by rowstep and colstep.
I have one window per block and a blockSize of 32.

I need to copy the window into a vector of 32 elements, zeroing the elements beyond the end of the window (in case (rowwind x colwind) < 32).
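
One possible reading of this requirement, sketched below, is that each of the 32 threads fills exactly one slot of the vector and writes zero when its slot lies past the end of the window. This is only a sketch under assumptions: rowwind * colwind is at most 32, the block has exactly 32 threads, the matrix is laid out as in the first post (element (r, c) at in[c * rowimg + r]), the window lies inside the matrix, and blockCol and blockRow mean blockIdx.x and blockIdx.y. The names slotToGlobal, windowToVector, out and VEC_SIZE are hypothetical.

```cuda
#include <cuda_runtime.h>

#define VEC_SIZE 32  // assumed block size and length of the shared vector

// Global offset of the element stored in slot `slot` of the flattened
// window (slot = n * colwind + c), or -1 when the slot lies past the end
// of the window and must be zero-filled. Assumes colwind > 0.
__host__ __device__ inline int slotToGlobal(int slot, int rowwind, int colwind,
                                            int rowStart, int colStart,
                                            int rowimg)
{
    if (slot >= rowwind * colwind) return -1;
    const int n = slot / colwind;  // row inside the window
    const int c = slot % colwind;  // column inside the window
    return (colStart + c) * rowimg + (rowStart + n);
}

// One block of VEC_SIZE threads per window.
__global__ void windowToVector(const float *in, float *out, int rowimg,
                               int rowwind, int colwind,
                               int rowstep, int colstep)
{
    __shared__ float subvector[VEC_SIZE];

    const int t   = threadIdx.x;
    const int src = slotToGlobal(t, rowwind, colwind,
                                 rowstep * blockIdx.y,  // assumed blockRow
                                 colstep * blockIdx.x,  // assumed blockCol
                                 rowimg);

    // Every thread writes its own slot: real data or zero padding.
    subvector[t] = (src >= 0) ? in[src] : 0.0f;
    __syncthreads();

    // Placeholder consumer: write the vector out so it can be inspected.
    const int block = blockIdx.y * gridDim.x + blockIdx.x;
    out[block * VEC_SIZE + t] = subvector[t];
}
```

With this mapping no thread sits idle and no loop over n is needed, and thread t writes shared word t. The global reads are still strided by rowimg, as in the original code, so the layout of the matrix in global memory remains the thing to look at for speed.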

There may be some errors in the code I posted earlier.

I’ve implemented this algorithm but it is far too slow; I need to make it run faster.
How can I do this?

P.S.: in is a pointer to a matrix which resides in global memory.