Shifted copy of Vector

I have this function:

[codebox]__global__ void getArrays(float* inputMatrix, float* outputMatrix, int outMatrixColNum, int WinSize, int offset=2)
{
    // Global thread index. Only threads with idx == x*offset for some column x do any work.
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    for(int x=0;x<outMatrixColNum;x++)
    {
        if (x*offset==idx)
        {
            // Copy the window of WinSize elements starting at idx into column x.
            for(int k=0;k<WinSize;k++)
            {
                outputMatrix[WinSize*x+k] = inputMatrix[idx+k];
            }
            break;
        }
    }
}[/codebox]

It builds a matrix, stored in column-major order, from a vector.

For example, given the vector [1 2 3 4 5 6 7] with WinSize=3 and offset=2, the output is [1 2 3 3 4 5 5 6 7].
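To pin down the indexing, here is a minimal host-side reference sketch. It is not part of the original code: the name `shiftedCopyReference` and its parameter names are made up for illustration, and it assumes the input holds at least (numCols-1)*offset + winSize elements.

```cuda
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (illustration only).
// Column x of the output is the window of winSize consecutive input
// elements starting at position x * offset.
// Assumes input.size() >= (numCols - 1) * offset + winSize.
std::vector<float> shiftedCopyReference(const std::vector<float>& input,
                                        int numCols, int winSize, int offset)
{
    std::vector<float> output(static_cast<std::size_t>(numCols) * winSize);
    for (int x = 0; x < numCols; ++x)
        for (int k = 0; k < winSize; ++k)
            output[winSize * x + k] = input[x * offset + k];
    return output;
}
```

Called with the vector above, numCols=3, winSize=3 and offset=2, this returns 1 2 3 3 4 5 5 6 7, which can serve as a check for any GPU version.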

My question is: how can I optimize this code, using shared memory or something else?

Thanks.

Depending on the size of the matrix, I would suggest something like

[codebox]
extern "C"
__global__ void getArrays(float* inputMatrix, float* outputMatrix, int columns, int rows, int step)
{
    int c = threadIdx.x;  // one thread per output column
    for(int r=0;r<rows;r++)
    {
        outputMatrix[rows*c+r] = inputMatrix[c*step+r];
    }
}
[/codebox]

for matrices with <=512 columns, to be launched with one thread per column. This is certainly far from optimal concerning memory coalescing, but should be faster than the original function…
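On the coalescing point, one possible direction is to give each thread a single output element instead of a whole column, so that neighbouring threads write neighbouring addresses. The following is only a sketch under assumptions, not a measured result: the names `getArraysPerElement` and `runShiftedCopy` are hypothetical, the block size of 256 is an arbitrary choice, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one thread per OUTPUT element, so consecutive
// threads write consecutive addresses in outputMatrix.
__global__ void getArraysPerElement(const float* inputMatrix, float* outputMatrix,
                                    int columns, int rows, int step)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // flat output index
    if (i >= columns * rows) return;                // guard for the last block
    int c = i / rows;                               // output column
    int r = i % rows;                               // position inside the window
    outputMatrix[i] = inputMatrix[c * step + r];
}

// Hypothetical host helper: copies the input to the device, launches the
// kernel, and copies the result back. Error checking omitted for brevity.
std::vector<float> runShiftedCopy(const std::vector<float>& input,
                                  int columns, int rows, int step)
{
    const int total = columns * rows;
    std::vector<float> output(total);
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, input.size() * sizeof(float));
    cudaMalloc(&dOut, total * sizeof(float));
    cudaMemcpy(dIn, input.data(), input.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    const int threads = 256;                        // assumed block size
    const int blocks = (total + threads - 1) / threads;
    getArraysPerElement<<<blocks, threads>>>(dIn, dOut, columns, rows, step);

    cudaMemcpy(output.data(), dOut, total * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return output;
}
```

With this layout the writes are contiguous across threads and the reads are contiguous within each window. It also removes the one-block limit on the number of columns, since the grid is sized from columns*rows. Whether it is actually faster would need to be timed on the target card.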

Further optimizations are left to the CUDA experts ;)