I have this CUDA kernel:
[codebox]__global__ void getArrays(float* inputMatrix, float* outputMatrix, int outMatrixColNum, int WinSize, int offset=2)
{
    // Global index of this thread
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    // Look for the output column x whose window starts at input index idx
    for(int x=0;x<outMatrixColNum;x++)
    {
        if (x*offset==idx)
        {
            // Copy WinSize consecutive input elements into column x
            for(int k=0;k<WinSize;k++)
            {
                outputMatrix[WinSize*x+k] = inputMatrix[idx+k];
            }
            break;
        }
    }
}[/codebox]
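It builds a matrix from a vector and stores it in column-major order: each column is a window of WinSize consecutive input elements, and consecutive windows start offset elements apart.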
For example, with the vector [1 2 3 4 5 6 7], WinSize=3 and offset=2,
the output is [1 2 3 3 4 5 5 6 7], i.e. the columns [1 2 3], [3 4 5] and [5 6 7].
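To make the intended mapping concrete, here is a small host-side reference in plain C++ (it should also compile as host code in a .cu file). It is only a sketch for checking results: the name getArraysReference is made up for illustration, and deriving the column count from the input length is my assumption, since the kernel takes outMatrixColNum as a parameter.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not the kernel itself).
// Column x of the output holds input[x*offset .. x*offset + winSize - 1],
// and the columns are stored one after another (column-major).
std::vector<float> getArraysReference(const std::vector<float>& input,
                                      int winSize, int offset = 2)
{
    std::vector<float> output;
    if (winSize <= 0 || offset <= 0 ||
        input.size() < static_cast<std::size_t>(winSize))
        return output;  // no complete window fits

    // Assumption: the column count is the number of complete windows.
    const int numCols =
        static_cast<int>((input.size() - winSize) / offset) + 1;

    output.resize(static_cast<std::size_t>(numCols) * winSize);
    for (int x = 0; x < numCols; ++x)
        for (int k = 0; k < winSize; ++k)
            output[winSize * x + k] = input[x * offset + k];
    return output;
}
```

Written this way, the start of window x is simply x*offset, so each output element depends only on its own (x, k) pair.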
My question is: how can I optimize this code, using shared memory or something else?
Thanks.