Performance issue: R/W operations too slow

Hi guys,

I have a kernel function that takes as input a matrix in column-major form, the row and column of the pivot, and the number of rows and columns of the matrix. It needs to do the following:

  1. Test if the pivot is 0

  2. If it is, find the row R of the first element in the same column that is != 0

  3. Swap the pivot row with R.

Here is the code, which should work. My problem is that for big matrices it is extremely slow.

In particular, I found that the bottleneck seems to be the part commented out at the end of the function (which is obviously necessary to perform the row swap).

I really do not understand why it is so slow; in theory it should be just three R/W operations.

Why does that affect performance so badly?

__global__ void swapRows(float* A, int pivRow, int pivCol, int rows, int cols)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < rows * cols) {
        int pivInd = pivCol * rows + pivRow;
        if (A[pivInd] == 0) {
            int i = pivInd + 1;
            int swapIndex = -1;
            while (i < (pivCol + 1) * rows && swapIndex == -1) {
                if (A[i] != 0) {
                    swapIndex = i;
                }
                else {
                    //Let's increment i
                    i++;
                }
            }

            __syncthreads();
            if (swapIndex != -1) {
                float tmp = 0;
                if (index % rows - pivRow == 0) {
//                  tmp=A[index];
//                  A[index]=A[index+(swapIndex-pivRow)];
//                  A[index+(swapIndex-pivRow)]=tmp;
                }
            }
        }
    }
}
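A possible restructuring, offered as a sketch and not as a definitive fix. Three things stand out in the kernel above. First, with the three lines commented out the kernel has no observable effect, so the compiler is free to optimize most of it away; the timing difference may therefore reflect the whole kernel and not just the three accesses. Second, every thread repeats the same sequential scan of the pivot column, and once the writes are enabled, one thread can overwrite A[pivInd] while threads in other blocks are still testing it (__syncthreads() only synchronizes threads within one block). Third, swapIndex is a linear index, so the offset between the two rows looks like it should be swapIndex - pivInd and not swapIndex - pivRow. The sketch below assumes it is acceptable to split the work into two launches; the names findSwapRow, swapTwoRows, pivotSwap and d_swapRow are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper kernel: a single thread scans the pivot column
// (column-major storage) and records the first row below the pivot that
// holds a non-zero entry, or -1 if no swap is needed or none is possible.
__global__ void findSwapRow(const float* A, int pivRow, int pivCol,
                            int rows, int* swapRow)
{
    *swapRow = -1;
    const float* col = A + (size_t)pivCol * rows;
    if (col[pivRow] != 0.0f) return;            // pivot is usable
    for (int r = pivRow + 1; r < rows; ++r) {
        if (col[r] != 0.0f) { *swapRow = r; return; }
    }
}

// Hypothetical swap kernel: one thread per column exchanges the two
// entries of that column that belong to rows pivRow and *swapRow.
__global__ void swapTwoRows(float* A, int pivRow, const int* swapRow,
                            int rows, int cols)
{
    int c = blockDim.x * blockIdx.x + threadIdx.x;
    int r = *swapRow;
    if (c >= cols || r < 0) return;             // out of range or no swap
    size_t base = (size_t)c * rows;
    float tmp = A[base + pivRow];
    A[base + pivRow] = A[base + r];
    A[base + r] = tmp;
}

// Host wrapper: d_A and d_swapRow are device pointers owned by the caller.
// Both launches go to the default stream, so they execute in order.
void pivotSwap(float* d_A, int* d_swapRow, int pivRow, int pivCol,
               int rows, int cols)
{
    findSwapRow<<<1, 1>>>(d_A, pivRow, pivCol, rows, d_swapRow);
    int threads = 256;
    int blocks = (cols + threads - 1) / threads;
    swapTwoRows<<<blocks, threads>>>(d_A, pivRow, d_swapRow, rows, cols);
}
```

Only cols threads are launched for the swap instead of rows*cols. Note that in column-major storage the elements of one row are rows floats apart, so these accesses cannot coalesce; that cost comes from the layout, not from this particular sketch.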

You should try using shared memory, which has very low latency, whereas global memory has high latency. Have a look at the matrix example in the Programming Guide.
Cheers,

luca

Ok thanks, you mean the transpose matrix example?

Thanks

Matrix multiplication, sec. 3.2.2 of Programming Guide 2.3.
Ciao,

luca

Ok thanks a lot.

ciao

Sorry, still puzzled… In my case, how can I split my matrix into blocks and make the two or more blocks that hold the elements to be swapped communicate with each other?
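On the question of blocks communicating: for the swap itself no communication seems to be needed, because the thread that handles column c only touches A[c*rows + pivRow] and A[c*rows + R]. Where cooperation does help is the search for R, and that can be kept inside a single block, so shared memory and __syncthreads() are enough. A minimal sketch, assuming a power-of-two block size (256 here); the names findSwapRowBlock, BLOCK and swapRow are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <climits>

#define BLOCK 256   // assumed block size; must be a power of two

// Launch with exactly one block of BLOCK threads, e.g.
//   findSwapRowBlock<<<1, BLOCK>>>(d_A, pivRow, pivCol, rows, d_swapRow);
// Writes to *swapRow the first row below the pivot whose entry in the
// pivot column is non-zero, or -1 if the pivot is already non-zero or
// no such row exists.
__global__ void findSwapRowBlock(const float* A, int pivRow, int pivCol,
                                 int rows, int* swapRow)
{
    __shared__ int best[BLOCK];
    const float* col = A + (size_t)pivCol * rows;

    // Each thread walks the column with a stride of blockDim.x; since r
    // only increases, its first hit is the smallest row it can find.
    int mine = INT_MAX;
    for (int r = pivRow + 1 + threadIdx.x; r < rows; r += blockDim.x) {
        if (col[r] != 0.0f) { mine = r; break; }
    }
    best[threadIdx.x] = mine;
    __syncthreads();

    // Tree reduction in shared memory: keep the minimum row index.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            best[threadIdx.x] = min(best[threadIdx.x], best[threadIdx.x + s]);
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        bool needSwap = (col[pivRow] == 0.0f) && (best[0] != INT_MAX);
        *swapRow = needSwap ? best[0] : -1;
    }
}
```

Every thread reaches each __syncthreads() call because there is no early return. The result stays in device memory, so a following swap kernel launched on the same stream can read it without a round trip to the host.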