Performance issue: R/W operation too slow

Hi guys,

I have a kernel function that takes as input a matrix in column-major form, the row and column of the pivot, and the number of rows and columns of the matrix. It needs to do the following:

  1. Test if the pivot is 0.

  2. If it is, find the row R of the first element in the same column that is != 0.

  3. Swap the pivot row with row R.

Here is the code, which should work; my problem is that for big matrices it is extremely slow.

In particular, I identified the bottleneck as the part commented out at the end of the function (which is obviously necessary to perform the row swap).

I really do not understand why it is so slow; in theory it should be just three R/W operations.

Why does that hurt performance so badly?

__global__ void swapRows( float* A, int pivRow, int pivCol, int rows, int cols)
{
	int index = blockDim.x * blockIdx.x + threadIdx.x;

	// Linear index of the pivot in the column-major matrix
	int pivInd = pivCol*rows + pivRow;

	// If the pivot is 0, scan the rest of its column for the
	// first non-zero element and remember its row
	int i = pivInd+1;
	int swapIndex = -1;
	if(A[pivInd] == 0){
		while(i < (pivCol+1)*rows && swapIndex == -1){
			if(A[i] != 0)
				swapIndex = i - pivCol*rows;	// row of the first non-zero element
			else
				i++;	//Let's increment i
		}
	}

	// Swap the pivot row with row swapIndex, one element per thread.
	// The three commented lines below are the bottleneck:
	float tmp = 0;
	if(swapIndex != -1 && index < rows*cols && index%rows - pivRow == 0){
//		tmp=A[index];
//		A[index]=A[index+(swapIndex-pivRow)];
//		A[index+(swapIndex-pivRow)]=tmp;
	}
}


You should try to use shared memory, which has very low latency, while global memory has high latency. Have a look at the matrix example in the programming guide.
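To make the suggestion concrete, here is a rough sketch of a swap kernel that stages the two rows through shared memory, in the spirit of the guide's matrix example. This is my own illustration, not code from the guide: the name `swapRowsShared` is made up, and it assumes the non-zero row `swapIndex` has already been found beforehand and the block size is 256 threads:

```cuda
// One thread per column: each thread swaps the two elements of its
// column that lie in rows pivRow and swapIndex, staging them through
// shared memory. swapIndex is assumed to be precomputed.
__global__ void swapRowsShared(float* A, int pivRow, int swapIndex,
                               int rows, int cols)
{
	__shared__ float pivVals[256];    // tile of the pivot row
	__shared__ float swapVals[256];   // tile of row swapIndex

	int col = blockDim.x * blockIdx.x + threadIdx.x;
	if (col < cols) {
		// Load both row elements of this column into shared memory
		pivVals[threadIdx.x]  = A[pivRow    + col * rows];
		swapVals[threadIdx.x] = A[swapIndex + col * rows];
	}
	__syncthreads();
	if (col < cols) {
		// Write each value back to the other row
		A[pivRow    + col * rows] = swapVals[threadIdx.x];
		A[swapIndex + col * rows] = pivVals[threadIdx.x];
	}
}
```

Note that in this particular case each thread only touches its own pair of elements, so registers would do the same job; the shared-memory staging mainly mirrors the structure of the guide's matrix example.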


Ok thanks, you mean the matrix transpose example?


Matrix multiplication, sec. 3.2.2 of the Programming Guide 2.3.


Ok thanks a lot.


Sorry, I'm still puzzled… In my case, how can I split my matrix into blocks and make the two or more blocks containing the elements to be swapped communicate with each other?