Hi guys,
I have a kernel function that takes as input a matrix in column-major form, the row and column of the pivot, and the number of rows and columns of the matrix. It needs to do the following:

Test if the pivot is 0

if it is, find the row R of the first element in the same column that is != 0

swap the pivot row with R.
Here is the code, which should work; my problem is that for big matrices it is extremely slow.
In particular, I identified that the bottleneck seems to be the part commented out at the end of the function (which is obviously necessary to perform the row swap).
I really do not understand why it is so slow; in theory it should be just three R/W operations.
Why does it affect performance so badly?
__global__ void swapRows(float* A, int pivRow, int pivCol, int rows, int cols)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < rows * cols) {
        // Linear index of the pivot in column-major storage
        int pivInd = pivCol * rows + pivRow;
        if (A[pivInd] == 0) {
            int i = pivInd + 1;
            int swapIndex = -1;
            // Scan the rest of the pivot column for the first nonzero entry
            while (i < (pivCol + 1) * rows && swapIndex == -1) {
                if (A[i] != 0) {
                    swapIndex = i;
                }
                else {
                    // Let's increment i
                    i++;
                }
            }
            __syncthreads();
            if (swapIndex != -1) {
                float tmp = 0;
                if (index % rows - pivRow == 0) {
                    // tmp=A[index];
                    // A[index]=A[index+(swapIndex-pivRow)];
                    // A[index+(swapIndex-pivRow)]=tmp;
                }
            }
        }
    }
}
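
For comparison, here is a minimal sketch of the same three steps (test the pivot, find row R, swap) split into two kernels, so that the column scan runs once instead of being repeated by every thread, and only one thread per column takes part in the swap. The names findSwapRow, swapTwoRows and swapRowsOnDevice, the dSwap buffer and the block size of 256 are all hypothetical choices of mine for illustration, and I have not measured whether this is actually faster.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: a single thread scans the pivot column (column-major
// storage) and records the first row below the pivot with a nonzero entry.
// It writes -1 if the pivot is already nonzero or no such row exists.
__global__ void findSwapRow(const float* A, int pivRow, int pivCol,
                            int rows, int* swapRow)
{
    if (blockIdx.x != 0 || threadIdx.x != 0) return;
    *swapRow = -1;
    if (A[pivCol * rows + pivRow] != 0.0f) return;
    for (int r = pivRow + 1; r < rows; ++r) {
        if (A[pivCol * rows + r] != 0.0f) { *swapRow = r; return; }
    }
}

// Hypothetical helper: one thread per column swaps the entries of the
// pivot row and the row found by findSwapRow.
__global__ void swapTwoRows(float* A, int pivRow, const int* swapRow,
                            int rows, int cols)
{
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int r = *swapRow;
    if (col >= cols || r < 0) return;   // nothing to swap
    float tmp = A[col * rows + pivRow];
    A[col * rows + pivRow] = A[col * rows + r];
    A[col * rows + r] = tmp;
}

// Host wrapper, assuming A is a host vector in column-major order.
// Returns the row that was swapped with pivRow, or -1 if no swap happened.
// Error checking of the CUDA calls is omitted to keep the sketch short.
int swapRowsOnDevice(std::vector<float>& A, int pivRow, int pivCol,
                     int rows, int cols)
{
    assert(A.size() == static_cast<size_t>(rows) * cols);
    float* dA = nullptr;
    int* dSwap = nullptr;
    int hSwap = -1;
    size_t bytes = A.size() * sizeof(float);

    cudaMalloc(&dA, bytes);
    cudaMalloc(&dSwap, sizeof(int));
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);

    // Both kernels go to the default stream, so the search finishes
    // before the swap starts.
    findSwapRow<<<1, 1>>>(dA, pivRow, pivCol, rows, dSwap);
    int threads = 256;
    int blocks = (cols + threads - 1) / threads;
    swapTwoRows<<<blocks, threads>>>(dA, pivRow, dSwap, rows, cols);

    cudaMemcpy(A.data(), dA, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(&hSwap, dSwap, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dSwap);
    return hSwap;
}
```

In this sketch the offset between the two rows is applied per column as col * rows + r, rather than being added to a linear index. The host wrapper copies the whole matrix back and forth only to keep the example self-contained; in a real elimination loop the matrix would presumably stay on the device.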