Row swaps in a column-major matrix

Hi,

I would like to write a routine that swaps rows in a matrix stored in column-major order. My goal is to reproduce the behavior of the LAPACK routine DLASWP:

http://www.netlib.org/lapack/explore-3.1.1…l/dlaswp.f.html
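For reference, what DLASWP does boils down to this on the CPU (my own simplified sketch of the netlib routine, restricted to the INCX = 1 case; names are mine):

// Simplified CPU reference for DLASWP (INCX == 1 only). A is column-major
// with leading dimension LDA; for each i from K1 to K2 (1-based, as in
// LAPACK), row i is swapped with row IPIV[i-1] (also 1-based here).
void dlaswp_reference(int N, double* A, int LDA, int K1, int K2, const int* IPIV)
{
	for(int i=K1;i<=K2;++i)
	{
		int ip=IPIV[i-1];
		if(ip!=i)
		{
			for(int j=0;j<N;++j)
			{
				double tmp=A[(i-1)+LDA*j];
				A[(i-1)+LDA*j]=A[(ip-1)+LDA*j];
				A[(ip-1)+LDA*j]=tmp;
			}
		}
	}
}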

Here is the routine I have written so far:

__device__
void interchanger(int i1,int i2,double* A,int LDA)
{
	// Each thread swaps the two row entries of its own column (column-major
	// storage: element (i,j) sits at A[i+LDA*j]).
	unsigned int idx=threadIdx.x + blockDim.x * blockIdx.x;
	double aux2=A[i2+LDA*idx];
	double aux1=A[i1+LDA*idx];
	A[i2+LDA*idx]=aux1;
	A[i1+LDA*idx]=aux2;
}

template<unsigned int step>
__global__
void DLASWPr_kernel(int N,double* A,int LDA,int I1,int I2,int* IPIV,int IX0, int INC,int INCX)
{
	unsigned int tx=threadIdx.x;
	unsigned int idx=threadIdx.x + blockDim.x * blockIdx.x;
	int IX=IX0;

	if(step==BLOCK)
	{
		// The pivot block has exactly BLOCK entries: stage them in shared
		// memory once per block, then let every thread apply the whole swap
		// sequence to its own column.
		__shared__ int IPIVshar[BLOCK];
		IPIVshar[tx]=IPIV[tx+IX0-1];
		__syncthreads();

		if(idx<N)
		{
			int I=I1;
			int i2,i1;
#pragma unroll BLOCK
			for(int i=0;i<BLOCK;++i)
			{
				i2=IPIVshar[i];
				i1=I-1;
				if(i2!=i1)
				{
					interchanger(i1,i2,A,LDA);
				}
				IX+=INCX;
				I+=INC;
			}
		}
	}
	else if(idx<N)
	{
		// General case: read the pivot indices straight from global memory.
		for(int I=I1;I<=I2;I+=INC)
		{
			int i2=IPIV[IX-1];
			int i1=I-1;
			if(i2!=i1)
			{
				interchanger(i1,i2,A,LDA);
			}
			IX+=INCX;
		}
	}
	__syncthreads();
}

void DLASWP(int N,double* A,int LDA,int K1,int K2,int* IPIV, int INCX)
{
	int IX0,I1,I2,INC;
	int* gpuIPIV;

	if(INCX==0)
		return;
	if(INCX>0)
	{
		// Apply the pivots K1..K2 in increasing order.
		IX0=K1;
		I1=K1;
		I2=K2;
		INC=1;
	}
	if(INCX<0)
	{
		// Apply the pivots in decreasing order, as in the reference DLASWP.
		IX0=1+(1-K2)*INCX;
		I1=K2;
		I2=K1;
		INC=-1;
	}

	// IPIV is pinned host memory mapped into the device address space.
	cudaHostGetDevicePointer((void**)&gpuIPIV,(void*)(IPIV),0);

	dim3 grid((N/BLOCK)+1);
	dim3 threads(BLOCK);

	//DLASWPcublas(N,A,LDA,I1,I2,IPIV,IX0,INC,INCX);
	if(K2-K1+1<LDA)
	{
		if(K2-K1+1==BLOCK)
			DLASWPr_kernel<BLOCK><<<grid,threads>>>(N,A,LDA,I1,I2,gpuIPIV,IX0,INC,INCX);
		else
			DLASWPr_kernel<0><<<grid,threads>>>(N,A,LDA,I1,I2,gpuIPIV,IX0,INC,INCX);
	}
	else
	{
		// Many rows to swap: transpose to row-major storage, swap the now
		// contiguous "columns", then transpose back.
		double* cached;
		cublasAlloc(N*LDA,sizeof(double),(void**)&cached);
		transpose(LDA,N,A,cached);
		DLASWPc_kernel<<<grid,threads>>>(LDA,cached,N,I1,I2,gpuIPIV,IX0,INC,INCX);
		transpose(N,LDA,cached,A);
		cublasFree(cached);
	}
}

IPIV is a 1D pinned array that holds the permutation indices (i.e. for each i between I1 and I2, rows i-1 and IPIV[i-1] are swapped; the -1 accounts for the difference between C and Fortran array indexing).
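To make the convention concrete, a small example (values chosen arbitrarily):

/* Example with I1 = 1, I2 = 3 and IPIV = {2, 2, 2} (0-based row values, as
   described above):
     I = 1: rows 0 and IPIV[0] = 2 are swapped
     I = 2: rows 1 and IPIV[1] = 2 are swapped
     I = 3: IPIV[2] = 2 == I-1, so nothing happens */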

According to the NVIDIA profiler, the bandwidth is about 5 GB/s (I have a GTX 280), which is rather slow for an operation that only touches device memory.
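For the record, this is how I cross-check the profiler number: time the call with CUDA events and divide the bytes moved by the elapsed time (a sketch; the byte count assumes every pivot in [K1, K2] really swaps two rows):

#include <cstdio>

// Hypothetical harness: times a DLASWP call with CUDA events and reports the
// resulting effective bandwidth.
void report_bandwidth(int N, double* A, int LDA, int K1, int K2, int* IPIV)
{
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	DLASWP(N, A, LDA, K1, K2, IPIV, 1);   // the routine from above
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);

	// Each swap reads and writes two rows of N doubles: 4*N*sizeof(double)
	// bytes per pivot in [K1, K2].
	double bytes = 4.0 * sizeof(double) * (double)N * (K2 - K1 + 1);
	printf("effective bandwidth: %.2f GB/s\n", bytes / (ms * 1.0e6));

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
}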

The problem is that in such a matrix the elements of a row are not contiguous, so coalesced memory accesses are not possible…
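To spell out the addressing (the IDX macro is only for illustration, it is not in my code):

/* Column-major storage: element (i,j) of an LDA-by-N matrix sits at
   A[i + LDA*j]. In the kernel above, thread idx handles column idx, so a row
   swap reads A[IDX(i1, idx, LDA)]: neighbouring threads are LDA*sizeof(double)
   bytes apart in memory, which defeats coalescing on the GTX 280. */
#define IDX(i, j, lda) ((i) + (lda) * (j))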

I tried to write a transpose routine so that the matrix would be in row-major storage and the accesses could coalesce, but my transpose was awfully slow: its bandwidth was about 3 GB/s. I am not even sure its accesses were coalesced; in fact I don't know how to record how the memory transactions are actually issued by the hardware (I used cublasAlloc and the double type everywhere, so alignment should be correct for coalesced access…).
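Something like this is the shape I was aiming for (a sketch in the spirit of the SDK transpose sample; the function name and the TILE constant are placeholders, not my actual routine):

#define TILE 16

// Out-of-place transpose sketch. 'in' is column-major, rows x cols, element
// (i,j) at in[i + rows*j]; 'out' is column-major, cols x rows, and receives
// the transpose. Launch with dim3 block(TILE, TILE) and
// dim3 grid((rows + TILE - 1)/TILE, (cols + TILE - 1)/TILE).
__global__ void transpose_sketch(int rows, int cols, const double* in, double* out)
{
	__shared__ double tile[TILE][TILE + 1];   // +1 pad to limit bank conflicts

	// Coalesced read: threadIdx.x runs along the contiguous (row) axis of 'in'.
	int i = blockIdx.x * TILE + threadIdx.x;
	int j = blockIdx.y * TILE + threadIdx.y;
	if(i < rows && j < cols)
		tile[threadIdx.y][threadIdx.x] = in[i + rows * j];

	__syncthreads();

	// Coalesced write: threadIdx.x now runs along the contiguous axis of 'out'.
	int oi = blockIdx.y * TILE + threadIdx.x;  // row in 'out'
	int oj = blockIdx.x * TILE + threadIdx.y;  // column in 'out'
	if(oi < cols && oj < rows)
		out[oi + cols * oj] = tile[threadIdx.x][threadIdx.y];
}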

I thought of using shared memory inside DLASWP, since the routine works on every row between I1 and I2 (so if I2-I1 is a multiple of 16, I could get coalesced access): I can keep an “image” of rows I1 to I2 in shared memory, work on it (accessing global memory for the rows outside this image), and then write the “image” back to the matrix, again with coalesced accesses.

This idea did not work out so well: my implementation completely broke my program, and I have no clue why…

Could someone give me some advice or suggestions on how to implement a high-performance version of DLASWP, please?

Thanks for your help!

Have you looked at the new transpose samples that ship with CUDA 2.2?

In a column-major matrix, each element in a coalesced memory access corresponds to a different row, so why don’t you read the ‘columns’ into shared memory and then read them back from shared memory according to the required permutation?
Since it’s a permutation, you should be able to order the shared-memory accesses such that there are no bank conflicts (although this may require additional shared memory storage so that you can also coalesce the memory writes).

N.

I used this trick and only got it working a few hours ago (I had to deal with a lot of indexing traps).

Here is my new code:

__global__
void DLASWPr_kernel(int N,double* A,int LDA,int I1,int I2,int* IPIV,int IX0, int INC,int INCX)
{
	unsigned int tx=threadIdx.x;
	unsigned int idx=threadIdx.x + blockDim.x * blockIdx.x;
	unsigned int offsetrow=blockDim.x * blockIdx.x;
	int IX=IX0;

	// Stage a BLOCK x BLOCK tile of the pivot window (rows I1-1..I1+BLOCK-2,
	// columns offsetrow..offsetrow+BLOCK-1) in shared memory. Thread tx loads
	// row I1-1+tx of each column, so the global reads are coalesced.
	__shared__ double cached[BLOCK][BLOCK];
	if(tx+I1-1<LDA)
	{
		for(int i=0;i<MIN(BLOCK,N-offsetrow);++i)
		{
			cached[tx][i]=A[(I1+tx-1)+LDA*(offsetrow+i)];
		}
	}
	__syncthreads();

	if(idx<N)
	{
		// Each thread applies the swap sequence to its own column of the tile.
		int i2,i1;
		for(int I=I1;I<=I2;I+=INC)
		{
			i2=IPIV[IX-1];
			i1=I-1;
			if(i2<I2)
			{
				// Both rows lie inside the cached window: swap in shared memory.
				double aux1=cached[i1-I1+1][tx];
				double aux2=cached[i2-I1+1][tx];
				cached[i2-I1+1][tx]=aux1;
				cached[i1-I1+1][tx]=aux2;
			}
			else
			{
				// The pivot row lies outside the window: swap with global memory.
				double aux=cached[i1-I1+1][tx];
				cached[i1-I1+1][tx]=A[i2+LDA*idx];
				A[i2+LDA*idx]=aux;
			}
			IX+=INCX;
			__syncthreads();
		}
	}
	__syncthreads();

	// Write the tile back to global memory, again with coalesced accesses.
	if(tx+I1-1<LDA)
	{
		for(int i=0;i<MIN(BLOCK,N-offsetrow);++i)
		{
			A[(I1+tx-1)+LDA*(offsetrow+i)]=cached[tx][i];
		}
	}
	__syncthreads();
}

I get a bandwidth of almost 10 GB/s using shared memory (“just” twice as fast as the previous version). I don’t know whether I can reach a higher bandwidth… Bank conflicts don’t seem to slow things down too much, do they?

Perhaps the __syncthreads() in the second loop is useless, and perhaps it would be worth unrolling the loop (on my previous implementation it wasn’t) (BLOCK=32).

I am thinking of padding to BLOCK+1 as shown in the transpose documentation, but I am not sure these tricks can bring a 50% bandwidth speedup. I write to the same locations I read from, so I don’t think it is possible to take advantage of the partitioned structure of global memory.
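Concretely, the only change the padding trick would require here is the tile declaration (the indexing stays exactly the same):

// With a row length of BLOCK (a multiple of the 16 shared-memory banks), the
// cached[tx][i] accesses in the load/store loops make every thread of a
// half-warp land in the same bank. Padding each row by one element spreads
// them over different banks; the cached[...][tx] accesses in the swap phase
// are unaffected.
__shared__ double cached[BLOCK][BLOCK+1];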

If your algorithm requires a lot of swaps, using index tables is actually better than swapping the data.
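For example, something along these lines (a host-side sketch, names are arbitrary): keep a row-index table, apply the pivots to the table, and read element (i, j) of the logically swapped matrix as A[rowidx[i] + LDA*j].

// Applies a DLASWP-style pivot sequence to an index table instead of to the
// data. rowidx must start as the identity permutation: rowidx[i] = i.
void apply_pivots_to_index(int* rowidx, const int* IPIV, int K1, int K2)
{
	for(int I=K1;I<=K2;++I)
	{
		int i1=I-1;          // current row, 0-based
		int i2=IPIV[I-1];    // pivot row, 0-based as in the posts above
		int tmp=rowidx[i1];
		rowidx[i1]=rowidx[i2];
		rowidx[i2]=tmp;
	}
}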