Hi,
I would like to write a routine that swaps rows of a matrix stored in column-major order, reproducing the behavior of the LAPACK routine DLASWP:
http://www.netlib.org/lapack/explore-3.1.1…l/dlaswp.f.html
Here is the routine I have written so far:
__device__
void interchanger(int i1, int i2, double* A, int LDA)
{
    // Each thread swaps the elements of rows i1 and i2 in its own
    // column of the column-major matrix A (leading dimension LDA).
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    double aux2 = A[i2 + LDA * idx];
    double aux1 = A[i1 + LDA * idx];
    A[i2 + LDA * idx] = aux1;
    A[i1 + LDA * idx] = aux2;
}
template<unsigned int step>
__global__
void DLASWPr_kernel(int N, double* A, int LDA, int I1, int I2,
                    int* IPIV, int IX0, int INC, int INCX)
{
    unsigned int tx  = threadIdx.x;
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int IX = IX0;
    if (step == BLOCK)
    {
        // Stage the pivot indices in shared memory (note: int, not
        // double). The load and the barrier are executed by the whole
        // block, outside the idx < N guard, because a __syncthreads()
        // inside a divergent branch is undefined behavior. This staging
        // also assumes INCX == 1 and blockDim.x == BLOCK.
        __shared__ int IPIVshar[BLOCK];
        IPIVshar[tx] = IPIV[tx + IX0 - 1];
        __syncthreads();
        if (idx < N)
        {
            int I = I1;
            #pragma unroll BLOCK
            for (int i = 0; i < BLOCK; ++i)
            {
                int i2 = IPIVshar[i];
                int i1 = I - 1;
                if (i2 != i1)
                    interchanger(i1, i2, A, LDA);
                IX += INCX;
                I  += INC;
            }
        }
    }
    else if (idx < N)
    {
        for (int I = I1; I <= I2; I += INC)
        {
            int i2 = IPIV[IX - 1];
            int i1 = I - 1;
            if (i2 != i1)
                interchanger(i1, i2, A, LDA);
            IX += INCX;
        }
    }
}
void DLASWP(int N, double* A, int LDA, int K1, int K2, int* IPIV, int INCX)
{
    int IX0, I1, I2, INC;
    int* gpuIPIV;
    if (INCX == 0)
        return;
    if (INCX > 0)
    {
        IX0 = K1;
        I1  = K1;
        I2  = K2;
        INC = 1;
    }
    else
    {
        IX0 = 1 + (1 - K2) * INCX;
        I1  = K2;
        I2  = K1;
        INC = -1;
    }
    // IPIV is pinned (page-locked), so the kernels can read it directly
    // through its mapped device pointer.
    cudaHostGetDevicePointer((void**)&gpuIPIV, (void*)IPIV, 0);
    dim3 grid((N / BLOCK) + 1);
    dim3 threads(BLOCK);
    //DLASWPcublas(N,A,LDA,I1,I2,IPIV,IX0,INC,INCX);
    if (K2 - K1 + 1 < LDA)
    {
        if (K2 - K1 + 1 == BLOCK)
            DLASWPr_kernel<BLOCK><<<grid, threads>>>(N, A, LDA, I1, I2, gpuIPIV, IX0, INC, INCX);
        else
            DLASWPr_kernel<0><<<grid, threads>>>(N, A, LDA, I1, I2, gpuIPIV, IX0, INC, INCX);
    }
    else
    {
        // Transpose to row-major storage, swap the (now contiguous)
        // rows, then transpose back.
        double* cached;
        cublasAlloc(N * LDA, sizeof(double), (void**)&cached);
        transpose(LDA, N, A, cached);
        DLASWPc_kernel<<<grid, threads>>>(LDA, cached, N, I1, I2, gpuIPIV, IX0, INC, INCX);
        transpose(N, LDA, cached, A);
        cublasFree(cached);
    }
}
IPIV is a pinned 1D array that holds the permutation indices (i.e. for each i between I1 and I2, rows i-1 and IPIV[i-1] are swapped; the -1 is an offset accounting for the different array indexing conventions of C and Fortran).
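For clarity, the semantics I am trying to reproduce can be sketched on the CPU like this (a hypothetical helper of my own, assuming INCX == 1 and the 0-based IPIV contents described above):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// CPU reference for the swap semantics: for each I in [K1, K2],
// exchange rows I-1 and IPIV[I-1] of the N-column, column-major
// matrix A (leading dimension LDA). Assumes INCX == 1 and that
// IPIV already holds 0-based row indices, as in my setup.
void dlaswp_ref(int N, std::vector<double>& A, int LDA,
                int K1, int K2, const std::vector<int>& IPIV)
{
    for (int I = K1; I <= K2; ++I) {
        int i1 = I - 1;        // 0-based row of the C array
        int i2 = IPIV[I - 1];  // already 0-based, per the offset above
        if (i1 == i2) continue;
        for (int col = 0; col < N; ++col)
            std::swap(A[i1 + LDA * col], A[i2 + LDA * col]);
    }
}
```
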
According to the NVIDIA profiler, the bandwidth is about 5 GB/s (on a GTX 280), which is rather slow for a device-only memory operation.
The problem is that in such a matrix the elements of a row are not contiguous, so coalesced memory accesses are not possible…
I tried writing a transpose routine so that I could work on a row-major copy with coalesced accesses, but my transpose was awfully slow: its bandwidth was only 3 GB/s. I am not even sure its accesses were coalesced; I don't know how to see how the memory transactions are actually issued by the hardware (I used cublasAlloc and double everywhere, so alignment should be correct for coalescing…).
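For the transpose, the usual trick is a shared-memory tile with one column of padding, so that both the global read and the global write are coalesced and shared-memory bank conflicts are avoided on the transposed read. A minimal sketch (not my actual transpose; the names and the TILE size are assumptions of this example):

```cuda
#define TILE 16

// Transposes the rows-by-cols column-major matrix `in` into the
// cols-by-rows column-major matrix `out`. Both the global read and
// the global write touch consecutive addresses within a half-warp;
// the +1 padding avoids bank conflicts on the transposed tile read.
__global__ void transpose_tiled(int rows, int cols,
                                const double* in, double* out)
{
    __shared__ double tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // row index in `in`
    int y = blockIdx.y * TILE + threadIdx.y;  // column index in `in`
    if (x < rows && y < cols)
        tile[threadIdx.y][threadIdx.x] = in[x + y * rows];
    __syncthreads();

    // Swap the block coordinates for the write.
    x = blockIdx.y * TILE + threadIdx.x;      // row index in `out`
    y = blockIdx.x * TILE + threadIdx.y;      // column index in `out`
    if (x < cols && y < rows)
        out[x + y * cols] = tile[threadIdx.x][threadIdx.y];
}
```

It would be launched with dim3 block(TILE, TILE) and a grid of ceil(rows/TILE) by ceil(cols/TILE) blocks.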
I also thought of using shared memory inside DLASWP, since the routine touches every row between I1 and I2 (so if I2-I1 is a multiple of 16, I could get coalesced accesses): I can keep an "image" of rows I1 to I2 in shared memory, perform the swaps there (falling back to global memory for rows outside the image), and then write the image back to the matrix, again with coalesced accesses.
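The shared-memory idea could look roughly like this. This is only a sketch under two strong assumptions: INCX == 1, and every pivot target lies inside the window of rows [I1-1, I2-1] (pivots outside the window would still need the global-memory path). All names and the WINDOW/CPB sizes are my own:

```cuda
#define WINDOW 16   // I2 - I1 + 1, assumed a multiple of the half-warp
#define CPB    16   // columns handled per thread block

// Block shape: (WINDOW, CPB) threads. Each block stages the WINDOW
// pivot rows of CPB columns in shared memory, applies the whole swap
// sequence there, and writes the rows back.
__global__ void dlaswp_window(int N, double* A, int LDA,
                              int I1, const int* IPIV)
{
    __shared__ double img[CPB][WINDOW + 1];  // +1 avoids bank conflicts
    int col   = blockIdx.x * CPB + threadIdx.y;
    int r     = threadIdx.x;                 // row offset in the window
    bool live = (col < N);                   // no early return: all
                                             // threads must reach the
                                             // barriers below

    // Coalesced load: consecutive threads (r) read consecutive
    // elements of a column of the column-major matrix.
    if (live) img[threadIdx.y][r] = A[(I1 - 1 + r) + LDA * col];
    __syncthreads();

    // The swaps are order-dependent, so one thread per column applies
    // the whole sequence serially inside shared memory.
    if (live && r == 0)
        for (int i = 0; i < WINDOW; ++i)
        {
            int j = IPIV[I1 - 1 + i] - (I1 - 1); // pivot's window slot
            if (j != i)
            {
                double t = img[threadIdx.y][i];
                img[threadIdx.y][i] = img[threadIdx.y][j];
                img[threadIdx.y][j] = t;
            }
        }
    __syncthreads();

    // Coalesced write-back.
    if (live) A[(I1 - 1 + r) + LDA * col] = img[threadIdx.y][r];
}
```

Note that the `live` predicate guards the loads and stores but not the barriers, to avoid the divergent-__syncthreads() problem.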
This idea did not work out: my implementation completely broke the program, and I have no clue why…
Could someone give me some advice or suggestions on how to implement an efficient version of DLASWP, please?
Thanks for the help!