Finite difference: Poisson equation, Gauss-Seidel red-black iterative method

Hi all!

I have to solve the Poisson equation (lap(u) = f) with a finite difference scheme. I’ve read a lot of forum topics and PDFs, but I haven’t found a concrete answer.

I have a 2D discretized grid of (Nx+2) x (Ny+2) cells (the +2 in each direction are ghost cells for the boundary conditions).
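
For reference, this is roughly the memory layout I’m working with (a sketch, not my exact allocation code; IDX2C is the usual column-major-style macro from the cuBLAS examples, and the kernels receive the pitch in doubles):

// Sketch of the assumed layout: pitched (Nx+2) x (Ny+2) arrays of doubles.
#define IDX2C(i,j,ld) ((j)*(ld)+(i))   // (i,j) -> linear index with leading dimension ld

double *dev_u, *dev_f;
size_t pitch;                                                           // pitch in bytes
cudaMallocPitch((void**)&dev_u, &pitch, (NX+2)*sizeof(double), NY+2);
cudaMallocPitch((void**)&dev_f, &pitch, (NX+2)*sizeof(double), NY+2);   // same width, same pitch
// the kernels get pitch/sizeof(double) as the leading dimension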

Here is a naive version of the code that I would like to improve using shared memory.

__global__ void RB(double *u, double *f, int pitch, double dx, double dy, int nx, int ny, int parity){
    // Update the interior points of one colour (selected by parity) in place.
    int j = 1 + blockIdx.y*blockDim.y + threadIdx.y;
    int i = 1 + 2*(blockIdx.x*blockDim.x + threadIdx.x) + ((j+parity)&1);   // parentheses needed: + binds tighter than &

    double invdx2 = 1./(dx*dx);
    double invdy2 = 1./(dy*dy);
    double c      = 1./(-2.*(invdx2+invdy2));

    if(i<nx+1 && j<ny+1)
        u[IDX2C(i,j,pitch)] = (-(u[IDX2C(i-1,j,pitch)]+u[IDX2C(i+1,j,pitch)])*invdx2
                               -(u[IDX2C(i,j-1,pitch)]+u[IDX2C(i,j+1,pitch)])*invdy2
                               + f[IDX2C(i,j,pitch)])*c;
}

As you can see, each thread reads 4 u values from global memory (because of the 2D stencil) and 1 f value. Each u value loaded by a thread also has to be loaded by its neighbouring threads, so every interior point is fetched several times per sweep; with shared memory I should be able to improve the memory throughput and performance…
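
For reference, the naive kernel is launched once per colour, with one thread per red (or black) interior point, along these lines (block sizes here are just illustrative, not my actual configuration):

// Illustrative launch: half the interior points in x (one colour per call), full interior in y.
dim3 block(32, 8);
dim3 grid((nx/2 + block.x - 1)/block.x, (ny + block.y - 1)/block.y);
RB<<<grid, block>>>(dev_u, dev_f, pitch/sizeof(double), dx, dy, nx, ny, 0);   // red sweep
RB<<<grid, block>>>(dev_u, dev_f, pitch/sizeof(double), dx, dy, nx, ny, 1);   // black sweep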

This is a first attempt with shared memory, which doesn’t give the expected improvement.

__global__ void RB(double *u, double *f, int pitch, double invdx2, double invdy2, double c, int nx, int ny, int parity){
    int j = blockIdx.y*BROWS + threadIdx.y;
    int i = blockIdx.x*BCOLS + 2*threadIdx.x + ((j+parity)&1);
    int offset = 1 - 2*((j+parity)&1);           // +1 or -1: the other point of the red/black pair

    extern __shared__ double sdata[];

    // Each thread loads its own point and the adjacent point of the other colour.
    sdata[IDX2C(threadIdx.x,threadIdx.y,BCOLS+2)]        = u[IDX2C(i,j,pitch)];
    sdata[IDX2C(threadIdx.x+offset,threadIdx.y,BCOLS+2)] = u[IDX2C(i+offset,j,pitch)];
    __syncthreads();

    // Interior threads compute the stencil from shared memory and store the result.
    if(threadIdx.x>0 && 2*threadIdx.x<BCOLS+1 && threadIdx.y>0 && threadIdx.y<BROWS+1)
        f[IDX2C(i,j,pitch)] = c*(-(sdata[IDX2C(2*threadIdx.x-1,threadIdx.y,BCOLS+2)] + sdata[IDX2C(2*threadIdx.x+1,threadIdx.y,BCOLS+2)])*invdx2
                                 -(sdata[IDX2C(2*threadIdx.x,threadIdx.y-1,BCOLS+2)] + sdata[IDX2C(2*threadIdx.x,threadIdx.y+1,BCOLS+2)])*invdy2
                                 + f[IDX2C(i,j,pitch)]);
}

The call to the RB kernel looks something like this:

dim3 threads_per_block_RB(BCOLS/2+1, BROWS+2);
dim3 blocks_per_grid_RB(NX/BCOLS, NY/BROWS);
RB<<<blocks_per_grid_RB, threads_per_block_RB, (BROWS+2)*(BCOLS+2)*sizeof(double)>>>(dev_u, dev_f, pitch/sizeof(double), invdx2, invdy2, c, nx, ny, 0);

Each thread copies 2 values from global to shared memory (thanks to the red-black scheme I can use half the threads in the x direction) and then performs the RB iteration.

I get no improvement with the second version. Can you explain why, and how I can speed up my code? All the writes are coalesced, but it seems that there are uncoalesced reads…

Any suggestions or comments?

Have you thought about simply transforming your problem into a large system of linear equations and using cublas routines to solve it? I assume that would be faster and easier to control.
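
To make that concrete: the discretized problem is the linear system A u = f coming from the 5-point stencil. A rough host-side sketch of assembling the dense matrix and right-hand side (assuming homogeneous Dirichlet ghost cells and row-by-row numbering of the interior points; a dense or sparse solver would then be applied to A):

#include <vector>

// Builds the dense N x N matrix A (row-major) and RHS b so that A*u = b reproduces
// (u_W + u_E)*invdx2 + (u_S + u_N)*invdy2 - 2*(invdx2+invdy2)*u = f at every interior point.
void assemble_poisson(int nx, int ny, double dx, double dy,
                      const std::vector<double> &f,     // f at the nx*ny interior points
                      std::vector<double> &A, std::vector<double> &b)
{
    const int N = nx*ny;
    const double invdx2 = 1.0/(dx*dx), invdy2 = 1.0/(dy*dy);
    A.assign((size_t)N*N, 0.0);
    b = f;                                               // ghost values are zero, so b = f
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i) {
            const int k = j*nx + i;                      // row index of point (i,j)
            A[(size_t)k*N + k] = -2.0*(invdx2 + invdy2); // diagonal
            if (i > 0)    A[(size_t)k*N + (k-1)]  = invdx2;  // west neighbour
            if (i < nx-1) A[(size_t)k*N + (k+1)]  = invdx2;  // east neighbour
            if (j > 0)    A[(size_t)k*N + (k-nx)] = invdy2;  // south neighbour
            if (j < ny-1) A[(size_t)k*N + (k+nx)] = invdy2;  // north neighbour
            // non-homogeneous boundary values would be moved into b here
        }
}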

If it is on Fermi hardware, don’t expect more than a 30% performance increase.
What I would advise is the following (I couldn’t tell if this is what you were already doing; it seemed close, if not the same):
Suppose you have 16x16 thread blocks.
Allocate an s[16+2][16+2] array in shared memory and copy the data there. Use thread 0 to fetch the halo cells (this is where the uncoalesced reads are coming from, I think); see the sketch at the end of this post.
Read from sdata and save to global memory (you are already doing this, I think).
One other thing you can do is exploit the texture cache to avoid the halo cells. That should work better than shared memory, but you could even combine the two (read the halo through the texture cache).
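
Something along these lines is what I mean, applied to your red-black sweep (just a sketch: it assumes nx and ny are multiples of the tile size, uses the edge threads of each block rather than thread 0 to fetch the halo, takes the pitch in doubles as ld, and uses (i+j)&1 as the colour convention, which may differ from yours):

#define TILE 16

// Each 16x16 block loads an 18x18 patch of u into shared memory (edge threads also
// fetch the halo), then updates only the points of the requested colour from shared memory.
__global__ void RB_tile(double *u, const double *f, int ld,
                        double invdx2, double invdy2, double c, int parity)
{
    __shared__ double s[TILE+2][TILE+2];

    int tx = threadIdx.x, ty = threadIdx.y;
    int i  = 1 + blockIdx.x*TILE + tx;                    // global interior coordinates
    int j  = 1 + blockIdx.y*TILE + ty;

    s[ty+1][tx+1] = u[j*ld + i];                          // own cell
    if (tx == 0)      s[ty+1][0]      = u[j*ld + i-1];    // west halo
    if (tx == TILE-1) s[ty+1][TILE+1] = u[j*ld + i+1];    // east halo
    if (ty == 0)      s[0][tx+1]      = u[(j-1)*ld + i];  // south halo
    if (ty == TILE-1) s[TILE+1][tx+1] = u[(j+1)*ld + i];  // north halo
    __syncthreads();

    // Only threads of the active colour write; their stencil reads only cells of the
    // other colour, which no block writes during this sweep.
    if (((i + j) & 1) == parity)
        u[j*ld + i] = c*( -(s[ty+1][tx] + s[ty+1][tx+2])*invdx2
                          -(s[ty][tx+1] + s[ty+2][tx+1])*invdy2
                          + f[j*ld + i] );
}

Launched with dim3 block(TILE,TILE) and dim3 grid(NX/TILE, NY/TILE), once per colour. Each u value is then loaded from global memory once per block instead of up to four times, and the row loads are coalesced.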