# Finite difference - poisson equation gauss seidel red-black iterative method

Hi all!

I have to solve the poisson equation ( lap(u) = f ) with a finite difference scheme. I’ve read a lot of forum topics and pdf’s but I didn’t find a concrete answer.

I have a 2D discretized grid with Nx+2 x Ny+2 cells (+2 in each direction are the ghost cells because of the boundary conditions)

Here there is a naive version of the code that I would like to improve using shared memory.

``````__global__ void RB(double *u, double *f, int pitch, double dx, double dy, int nx, int ny, int parity){

int j = 1 + blockIdx.y*blockDim.y + threadIdx.y;

int i = 1 + 2*(blockIdx.x*blockDim.x + threadIdx.x) + (j+parity)&1;

double invdx2, invdy2, c;

invdx2 = 1./(dx*dx);

invdy2 = 1./(dy*dy);

c = 1./(-2.*(invdx2+invdy2));

if(i<nx+1 && j<ny+1)

u[IDX2C(i,j,pitch)] = (-(u[IDX2C(i-1,j,pitch)]+u[IDX2C(i+1,j,pitch)])*invdx2-(u[IDX2C(i,j-1,pitch)]+u[IDX2C(i,j+1,pitch)])*invdy2+f[IDX2C(i,j,pitch)])*c;

}
``````

As u can see, each thread read from global memory 4 u datas (because of the 2D stencil) and 1 f data. The data loaded by a thread have to be reloaded by neighbour threads hence with shared memory I should improve the memory throughput and performance …

This is a first try with shared memory which doesn’t give the aspected improvements.

``````__global__ void RB(double *u, double *f, int pitch, double invdx2, double invdy2, double c, int nx, int ny, int parity){

int j = blockIdx.y*BROWS + threadIdx.y;

int i = blockIdx.x*BCOLS + 2*threadIdx.x + ((j+parity)&1);

int offset = 1-2*((j+parity)&1);

extern __shared__ double sdata[];

}
``````

The call to RB kernel is something like this

``````dim3 threads_per_block_RB(BCOLS/2+1, BROWS+2);

dim3 blocks_per_grid_RB(NX/BCOLS, NY/BROWS);

RB<<<blocks_per_grid_RB, threads_per_block_RB,(BROWS+2)*(BCOLS+2)*sizeof(double)>>>(dev_u, dev_f, pitch/sizeof(double), invdx2, invdy2, c, nx, ny, 0);
``````

Each thread copies from global to shared memory 2 data (because of the red-black scheme I can use half the thread in the x direction) and than performs the RB iteration.

I’ve got no improvement with the second code. Can u explain me why and how can I speed up my code? All the writes are coalesced but it seems that there are un-coalesced reads …