Hi all!

I have to solve the Poisson equation (lap(u) = f) with a finite difference scheme. I've read a lot of forum topics and PDFs but I didn't find a concrete answer.
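For reference, the update my kernels implement is the standard 5-point stencil solved for the centre point:

```
(u[i-1,j] + u[i+1,j] - 2*u[i,j])/dx^2 + (u[i,j-1] + u[i,j+1] - 2*u[i,j])/dy^2 = f[i,j]

=>  u[i,j] = c * ( -(u[i-1,j]+u[i+1,j])/dx^2 - (u[i,j-1]+u[i,j+1])/dy^2 + f[i,j] )

with  c = 1 / ( -2*(1/dx^2 + 1/dy^2) )
```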

I have a 2D discretized grid of (Nx+2) x (Ny+2) cells (the +2 in each direction are the ghost cells needed for the boundary conditions).

Here is a naive version of the code that I would like to improve using shared memory.

```
__global__ void RB(double *u, double *f, int pitch, double dx, double dy, int nx, int ny, int parity){
    int j = 1 + blockIdx.y*blockDim.y + threadIdx.y;
    // note the parentheses around the parity term: & binds looser than +,
    // so "... + (j+parity)&1" would AND the whole sum with 1
    int i = 1 + 2*(blockIdx.x*blockDim.x + threadIdx.x) + ((j+parity)&1);
    double invdx2 = 1./(dx*dx);
    double invdy2 = 1./(dy*dy);
    double c = 1./(-2.*(invdx2+invdy2));
    if(i<nx+1 && j<ny+1)
        u[IDX2C(i,j,pitch)] = (-(u[IDX2C(i-1,j,pitch)]+u[IDX2C(i+1,j,pitch)])*invdx2
                               -(u[IDX2C(i,j-1,pitch)]+u[IDX2C(i,j+1,pitch)])*invdy2
                               +f[IDX2C(i,j,pitch)])*c;
}
```

As you can see, each thread reads 4 u values from global memory (because of the 2D stencil) and 1 f value. The data loaded by one thread have to be reloaded by neighbouring threads, hence with shared memory I should improve the memory throughput and performance …

This is a first try with shared memory, which doesn't give the expected improvement.

```
__global__ void RB(double *u, double *f, int pitch, double invdx2, double invdy2, double c, int nx, int ny, int parity){
    int j = blockIdx.y*BROWS + threadIdx.y;
    int bit = (j+parity)&1;
    int lx = 2*threadIdx.x + bit;          // local column in the tile (0..BCOLS+1)
    int i = blockIdx.x*BCOLS + lx;         // global column
    int offset = 1 - 2*bit;                // horizontal neighbour loaded by this thread
    extern __shared__ double sdata[];
    // each thread loads its own point plus one horizontal neighbour, so the
    // whole (BCOLS+2) x (BROWS+2) tile (halo included) gets filled
    sdata[IDX2C(lx,threadIdx.y,BCOLS+2)] = u[IDX2C(i,j,pitch)];
    sdata[IDX2C(lx+offset,threadIdx.y,BCOLS+2)] = u[IDX2C(i+offset,j,pitch)];
    __syncthreads();
    // update interior points only, indexing sdata with the same lx used for the
    // loads, and write the result to u (not f) as in the naive kernel
    if(lx>0 && lx<BCOLS+1 && threadIdx.y>0 && threadIdx.y<BROWS+1 && i<nx+1 && j<ny+1)
        u[IDX2C(i,j,pitch)] = c*(-(sdata[IDX2C(lx-1,threadIdx.y,BCOLS+2)] + sdata[IDX2C(lx+1,threadIdx.y,BCOLS+2)])*invdx2
                                 -(sdata[IDX2C(lx,threadIdx.y-1,BCOLS+2)] + sdata[IDX2C(lx,threadIdx.y+1,BCOLS+2)])*invdy2
                                 + f[IDX2C(i,j,pitch)]);
}
```

The call to the RB kernel looks something like this:

```
dim3 threads_per_block_RB(BCOLS/2+1, BROWS+2);
dim3 blocks_per_grid_RB(NX/BCOLS, NY/BROWS);
RB<<<blocks_per_grid_RB, threads_per_block_RB,(BROWS+2)*(BCOLS+2)*sizeof(double)>>>(dev_u, dev_f, pitch/sizeof(double), invdx2, invdy2, c, nx, ny, 0);
```

Each thread copies 2 values from global to shared memory (because of the red-black scheme I can use half the threads in the x direction) and then performs the RB iteration.
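For completeness, one full sweep calls the kernel twice, once per colour (sketch; `max_iter` and the convergence test are placeholders I haven't shown):

```cuda
// one red-black sweep = red update followed by black update
// (max_iter is a placeholder; in practice I'd also check a residual norm)
size_t smem = (BROWS+2)*(BCOLS+2)*sizeof(double);
for (int it = 0; it < max_iter; ++it) {
    RB<<<blocks_per_grid_RB, threads_per_block_RB, smem>>>(dev_u, dev_f, pitch/sizeof(double), invdx2, invdy2, c, nx, ny, 0); // red points
    RB<<<blocks_per_grid_RB, threads_per_block_RB, smem>>>(dev_u, dev_f, pitch/sizeof(double), invdx2, invdy2, c, nx, ny, 1); // black points
}
cudaDeviceSynchronize();
```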

I've got no improvement with the second code. Can you explain why, and how can I speed up my code? All the writes are coalesced, but it seems that there are un-coalesced reads …

Any suggestions or comments?