global memory access with indirection and the perf' falls !

I have a question about the memory organisation in global memory.

To make true SIMD (Single Instruction Multiple Data) kernels, I have to add ghost cells to my tables before and after the domain’s cells. And in order to compute only on the domain, I use that trick:

myker <<<grid, block>>> (tab1+nc_ghost, tab2+nc_ghost);

wich drastically slows down the execution (depending on the amount of data but at least 3 times slower).

The same happens if I add nc_ghost to the thread index computation within the kernel:

//tx is the position in the grid

int thpos = tx + nc_ghost; //the position of the data to read in global memory

But, if there is no shift (ie: I don’t add nc_ghost to any adress), the kernel works on the whole tables and the card reaches the theoretical performances.

This makes me think that the threads are “close” to global memory according to their rank (in the grid), and when shifting an adress you go out of your memory page and it takes more time to reach the data.

Am I right ? Any idea to bypass the problem ?

How much is nc_ghost?
We have to do the evil 128 byte alignment for coalescing, or use textures.

nc_ghost vary between 1 and 2.
The total number of cells being domain + 2*nc_ghost.

Using an alignment of 128bytes (ie 32 more floats) gives better results.
G80 keeps some surprises !