I have a question about the memory organisation in global memory.
To make true SIMD (Single Instruction Multiple Data) kernels, I have to add ghost cells to my tables before and after the domain’s cells. And in order to compute only on the domain, I use that trick:
myker <<<grid, block>>> (tab1+nc_ghost, tab2+nc_ghost);
wich drastically slows down the execution (depending on the amount of data but at least 3 times slower).
The same happens if I add nc_ghost to the thread index computation within the kernel:
//tx is the position in the grid
int thpos = tx + nc_ghost; //the position of the data to read in global memory
But, if there is no shift (ie: I don’t add nc_ghost to any adress), the kernel works on the whole tables and the card reaches the theoretical performances.
This makes me think that the threads are “close” to global memory according to their rank (in the grid), and when shifting an adress you go out of your memory page and it takes more time to reach the data.
Am I right ? Any idea to bypass the problem ?