Any Elegant Solutions for dealing with ghost cells?


I was wondering what kind of elegant solutions have been found for accessing and updating ghost cells in global memory (i.e. for the entire grid) and in shared memory (for a single block), for 2D, 3D, etc. arrays, in ways that avoid uncoalesced global memory I/O and shared memory bank conflicts.



I need to implement ghost cells in my code too. I was wondering if anyone had any sample code, for example computing the average of two consecutive array elements.

There are different ways this could be implemented, but I am not sure what the best way is on the CUDA architecture.


What is a ghost cell?

I guess a ‘ghost cell’ refers to overhead, nonexistent cells which are there just for better memory access patterns.
Imagine you want a 15x15 matrix, but you want each row to be aligned to 16; then you introduce an extra column filled with (for example) zeros.

Please correct me, if I am mistaken with this explanation.

The content of the cells and how you handle them depends on what you want to do on the array and there is no general solution to all possible situations. Maybe you could explain more in detail what you want to accomplish?

I imagine that in this context, ghost cells are there to enable a domain decomposition. Suppose you had a Really Big Grid™ such that it didn’t fit into the memory of a single machine (say 4096^3). It should be easy enough to divide this grid into sub-grids of size 16^3 (or so) and farm them out to different processors. However, those sub-grids need to communicate with each other. Enter ghost cells. Each 16^3 sub-grid of ‘active’ cells is surrounded by a layer (two or more cells deep) of ‘ghost cells’ which just mirror the contents of the neighbouring sub-grids. The communication system has to keep these ghost cells updated between iterations.
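A minimal 1D sketch of that refresh step (all names and sizes here are my own, hypothetical, and the halo is only one cell deep for brevity):

```cuda
// Hypothetical 1D illustration: NSUB sub-grids of ACTIVE cells each,
// padded with one ghost cell on either side (indices 0 and ACTIVE+1).
#define NSUB   4
#define ACTIVE 16

// grid[s][1..ACTIVE] are the active cells; grid[s][0] and
// grid[s][ACTIVE+1] are ghost cells mirroring the neighbours.
void refresh_ghosts(float grid[NSUB][ACTIVE + 2])
{
    for (int s = 0; s < NSUB; ++s) {
        // Left ghost: copy the right edge of the left neighbour
        // (clamped to the sub-grid's own edge at the domain boundary).
        grid[s][0]          = (s > 0)        ? grid[s - 1][ACTIVE] : grid[s][1];
        // Right ghost: copy the left edge of the right neighbour.
        grid[s][ACTIVE + 1] = (s < NSUB - 1) ? grid[s + 1][1]      : grid[s][ACTIVE];
    }
}
```

After this refresh, each sub-grid can be updated independently for one iteration without touching its neighbours' memory.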

OK then, you are mistaken.

Ghost cells arise in the numerical solution of differential equations as a way of enforcing boundary conditions or effecting domain decomposition, usually in the spatial domain. The idea is to decorate the boundaries of computational grids or meshes with additional data (usually created by interpolation at true boundaries, or by copying from adjacent sub-domains) which approximates the appropriate boundary condition or maintains continuity when the differential equation is solved. Such additional cells or elements are called ghost cells and their contents ghost data. It is a very common technique for solving PDEs using finite difference and finite volume methods, and is widely used in fluid mechanics, electromagnetics, and heat and mass transfer calculations.

I stand corrected.

Ok, maybe I am using the wrong phrase. Basically my question is: if I have an array of data and I want to perform some operation where every thread needs to access two elements of the same array, e.g.

avg[thx] = (array[thx] + array[thx+1]) / 2

how can I arrange the memory so that every thread in each block can compute the average? Obviously the last thread of the block can't access array[thx+1], and that is the problem. I know that for this case we can use overlapping (each block loads one extra element into shared memory), but what if we are trying to compute the following:
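For the averaging case, the overlapping idea might be sketched like this (names and block size are my own, hypothetical choices): each block loads its BLOCK elements plus one halo element from the start of the next block into shared memory.

```cuda
#define BLOCK 256

// Each thread averages one adjacent pair. Shared memory holds the
// block's BLOCK elements plus one halo element from the next block.
__global__ void average_pairs(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 1];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        tile[threadIdx.x] = in[gid];

    // One thread per block fetches the halo element, so the last
    // thread's neighbour is available in shared memory too.
    if (threadIdx.x == 0 && gid + BLOCK < n)
        tile[BLOCK] = in[gid + BLOCK];
    __syncthreads();

    if (gid < n - 1)
        out[gid] = (tile[threadIdx.x] + tile[threadIdx.x + 1]) / 2.0f;
}
```

The consecutive loads are coalesced, and each input element is read from global memory only once per block.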


Here we won't be able to use overlapping, since all the values of each block need to be calculated first and then the last two elements need to be sent to the next block.

I was thinking we could then call the kernel in a loop and use only one block. I was wondering if there is a better way of doing it.

I hope I was able to make my point :)

I used halo cells (what I think you call ghost cells) in a code. The only way I could think to do it was to put the loop iteration outside the kernel call, so that it was guaranteed that writes to global memory had been completed by all threads. It may be possible to do it using __threadfence() and atomic writes.
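That host-side loop might look something like the sketch below (step_kernel and the buffer names are hypothetical). Double buffering makes each iteration read only the previous iteration's completed results; the kernel launch boundary on a single stream acts as the global synchronisation point.

```cuda
// Hypothetical host-side driver. d_a and d_b are device buffers of
// n floats; step_kernel reads one buffer (including its ghost/halo
// cells) and writes the other.
void run_iterations(float *d_a, float *d_b, int n, int n_iter)
{
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    for (int it = 0; it < n_iter; ++it) {
        // Kernels launched on the same stream execute in order, so
        // every launch sees all global writes of the previous one.
        step_kernel<<<grid, block>>>(d_a, d_b, n);

        // Swap buffers: this iteration's output is the next input.
        float *tmp = d_a; d_a = d_b; d_b = tmp;
    }
    cudaDeviceSynchronize();
}
```

This avoids needing __threadfence() or atomics at the cost of one kernel launch per iteration.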