Memory sharing across blocks

I have a problem where I am calculating dynamics of a 2D grid at each grid cell. These dynamics include self-dependent and neighbor-dependent terms, such as evaporation (self) and diffusion (neighbors). I have found that an additional check must be implemented that scales the net flux of each grid cell to enforce conservation laws, thus I store the fluxes in their own grids (North, East, South, West, Center), sum them all up, scale them, then update the grids for the next time step.

More specifically, I only calculate known in-fluxes (such as precipitation) and calculated out-fluxes of each grid cell, since the calculated in-fluxes could be reduced after the check for conservation. I then say that the northern out-flux of one grid cell is the southern in-flux of the grid cell just above it. If we put this on a typical Cartesian grid with x+ to the right and y+ down, it looks something like

| X
v Y

centerInFlux(x,y) = a1 * northOutFlux(x,y+1) + a2 * eastOutFlux(x-1,y) + a3 * southOutFlux(x,y-1) + a4 * westOutFlux(x+1,y)

where the a1, a2, a3, a4 values are scaling values in the range [0, 1] calculated to ensure continuity and non-negative mass values, i.e.

if currentValue(x,y) - northOutFlux(x,y) - eastOutFlux(x,y) - southOutFlux(x,y) - westOutFlux(x,y) < 0
then a1 = currentValue(x,y) / ( northOutFlux(x,y) + eastOutFlux(x,y) + southOutFlux(x,y) + westOutFlux(x,y) )
else a1 = 1

such that currentValue(x,y) - a1 * ( northOutFlux(x,y) + eastOutFlux(x,y) + southOutFlux(x,y) + westOutFlux(x,y) ) >= 0

The issue that I’ve found is that if the out flux grids are stored as typical global arrays, then every 16 pixels (I have 16x16 blocks) gets poor data since the threads cannot access the memory of at least one neighbor.

The temporary solution I’ve found is to store the calculated out-fluxes in global arrays, cudamemcopy them to cuda arrays, then bind them to textures. This works just fine, but since the binding is a host function it is extremely slow. I lost about 50% of my computation speed by doing this (there are a lot more grids and fluxes that I have to keep track of than just the N/E/S/W/C).

I know that memory sharing across blocks is not a possibility for now, especially as I am running compute capability 1.1. I also know that writing to global/texture memory is slow and inefficient, especially when it has to be done ~25 times per iteration.

One thing I think would speed it up is to be able to control where the block is centered about, then “move” it across the grid. In this way I’d only update grid cells that are away from the block’s borders, then shift the block by a few grids and get the ones that I skipped before. The only issue is I don’t know how to go about doing this. Any ideas?

I tried to explain everything as best I could, but I know it may be confusing. Please let me know if there is something I can clarify. I cannot really post code since it is all intertwined and I’d have to post almost all of it for any of it to make sense.

A first suggestion would be to not use cuda arrays, but bind the global arrays to textures directly.
There is no need to use cuda arrays for textures unless you want to use the interpolation hardware (usually a bad idea for scientific computing, due to the low precision) or
other advanced features. The binding of global arrays to textures should be fast.