Memory sharing across blocks

diablobanshee · September 29, 2011, 1:30pm

I am calculating the dynamics of a 2D grid at each grid cell. These dynamics include self-dependent and neighbor-dependent terms, such as evaporation (self) and diffusion (neighbors). I have found that I need an additional check that scales the net flux of each grid cell to enforce conservation laws. I therefore store the fluxes in their own grids (North, East, South, West, Center), sum them, scale them, then update the grids for the next time step.

More specifically, I only calculate known in-fluxes (such as precipitation) and calculated out-fluxes of each grid cell, since the calculated in-fluxes could be reduced after the check for conservation. I then say that the northern out-flux of one grid cell is the southern in-flux of the grid cell just above it. If we put this on a typical Cartesian grid with x+ to the right and y+ down, it looks something like

.------->
| X
|
|
v Y

centerInFlux(x,y) = a1 * northOutFlux(x,y+1) + a2 * eastOutFlux(x-1,y) + a3 * southOutFlux(x,y-1) + a4 * westOutFlux(x+1,y)

where the a1, a2, a3, a4 values are scaling values in the range [0, 1] calculated to ensure continuity and non-negative mass values, i.e.

if currentValue(x,y) - northOutFlux(x,y) - eastOutFlux(x,y) - southOutFlux(x,y) - westOutFlux(x,y) < 0
then a1 = currentValue(x,y) / ( northOutFlux(x,y) + eastOutFlux(x,y) + southOutFlux(x,y) + westOutFlux(x,y) )
else a1 = 1

such that currentValue(x,y) - a1 * ( northOutFlux(x,y) + eastOutFlux(x,y) + southOutFlux(x,y) + westOutFlux(x,y) ) >= 0
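For reference, here is a minimal CPU sketch of the scheme above in C++, not the actual code. It assumes that a1..a4 are the scale factors of the four source cells (each computed at the neighbor the flux comes from), that currentValue is non-negative, and that flux leaving the grid edge is simply lost. `Grid`, `OutFlux`, `computeScale` and `applyFluxes` are hypothetical names made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major grid; x+ is right, y+ is down, as in the axes above.
struct Grid {
    int nx, ny;
    std::vector<double> cells;
    Grid(int nx_, int ny_, double init = 0.0)
        : nx(nx_), ny(ny_), cells(static_cast<std::size_t>(nx_) * ny_, init) {
        assert(nx_ > 0 && ny_ > 0);
    }
    bool inside(int x, int y) const { return x >= 0 && x < nx && y >= 0 && y < ny; }
    double& at(int x, int y) { return cells[static_cast<std::size_t>(y) * nx + x]; }
    double at(int x, int y) const { return cells[static_cast<std::size_t>(y) * nx + x]; }
};

// Unscaled out-fluxes of every cell, one grid per direction.
struct OutFlux { Grid north, east, south, west; };

double totalOut(const OutFlux& out, int x, int y) {
    return out.north.at(x, y) + out.east.at(x, y) + out.south.at(x, y) + out.west.at(x, y);
}

// Pass 1: one scale factor in [0, 1] per cell, so a cell never gives away
// more than it holds. Assumes value >= 0, so total > value implies total > 0.
Grid computeScale(const Grid& value, const OutFlux& out) {
    Grid a(value.nx, value.ny, 1.0);
    for (int y = 0; y < value.ny; ++y)
        for (int x = 0; x < value.nx; ++x) {
            const double total = totalOut(out, x, y);
            if (value.at(x, y) - total < 0.0) a.at(x, y) = value.at(x, y) / total;
        }
    return a;
}

// Pass 2: each cell subtracts its own scaled out-flux and gathers the scaled
// out-fluxes of its neighbors. Every scale factor is read from the source cell.
Grid applyFluxes(const Grid& value, const OutFlux& out, const Grid& a) {
    Grid next(value.nx, value.ny);
    for (int y = 0; y < value.ny; ++y)
        for (int x = 0; x < value.nx; ++x) {
            double v = value.at(x, y) - a.at(x, y) * totalOut(out, x, y);
            if (value.inside(x, y + 1)) v += a.at(x, y + 1) * out.north.at(x, y + 1);
            if (value.inside(x - 1, y)) v += a.at(x - 1, y) * out.east.at(x - 1, y);
            if (value.inside(x, y - 1)) v += a.at(x, y - 1) * out.south.at(x, y - 1);
            if (value.inside(x + 1, y)) v += a.at(x + 1, y) * out.west.at(x + 1, y);
            next.at(x, y) = v;
        }
    return next;
}
```

The second pass reads out-fluxes and scale factors that other cells produced, so the whole first pass has to be finished before the second one starts. On the GPU that would presumably mean two separate kernel launches.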

The issue I’ve found is that if the out-flux grids are stored as ordinary global arrays, then every 16th pixel (I use 16x16 blocks) gets bad data, since the threads there cannot access the memory of at least one neighbor.

The temporary solution I’ve found is to store the calculated out-fluxes in global arrays, copy them to CUDA arrays with cudaMemcpy, then bind those to textures. This works just fine, but since the binding is a host function it is extremely slow. I lost about 50% of my computation speed by doing this (there are a lot more grids and fluxes that I have to keep track of than just the N/E/S/W/C).

I know that memory sharing across blocks is not a possibility for now, especially as I am running compute capability 1.1. I also know that writing to global/texture memory is slow and inefficient, especially when it has to be done ~25 times per iteration.

One thing I think would speed it up is being able to control where each block is centered, then “move” it across the grid. That way I’d only update grid cells that are away from the block’s borders, then shift the block by a few cells and update the ones I skipped before. The only issue is that I don’t know how to go about doing this. Any ideas?
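One way to picture the “moving block” idea is as overlapping tiles: each 16-wide block is shifted by 14 cells instead of 16, and only its 14 interior threads write results, while the outer ring just loads neighbor data. Below is a minimal C++ sketch of that index arithmetic along one axis, assuming a one-cell halo. The names `kBlock`, `kHalo`, `kInner`, `blocksNeeded` and `mapThread` are hypothetical, and the same mapping would be applied to x and y independently.

```cpp
#include <cassert>
#include <vector>

constexpr int kBlock = 16;                  // threads per block along one axis
constexpr int kHalo  = 1;                   // assumed halo width (nearest neighbors only)
constexpr int kInner = kBlock - 2 * kHalo;  // cells each block actually updates (14)

// Number of overlapping blocks needed to cover n cells along one axis.
int blocksNeeded(int n) { return (n + kInner - 1) / kInner; }

struct Mapping {
    int  global;   // grid coordinate this thread looks at (may fall outside the grid)
    bool updates;  // true if the thread writes a result, false if it only loads halo data
};

// Map (block index, thread index) to a grid coordinate. Blocks advance by
// kInner cells, so neighboring blocks overlap by 2 * kHalo cells.
Mapping mapThread(int block, int thread, int n) {
    assert(thread >= 0 && thread < kBlock);
    const int  g        = block * kInner - kHalo + thread;
    const bool interior = thread >= kHalo && thread < kBlock - kHalo;
    const bool inGrid   = g >= 0 && g < n;
    return {g, interior && inGrid};
}

// Count how many times each of the n cells gets updated across all blocks.
std::vector<int> updateCounts(int n) {
    std::vector<int> counts(n, 0);
    for (int b = 0; b < blocksNeeded(n); ++b)
        for (int t = 0; t < kBlock; ++t) {
            const Mapping m = mapThread(b, t, n);
            if (m.updates) ++counts[m.global];
        }
    return counts;
}
```

With this mapping every cell is updated by exactly one block, and each block sees one extra cell on every side. The cost is that 16x16 threads only produce 14x14 results per block.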

I tried to explain everything as best I could, but I know it may be confusing. Please let me know if there is something I can clarify. I cannot really post code since it is all intertwined and I’d have to post almost all of it for any of it to make sense.

mfatica · September 29, 2011, 2:33pm

A first suggestion would be to not use CUDA arrays, but to bind the global arrays to textures directly. There is no need to use CUDA arrays for textures unless you want to use the interpolation hardware (usually a bad idea for scientific computing, due to the low precision) or other advanced features. The binding of global arrays to textures should be fast.
