This morning I once again found myself reading over the nuances of global memory writing intrinsics. I have a kernel that I’m working on which will proceed in three stages:
1.) Read some data about particles and perform some fairly lengthy pre-computations on it (about 150 FLOPs per thread). Pool the results in __shared__. __syncthreads().
2.) Loop over the data to see what sectors of a confined volume of space the particles processed in (1.) might affect. (This process is fairly quick, and I might even combine it with the first step.) Store the results in short arrays of __shared__. __syncthreads().
3.) Loop over the affected sectors of the defined volume and, based on notes taken and stored in (2.), loop over the particles processed in (1.). Compute the effects of the particles on a regular grid of points within the volume.
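The three stages above might be sketched as follows. This is only an illustrative skeleton under assumed names: Particle, BLOCK_SIZE, MAX_SECTORS, and the device helpers precompute(), flagAffectedSectors(), and accumulateSector() are all hypothetical stand-ins, not anything from the actual kernel.

```cuda
// Hypothetical skeleton of the three-stage kernel; all names are illustrative.
__global__ void mapParticles(const Particle *particles, float *grid, int nParticles)
{
    __shared__ float preComp[BLOCK_SIZE];    // Stage 1: pooled pre-computations
    __shared__ short affected[MAX_SECTORS];  // Stage 2: notes on affected sectors
    __shared__ int   nAffected;

    const int tid = threadIdx.x;

    // Stage 1: lengthy per-particle pre-computation (~150 FLOPs per thread)
    preComp[tid] = precompute(particles[blockIdx.x * blockDim.x + tid]);
    __syncthreads();

    // Stage 2: note which sectors of the volume these particles might affect
    if (tid == 0) nAffected = 0;
    __syncthreads();
    flagAffectedSectors(preComp, affected, &nAffected);
    __syncthreads();

    // Stage 3: loop over affected sectors and accumulate each particle's
    // effect on the regular grid of points within the volume
    for (int s = tid; s < nAffected; s += blockDim.x)
        accumulateSector(affected[s], preComp, grid);
}
```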
The question is now how to store the results of (3.). My first inclination was to allocate a separate array in __shared__ so that, over many cycles, the results could be accumulated for eventual writing to __global__ memory in one fell swoop, with well-coalesced transactions. However, accumulating the results in __shared__ limits the size, and particularly the density, of the region I can afford to map. I could write a more general kernel if, instead, I wrote the results directly to __global__ memory. That could also make better use of the remaining L1 cache: I am allocating as much as I can to __shared__, and what is left is not being hit by any of my other memory transactions (I read with __ldcv()). However, each grid point is not necessarily complete the first time it is written; the final value may change over subsequent cycles. I can guarantee, though, that those cycles will be executed by the same thread block, separated by __syncthreads(). Would it be safe to write to __global__ memory in this case, and let the kernel use whatever L1 / L2 resources are available to expedite retrieval of a previously stored value?
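To make the pattern in question concrete, here is a sketch of the direct-to-__global__ accumulation I have in mind. The indexing helper gridPointForCycle() and contribution() are hypothetical; the assumption being made is that any one grid point is touched by only one thread per cycle, with cycles separated by __syncthreads().

```cuda
// Sketch of the accumulate-in-__global__ pattern under discussion.
// A later cycle re-reads a value the SAME block wrote in an earlier cycle;
// the __syncthreads() between cycles is what the pattern relies on.
for (int cycle = 0; cycle < nCycles; cycle++) {
    // Hypothetical mapping of (cycle, thread) to a grid point; -1 = no work.
    const int g = gridPointForCycle(cycle, threadIdx.x);
    if (g >= 0) {
        float prev = grid[g];                        // possibly written last cycle
        grid[g] = prev + contribution(cycle, threadIdx.x);
    }
    __syncthreads();  // barrier; also orders the block's memory accesses
}
```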
As I said, this would make my kernel much more general, and possibly a bit more performant, given how infrequent the stores of results from (3.) are compared to the processing and __shared__ memory accesses leading up to them. I have been told, by someone who knows a lot and quoted the manual, that __syncthreads() will guarantee coherence of __global__ writes at the level of the block, but I have also heard contrary opinions, and another experienced CUDA programmer insists that one should not trust such coherence.