This morning I once again found myself reading over the nuances of global memory writing intrinsics. I have a kernel that I’m working on which will proceed in three stages:
1.) Read some data about particles and perform some fairly lengthy pre-computations on it (about 150 FLOPs per thread). Pool the results in __shared__.
2.) Loop over the data to see what sectors of a confined volume of space the particles processed in (1.) might affect, then __syncthreads(). Store the results in short arrays of __shared__. (This process is fairly quick, and I might even combine it with the first step.)
3.) Loop over the affected sectors of the confined volume and loop over the particles processed in (1.), based on notes taken and stored in (2.). Compute the effects of the particles on a regular grid of points within the volume.
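For concreteness, the shape I have in mind is something like the following sketch. Every identifier, every size, and all of the placeholder arithmetic here are hypothetical, not the real kernel:

```cuda
#include <cuda_runtime.h>

// Hypothetical sizes -- placeholders, not the actual kernel's numbers.
constexpr int MAX_PARTICLES = 128;   // assumes nParticles <= MAX_PARTICLES
constexpr int N_SECTORS     = 64;
constexpr int GRID_POINTS   = 256;   // grid points handled per block

// Stand-in for the ~150-FLOP pre-computation of stage (1.).
__device__ float4 precompute(float4 p)
{
  p.w = p.x * p.x + p.y * p.y + p.z * p.z;  // placeholder arithmetic
  return p;
}

__global__ void threeStageKernel(const float4* particles, float* gridOut,
                                 int nParticles)
{
  __shared__ float4 stage1[MAX_PARTICLES];   // pooled stage (1.) results
  __shared__ int    notes[N_SECTORS];        // stage (2.) sector notes
  __shared__ float  accum[GRID_POINTS];      // stage (3.) accumulators

  // (1.) Read particle data and pre-compute; also zero the work arrays.
  for (int i = threadIdx.x; i < nParticles; i += blockDim.x) {
    stage1[i] = precompute(particles[i]);
  }
  for (int i = threadIdx.x; i < N_SECTORS; i += blockDim.x) {
    notes[i] = 0;
  }
  for (int i = threadIdx.x; i < GRID_POINTS; i += blockDim.x) {
    accum[i] = 0.0f;
  }
  __syncthreads();

  // (2.) Note which sectors each particle might affect (placeholder rule).
  for (int i = threadIdx.x; i < nParticles; i += blockDim.x) {
    const int sector = (int)stage1[i].w % N_SECTORS;
    atomicOr(&notes[sector], 1);
  }
  __syncthreads();

  // (3.) Loop over affected sectors, then over particles, accumulating
  // each particle's effect on the regular grid of points (placeholder
  // "effect" formula).
  for (int pt = threadIdx.x; pt < GRID_POINTS; pt += blockDim.x) {
    float v = 0.0f;
    for (int s = 0; s < N_SECTORS; s++) {
      if (notes[s] == 0) {
        continue;
      }
      for (int i = 0; i < nParticles; i++) {
        v += stage1[i].w * 1.0e-6f;
      }
    }
    accum[pt] = v;
  }
  __syncthreads();

  // Flush to __global__ in one pass of well-coalesced stores.
  for (int pt = threadIdx.x; pt < GRID_POINTS; pt += blockDim.x) {
    gridOut[blockIdx.x * GRID_POINTS + pt] = accum[pt];
  }
}
```

This is the __shared__-accumulation version; the question below is about replacing `accum` with direct __global__ stores.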
The question is now how I want to store the results of (3.). My first inclination was to allocate a separate array in
__shared__ so that, over many cycles, the results could be accumulated for eventual writing to
__global__ memory in one fell swoop, with a lot of well-coalesced transactions. However, accumulating the results in
__shared__ limits the size, and particularly the density, of the region I can afford to map. I could write a more general kernel if, instead, I could write the results directly to
__global__. This could also make better use of the remaining L1 cache. I am allocating as much as I can to
__shared__, and what is left is not being hit by any of my other memory transactions–I read with
__ldcv(). However, each grid point is not necessarily complete the first time it is written–the final value may change over subsequent cycles, but I can guarantee that these cycles will be done by the same thread block, and separated by
__syncthreads(). Would it be safe to write to
__global__ in this case, and let it use whatever L1 / L2 resources are available to expedite retrieval of a previously stored value?
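Spelled out, the pattern I am asking about would look something like this sketch. `pointsPerBlock`, `nCycles`, and `cycleContribution()` are stand-ins, and the point-to-thread rotation is there only to make the cross-thread re-read explicit:

```cuda
#include <cuda_runtime.h>

// Placeholder for the real stage-(3.) arithmetic done each cycle.
__device__ float cycleContribution(int cycle, int pt)
{
  return (float)(cycle + 1) * 0.001f * (float)pt;
}

// Accumulate results directly in __global__ memory over many cycles.
// Each grid point is re-read and updated in later cycles, possibly by a
// different thread of the same block, with the cycles separated by
// __syncthreads().
__global__ void accumulateInGlobal(float* gridVal, int pointsPerBlock,
                                   int nCycles)
{
  float* myPatch = &gridVal[blockIdx.x * pointsPerBlock];

  for (int cycle = 0; cycle < nCycles; cycle++) {
    // Rotate the point-to-thread assignment each cycle so that a given
    // point is generally updated by a different thread than last time.
    const int start = (int)((threadIdx.x + cycle) % blockDim.x);
    for (int pt = start; pt < pointsPerBlock; pt += blockDim.x) {
      // Read back the running total stored in an earlier cycle.  The
      // __syncthreads() below separates the previous cycle's stores
      // from this cycle's loads within the block.
      const float prev = (cycle == 0) ? 0.0f : myPatch[pt];
      myPatch[pt] = prev + cycleContribution(cycle, pt);
    }
    __syncthreads();
  }
}
```

Whether the plain load of `myPatch[pt]` here is sufficient, or whether it would need a `volatile` qualifier or an explicit cache-operator intrinsic, is exactly the crux of my question.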
As I said, it would make my kernel much more general, and possibly a bit more performant given how infrequent the stores of results from (3.) are compared to the processing and
__shared__ memory accesses leading up to them. I have been told, by someone who knows a lot and quoted the manual, that
__syncthreads() will guarantee coherence in
__global__ writes at the level of the block, but I have also heard contrary opinions: another experienced CUDA programmer insists one should not trust such coherence.
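For reference, the belt-and-braces version that the skeptics' position implies would pair each cycle's __global__ stores with an explicit block-scope fence ahead of the barrier, rather than relying on __syncthreads() alone. A minimal sketch, with `accumPatch`, `pt`, and `value` assumed from the surrounding kernel:

```cuda
// Conservative variant of the cross-cycle hand-off: fence, then barrier.
// (All threads of the block must reach the __syncthreads() together.)
__device__ void publishAndSync(float* accumPatch, int pt, float value)
{
  accumPatch[pt] = value;   // plain store to __global__ memory
  __threadfence_block();    // order the store for all threads in the block
  __syncthreads();          // barrier before any thread re-reads the value
}
```

I would rather understand whether the fence is redundant here than cargo-cult it in.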