Dare I use L1 in this way?

This morning I once again found myself reading over the nuances of global memory writing intrinsics. I have a kernel that I’m working on which will proceed in three stages:
1.) Read some data about particles and perform some fairly lengthy pre-computations on it (about 150 FLOPs per thread). Pool the results in __shared__. __syncthreads().
2.) Loop over the data to see which sectors of a confined volume of space the particles processed in (1.) might affect. (This step is fairly quick, and I might even combine it with the first.) Store the results in short arrays in __shared__. __syncthreads().
3.) Loop over the affected sectors of the defined volume and, based on the notes taken and stored in (2.), loop over the particles processed in (1.). Compute the effects of the particles on a regular grid of points within the volume.
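The three stages above might be sketched roughly as follows. This is only an illustrative skeleton, not my actual code: the names (mapParticles, psum, sector_list) and sizes are hypothetical, and the per-particle math is a stand-in for the real ~150-FLOP pre-computation.

```cuda
// Hypothetical skeleton of the three-stage kernel; assumes blockDim.x <= 256.
__global__ void mapParticles(const float4 *particles, int nparticle,
                             float *grid, int ngrid) {
  __shared__ float psum[256];        // stage (1) results, one slot per thread
  __shared__ short sector_list[64];  // stage (2) notes on affected sectors
  __shared__ int nsector;

  // (1) Read particle data and do the lengthy pre-computation.
  float result = 0.0f;
  for (int i = threadIdx.x; i < nparticle; i += blockDim.x) {
    float4 p = particles[i];
    result += p.x * p.y + p.z * p.w;  // stand-in for the real work
  }
  psum[threadIdx.x] = result;
  __syncthreads();

  // (2) Scan the pooled results to flag which sectors may be affected.
  if (threadIdx.x == 0) {
    nsector = 0;  // a real kernel would populate sector_list here
  }
  __syncthreads();

  // (3) Loop over affected sectors, then over the particles from (1),
  //     accumulating effects onto a regular grid of points.
  for (int s = 0; s < nsector; s++) {
    int sector = sector_list[s];
    (void)sector;
    // ... accumulate contributions from psum[] into grid[] ...
  }
}
```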

The question is now how I want to store the results of (3.). My first inclination was to allocate a separate array in __shared__ so that, over many cycles, the results could be accumulated and eventually written to __global__ memory in one fell swoop, with well-coalesced transactions. However, accumulating the results in __shared__ limits the size, and particularly the density, of the region I can afford to map. I could write a more general kernel if, instead, I wrote the results directly to __global__. This could also make better use of the remaining L1 cache: I am allocating as much as I can to __shared__, and what is left is not being hit by any of my other memory transactions, since I read with __ldcv().

However, each grid point is not necessarily complete the first time it is written; the final value may change over subsequent cycles. I can guarantee that these cycles will be executed by the same thread block, separated by __syncthreads(). Would it be safe to write to __global__ in this case, and let the hardware use whatever L1 / L2 resources are available to expedite retrieval of a previously stored value?
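To make the pattern in question concrete, here is a sketch under the stated assumptions: grid[] is partitioned so each grid point is touched by exactly one block, and cycles are separated by __syncthreads(), which acts as both a barrier and a block-level memory fence. The names (accumulateInGlobal, computeContribution) and the rotating thread assignment are illustrative, not from my kernel.

```cuda
// Placeholder for the real per-point, per-cycle contribution.
__device__ float computeContribution(int cycle, int i) {
  return 0.001f * (float)(cycle + i);
}

// Block-private accumulation directly in __global__ memory.
__global__ void accumulateInGlobal(float *grid, int points_per_block,
                                   int ncycle) {
  float *mygrid = grid + blockIdx.x * points_per_block;

  for (int cycle = 0; cycle < ncycle; cycle++) {
    // Rotate the point-to-thread assignment each cycle, so a point
    // written in one cycle may be re-read and updated by a different
    // thread of the same block in the next cycle.
    for (int i = (threadIdx.x + cycle) % blockDim.x; i < points_per_block;
         i += blockDim.x) {
      // Plain read-modify-write: the earlier value is visible here only
      // because a __syncthreads() separates the cycles.  L1/L2 may serve
      // the re-read; correctness should not depend on where it is cached.
      mygrid[i] += computeContribution(cycle, i);
    }
    __syncthreads();  // barrier + block-level memory fence between cycles
  }
}
```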

As I said, this would make my kernel much more general, and possibly a bit more performant given how infrequent the stores of results from (3.) are compared to the processing and __shared__ memory accesses leading up to them. I have been told by someone knowledgeable, quoting the manual, that __syncthreads() will guarantee coherence of __global__ writes at the level of the block, but I have also heard contrary opinions, including from another experienced CUDA programmer who insists that one should not trust such coherence.

Actually, I realized that I have been doing precisely this in other contexts. It can be quite fast, though perhaps a few percent slower than using __shared__ for the same purpose. If the traffic is light and the space is not taking a lot of atomic adds (which pull L2 into the picture), it is viable. However, other discoveries have led me to conclude that the algorithm I was considering is itself not what I want to do, regardless of the Wild West L1 usage, so I am going to close this topic.
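For completeness, the atomic-add case mentioned above arises when the same grid point can receive contributions from more than one block, so the per-block __syncthreads() guarantee no longer covers the update. A hedged sketch, with hypothetical names and binning:

```cuda
// When contributions to a cell can come from any block, the plain "+="
// becomes a race; atomicAdd serializes the updates, routing them through
// L2.  The binning scheme here is purely illustrative.
__global__ void scatterAtomic(const float4 *particles, int nparticle,
                              float *grid, int ngrid) {
  for (int p = blockIdx.x * blockDim.x + threadIdx.x; p < nparticle;
       p += gridDim.x * blockDim.x) {
    int cell = ((int)particles[p].x) % ngrid;  // hypothetical cell index
    if (cell < 0) cell += ngrid;
    // Multiple particles (from any block) may land in the same cell, so
    // the update must be atomic.
    atomicAdd(&grid[cell], particles[p].w);
  }
}
```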

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.