Dare I use L1 in this way?

dscerutti · October 3, 2023, 2:54pm

This morning I once again found myself reading over the nuances of global memory writing intrinsics. I have a kernel that I’m working on which will proceed in three stages:
1.) Read some data about particles and perform some fairly lengthy pre-computations on it (about 150 FLOPs per thread). Pool the results in __shared__. __syncthreads().
2.) Loop over the data to see what sectors of a confined volume of space the particles processed in (1.) might affect. (This process is fairly quick, and I might even combine it with the first step.) __syncthreads(). Store the results in short arrays of __shared__.
3.) Loop over the affected sectors of the defined volume and loop over the particles processed in (1.) based on notes taken an stored in (2.). Compute the effects of the particles on a regular grid of points within the volume.

The question is now how I want to store the results of (3.). My first inclination was to allocate a separate array in __shared__ so that, over many cycles, the results could be accumulated for eventual writing to __global__ memory in one fell swoop, with a lot of well-coalesced transactions. However, accumulating the results in __shared__ limits the size, and particularly the density, of the region I can afford to map. I could write a more general kernel if, instead, I could write the results directly to __global__. This could also make better use of the remaining L1 cache. I am allocating as much as I can to __shared__, and what is left is not being hit by any of my other memory transactions–I read with __ldcv(). However, each grid point is not necessarily complete the first time it is written–the final value may change over subsequent cycles, but I can guarantee that these cycles will be done by the same thread block, and separated by __syncthreads(). Would it be safe to write to __global__ in this case, and let it use whatever L1 / L2 resources are available to expedite retrieval of a previously stored value?

As I said, it would make my kernel much more general, possibly a bit more performant given how infrequent the stores of results from (3.) are compared to the processing and __shared__ memory accesses leading up to them. I have been told, by someone who knows a lot and quoted the manual, that __syncthreads()will guarantee coherence in __global__ writes at the level of the block, but I’ve also heard contrary opinions and another experienced CUDA programmer who insists one should not trust such coherence.

dscerutti · October 14, 2023, 6:53pm

Actually, I realized that I have been doing precisely this in other contexts. It can be quite fast, though perhaps a few percent slower than to use __shared__ for the same purpose. If the traffic is light and the space is not taking a lot of atomic adds (which pull L2 into the picture) then it is viable. However, other discoveries have led me to conclude that the algorithm which I was considering is itself not what I want to do, regardless of the Wild West L1 usage, so I am going to close this topic.

Topic		Replies	Views
__syncthreads() and global memory CUDA Programming and Performance	1	2521	December 1, 2008
How to write the large amount of result from threads performance effectively in cuda CUDA Programming and Performance	6	1248	October 9, 2014
numbers of write to global memory for each thread CUDA Programming and Performance	3	2163	March 31, 2008
Conditional write to global memory CUDA Programming and Performance	3	3541	September 4, 2007
Good programming practice Writing shared & global memory CUDA Programming and Performance	13	8121	July 20, 2007
Writes in same memory location Cant add numbers from different threads? CUDA Programming and Performance	46	26109	July 5, 2007
Global memory write cost CUDA Programming and Performance	4	8036	March 11, 2011
Memory coalescing in one thread CUDA Programming and Performance	17	16874	March 31, 2011
Non coalesced read/write in global vs shared CUDA Programming and Performance	12	4738	May 12, 2015
Memory Instructions (kernel) device <-> shared, device <-> device CUDA Programming and Performance	4	2566	October 10, 2007

Dare I use L1 in this way?

Related topics