I have a very fancy new kernel which, if successful, will turn what is probably the hardest memory-bound part of the molecular simulations workflow (the particle-mesh mapping) into a compute-bound problem, eliminating, in the process, the atomicAdd() instructions that are themselves somewhat costly. In particle-mesh mapping, the coordinates of each particle are converted into a mesh element index {a, b, c} and a fractional coordinate within that mesh element {da, db, dc}. B-spline coefficients of order N are then computed and applied to map the particle’s density onto mesh points {a+0, b+0, c+0}, {a+1, b+0, c+0}, … {a+N-1, b+0, c+0}, …, {a+N-1, b+N-1, c+N-1}. The traditional way to do this is to cudaMemset() the grid to zero, then have each particle fire off N^3 atomicAdd() operations at the applicable mesh points. For bitwise reproducibility, we use fixed-precision representations of the density, then use a second kernel to re-read the mesh and convert back to real-valued representations before using FFTs to convolve the density with a Green’s function, get a potential, and back-calculate the energy and force on each particle, yadda yadda. The real holdup is the particle-mesh density spreading: the FFTs are performant thanks to cuFFT, and the backward interpolation is much faster because there is no accumulation, no atomic operations, and no conversion to do.
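For concreteness, here is a minimal sketch of that traditional kernel with order N = 4 hard-coded; the function names, the float4 coordinate layout, and the 2^32 fixed-precision scaling are illustrative choices on my part, not the actual production code:

```cuda
// Cubic (order 4) B-spline weights for a fractional offset x in [0, 1).
__device__ void bspline4(float x, float t[4]) {
  float x2 = x * x;
  float x3 = x2 * x;
  t[0] = (1.0f - 3.0f * x + 3.0f * x2 - x3) / 6.0f;
  t[1] = (4.0f - 6.0f * x2 + 3.0f * x3) / 6.0f;
  t[2] = (1.0f + 3.0f * x + 3.0f * x2 - 3.0f * x3) / 6.0f;
  t[3] = x3 / 6.0f;
}

// Traditional spreading: the grid is zeroed with cudaMemset() beforehand,
// then each particle fires off 4^3 = 64 atomicAdd() operations. Density
// accumulates in a fixed-precision (scaled 64-bit integer) grid so that
// the order of the atomics cannot change the result bit-for-bit.
// Coordinates are assumed pre-scaled to mesh units and wrapped into the
// primary box; p.w holds the particle's density (charge).
__global__ void spreadDensity(const float4 *crd, int natom,
                              int nx, int ny, int nz, long long *grid) {
  const float fpScale = 4294967296.0f;  // 2^32 fixed-precision scaling
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= natom) {
    return;
  }
  float4 p = crd[i];
  int a = (int)floorf(p.x);
  int b = (int)floorf(p.y);
  int c = (int)floorf(p.z);
  float tx[4], ty[4], tz[4];
  bspline4(p.x - (float)a, tx);
  bspline4(p.y - (float)b, ty);
  bspline4(p.z - (float)c, tz);
  for (int k = 0; k < 4; k++) {
    for (int j = 0; j < 4; j++) {
      for (int l = 0; l < 4; l++) {
        float q = p.w * tx[l] * ty[j] * tz[k];
        size_t gidx = ((size_t)((c + k) % nz) * ny +
                       (size_t)((b + j) % ny)) * (size_t)nx +
                      (size_t)((a + l) % nx);
        // atomicAdd() has no signed 64-bit overload; two's complement
        // addition through the unsigned form is equivalent.
        atomicAdd((unsigned long long *)&grid[gidx],
                  (unsigned long long)(long long)llrintf(q * fpScale));
      }
    }
  }
}
```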
So I’ve designed this thing in some sneaky ways, and while there are constraints on its applicability, it does hit the regime we want in terms of mesh discretization and interpolation order(s). It accumulates the entire density for particular mesh points in __shared__, with no need to initialize the grid with cudaMemset() or convert the data with a second kernel. But it relies on a lot of __shared__ memory (the more the better) because, for any given region of space where I want to completely calculate the density, the region that density can come from is slightly larger. If the space I want to map is, say, 24 x 12 x 12 grid points (these are actual numbers), then the space I need to take contributions from has a volume of 28 x 16 x 16 grid points. Taking bigger chunks at a time is therefore preferable, and this is where the __shared__ memory limits come into play.
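To put numbers on that (the accumulator width is my assumption here, not a given):

```cuda
// Shared-memory footprint of one tile, including the halo of mesh points
// that can receive density from particles outside the fully mapped region.
// The 24 x 12 x 12 and 28 x 16 x 16 figures are the ones quoted above.
constexpr int kSrcX = 28, kSrcY = 16, kSrcZ = 16;  // fully maps 24 x 12 x 12
constexpr size_t kPoints = (size_t)kSrcX * kSrcY * kSrcZ;  // 7168 points

constexpr size_t kBytes32 = kPoints * sizeof(int);        // 28672 B (28 kB)
constexpr size_t kBytes64 = kPoints * sizeof(long long);  // 57344 B (56 kB)
// With 64-bit fixed-precision accumulators, a single tile already exceeds
// the 48 kB static __shared__ limit, which is what pushes me toward the
// dynamic allocation discussed below.
```

I want to make sure that I have my facts straight: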
My understanding of __shared__ memory is that it started as a “48kB __shared__ / 16kB L1 or 16kB __shared__ / 48kB L1” deal in the first devices with a configurable cache (the Fermi generation), then generalized a bit when Kepler added the option of an even 32kB / 32kB split of that small 64kB space. Then L1 got bigger (128kB, where it remains today for the commodity devices; I’m going to ignore the data-center grade X100 cards with 192+kB of L1 for now), and we got more carveout options: 0kB of __shared__, or 8kB, 16kB, 32kB, 64kB, or 100kB (and up to 164kB on the newest X100 cards). These limits apply to each streaming multiprocessor, and any given thread block has always had a limit of 48kB of statically allocated __shared__ that it can address, but even on a commodity card we can rope off just over 3/4 of the total physical L1 space to work as __shared__. Then came opt-in, dynamically allocated __shared__ memory, which since Volta (compute capability 7.0) has allowed a single thread block to address essentially the entire volume of __shared__ that the SM can support (less a small per-block reservation on some architectures). It also needs to be understood that __shared__ and L1 are exclusive: requesting 32769 bytes of __shared__ across three thread blocks (yes, that number is divisible by three) would rope off 64kB as __shared__, because the next carveout down, 32kB, is one byte too small. That essentially wastes 1/4 of the total L1 space, so when you ask for __shared__ memory, make sure you use it.
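One piece I want to confirm is the opt-in mechanics. As I understand the runtime API (the kernel name below is a placeholder), going past 48kB of dynamic __shared__ requires setting a function attribute before launch:

```cuda
#include <cuda_runtime.h>

extern __global__ void densityKernel(long long *grid, int natom);

void configureSharedMemory(int device) {
  // The per-block opt-in ceiling (e.g. 98304 bytes on 7.0, 101376 on 8.6).
  int maxOptin = 0;
  cudaDeviceGetAttribute(&maxOptin,
                         cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

  // Without this attribute, launching with more than 48 kB of dynamic
  // __shared__ fails outright.
  cudaFuncSetAttribute(densityKernel,
                       cudaFuncAttributeMaxDynamicSharedMemorySize,
                       maxOptin);

  // Hint the carveout toward the maximum __shared__ split; the driver
  // rounds to one of the supported configurations.
  cudaFuncSetAttribute(densityKernel,
                       cudaFuncAttributePreferredSharedMemoryCarveout,
                       cudaSharedmemCarveoutMaxShared);
}

// The dynamic allocation is then requested at launch:
//   densityKernel<<<nBlocks, 1024, maxOptin>>>(grid, natom);
```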
Given that bigger is better, both for load balancing and for the efficiency of the mapping, I may want a single 1024-thread block that utilizes the entire dynamically allocated __shared__ space. But if I instead run two blocks per streaming multiprocessor, each utilizing just under 48kB of statically allocated __shared__, is that going to work on recent (compute capability 7.0 and later) devices?
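For what it’s worth, my plan for sanity-checking any given configuration is the occupancy API (again with the placeholder kernel name from above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

extern __global__ void densityKernel(long long *grid, int natom);

// Query how many blocks of a given size and dynamic __shared__ request can
// be co-resident on one SM. Statically declared __shared__ inside the
// kernel is accounted for automatically through the function's attributes,
// as are the register and thread-count limits.
void reportResidency(int threadsPerBlock, size_t dynamicSmemBytes) {
  int numBlocks = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &numBlocks, densityKernel, threadsPerBlock, dynamicSmemBytes);
  printf("%d block(s) of %d threads resident per SM with %zu dynamic "
         "__shared__ bytes\n", numBlocks, threadsPerBlock, dynamicSmemBytes);
}
```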