I have a very fancy new kernel which, if successful, will turn what is probably the hardest memory-bound part of the molecular simulations workflow (the particle-mesh mapping) into a compute-bound problem, eliminating, in the process, the atomicAdd() instructions that are themselves somewhat costly. In particle-mesh mapping, the coordinates of each particle are converted into a mesh element index {a, b, c} and a fractional coordinate within that mesh element {da, db, dc}. B-spline coefficients of order N are then computed and applied to map the particle’s density onto mesh points {a+0, b+0, c+0}, {a+1, b+0, c+0}, … {a+N-1, b+0, c+0}, …, {a+N-1, b+N-1, c+N-1}. The traditional way to do this is to cudaMemset() the grid to zero, then have each particle fire off N^3 atomicAdd() operations at the applicable mesh points. For bitwise reproducibility, we use fixed-precision representations of the density, then use a second kernel to re-read the mesh and convert back to real-valued representations before using FFTs to convolve the density with a Green’s function, get a potential, and back-calculate the energy and force on each particle, yadda yadda. The real holdup is the particle-mesh density spreading: the FFTs are performant thanks to cuFFT, and the backward interpolation is much faster because there is no accumulation, no atomic operations, and no conversion to do.
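For concreteness, here is a minimal sketch of that traditional kernel with order N = 4 hard-coded; the function names, the float4 coordinate layout, and the 2^32 fixed-precision scaling are illustrative choices on my part, not the actual production code:

```cuda
// Cubic (order 4) B-spline weights for a fractional offset x in [0, 1).
__device__ void bspline4(float x, float t[4]) {
  float x2 = x * x;
  float x3 = x2 * x;
  t[0] = (1.0f - 3.0f * x + 3.0f * x2 - x3) / 6.0f;
  t[1] = (4.0f - 6.0f * x2 + 3.0f * x3) / 6.0f;
  t[2] = (1.0f + 3.0f * x + 3.0f * x2 - 3.0f * x3) / 6.0f;
  t[3] = x3 / 6.0f;
}

// Traditional spreading: the grid is zeroed with cudaMemset() beforehand,
// then each particle fires off 4^3 = 64 atomicAdd() operations. Density
// accumulates in a fixed-precision (scaled 64-bit integer) grid so that
// the order of the atomics cannot change the result bit-for-bit.
// Coordinates are assumed pre-scaled to mesh units and wrapped into the
// primary box; p.w holds the particle's density (charge).
__global__ void spreadDensity(const float4 *crd, int natom,
                              int nx, int ny, int nz, long long *grid) {
  const float fpScale = 4294967296.0f;  // 2^32 fixed-precision scaling
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= natom) {
    return;
  }
  float4 p = crd[i];
  int a = (int)floorf(p.x);
  int b = (int)floorf(p.y);
  int c = (int)floorf(p.z);
  float tx[4], ty[4], tz[4];
  bspline4(p.x - (float)a, tx);
  bspline4(p.y - (float)b, ty);
  bspline4(p.z - (float)c, tz);
  for (int k = 0; k < 4; k++) {
    for (int j = 0; j < 4; j++) {
      for (int l = 0; l < 4; l++) {
        float q = p.w * tx[l] * ty[j] * tz[k];
        size_t gidx = ((size_t)((c + k) % nz) * ny +
                       (size_t)((b + j) % ny)) * (size_t)nx +
                      (size_t)((a + l) % nx);
        // atomicAdd() has no signed 64-bit overload; two's complement
        // addition through the unsigned form is equivalent.
        atomicAdd((unsigned long long *)&grid[gidx],
                  (unsigned long long)(long long)llrintf(q * fpScale));
      }
    }
  }
}
```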
So I’ve designed this thing in some sneaky ways, and while there are constraints on its applicability, it does hit the regime we want in terms of mesh discretization and interpolation order(s). It accumulates the entire density for particular mesh points in __shared__, with no need to initialize the grid with cudaMemset() or convert the data with a second kernel. But it relies on a lot of __shared__ memory (the more the better) because, for any given region of space where I want to completely calculate the density, the region that density can come from is slightly larger. If the space I want to map is, say, 24 x 12 x 12 grid points (these are actual numbers), then the space I need to take contributions from has a volume of 28 x 16 x 16 grid points. Taking bigger chunks at a time is therefore preferable, and this is where the __shared__ memory limits come into play.
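To put numbers on that (the accumulator width is my assumption here, not a given):

```cuda
// Shared-memory footprint of one tile, including the halo of mesh points
// that can receive density from particles outside the fully mapped region.
// The 24 x 12 x 12 and 28 x 16 x 16 figures are the ones quoted above.
constexpr int kSrcX = 28, kSrcY = 16, kSrcZ = 16;  // fully maps 24 x 12 x 12
constexpr size_t kPoints = (size_t)kSrcX * kSrcY * kSrcZ;  // 7168 points

constexpr size_t kBytes32 = kPoints * sizeof(int);        // 28672 B (28 kB)
constexpr size_t kBytes64 = kPoints * sizeof(long long);  // 57344 B (56 kB)
// With 64-bit fixed-precision accumulators, a single tile already exceeds
// the 48 kB static __shared__ limit, which is what pushes me toward the
// dynamic allocation discussed below.
```

I want to make sure that I have my facts straight: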
My understanding of __shared__ memory is that it started as a “48kB __shared__ / 16kB L1 or 16kB __shared__ / 48kB L1” deal in the first devices with a configurable cache (the Fermi generation), then generalized a bit when Kepler added the option of an even 32kB / 32kB split of that small 64kB space. Then L1 got bigger (128kB, where it remains today for the commodity devices; I’m going to ignore the data-center grade X100 cards with 192+kB of L1 for now), and we got more carveout options: 0kB of __shared__, or 8kB, 16kB, 32kB, 64kB, or 100kB (and up to 164kB on the newest X100 cards). These limits apply to each streaming multiprocessor, and any given thread block has always had a limit of 48kB of statically allocated __shared__ that it can address, but even on a commodity card we can rope off just over 3/4 of the total physical L1 space to work as __shared__. Then came opt-in, dynamically allocated __shared__ memory, which since Volta (compute capability 7.0) has allowed a single thread block to address essentially the entire volume of __shared__ that the SM can support (less a small per-block reservation on some architectures). It also needs to be understood that __shared__ and L1 are exclusive: requesting 32769 bytes of __shared__ across three thread blocks (yes, that number is divisible by three) would rope off 64kB as __shared__, because the next carveout down, 32kB, is one byte too small. That essentially wastes 1/4 of the total L1 space, so when you ask for __shared__ memory, make sure you use it.
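One piece I want to confirm is the opt-in mechanics. As I understand the runtime API (the kernel name below is a placeholder), going past 48kB of dynamic __shared__ requires setting a function attribute before launch:

```cuda
#include <cuda_runtime.h>

extern __global__ void densityKernel(long long *grid, int natom);

void configureSharedMemory(int device) {
  // The per-block opt-in ceiling (e.g. 98304 bytes on 7.0, 101376 on 8.6).
  int maxOptin = 0;
  cudaDeviceGetAttribute(&maxOptin,
                         cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

  // Without this attribute, launching with more than 48 kB of dynamic
  // __shared__ fails outright.
  cudaFuncSetAttribute(densityKernel,
                       cudaFuncAttributeMaxDynamicSharedMemorySize,
                       maxOptin);

  // Hint the carveout toward the maximum __shared__ split; the driver
  // rounds to one of the supported configurations.
  cudaFuncSetAttribute(densityKernel,
                       cudaFuncAttributePreferredSharedMemoryCarveout,
                       cudaSharedmemCarveoutMaxShared);
}

// The dynamic allocation is then requested at launch:
//   densityKernel<<<nBlocks, 1024, maxOptin>>>(grid, natom);
```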
Given that bigger is better, both for load balancing and for the efficiency of the mapping, I may want a single 1024-thread block that utilizes the entire dynamically allocated __shared__ space. But if I instead run two blocks per streaming multiprocessor, each utilizing just under 48kB of statically allocated __shared__, is that going to work on recent (compute capability 7.0 and later) devices?
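For what it’s worth, my plan for sanity-checking any given configuration is the occupancy API (again with the placeholder kernel name from above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

extern __global__ void densityKernel(long long *grid, int natom);

// Query how many blocks of a given size and dynamic __shared__ request can
// be co-resident on one SM. Statically declared __shared__ inside the
// kernel is accounted for automatically through the function's attributes,
// as are the register and thread-count limits.
void reportResidency(int threadsPerBlock, size_t dynamicSmemBytes) {
  int numBlocks = 0;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &numBlocks, densityKernel, threadsPerBlock, dynamicSmemBytes);
  printf("%d block(s) of %d threads resident per SM with %zu dynamic "
         "__shared__ bytes\n", numBlocks, threadsPerBlock, dynamicSmemBytes);
}
```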