Procedure for problem sizes larger than the GPU's DRAM (with managed memory)

I am considering extending my program, which currently fits on a single GPU (a P100) for the problem sizes of interest, to larger problem sizes. However, doubling the size in each of the three dimensions of my problem (a factor of eight overall) would bring the memory footprint to roughly 100GB, which exceeds the DRAM on the GPU.

My understanding of the general procedure is to divide up the problem (arrays) into small enough pieces such that two fit on the GPU at a time, and run each kernel on one sub-array while asynchronously(?) loading the next sub-array to be processed.
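To make the "two pieces fit at a time" constraint concrete, here is a rough sketch of the sizing arithmetic in host-side C++. It assumes the 3D array is split into slabs along its slowest-varying (z) dimension; the names `SlabPlan` and `plan_slabs`, and the idea of a single `bytes_per_cell` covering all fields, are hypothetical and not taken from my actual program. It also ignores any extra planes needed for padding.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical plan for splitting an nx*ny*nz array into slabs along z.
struct SlabPlan {
    std::size_t planes_per_slab;  // z-planes in a full slab
    std::size_t num_slabs;        // number of slabs (the last may be shorter)
};

// Pick the largest slab such that `resident` slabs fit in `budget_bytes`.
// `bytes_per_cell` is assumed to cover every field stored per grid cell.
SlabPlan plan_slabs(std::size_t nx, std::size_t ny, std::size_t nz,
                    std::size_t bytes_per_cell, std::size_t budget_bytes,
                    std::size_t resident = 2) {
    const std::size_t plane_bytes = nx * ny * bytes_per_cell;
    if (plane_bytes == 0 || resident == 0) {
        throw std::invalid_argument("empty plane or resident == 0");
    }
    // Integer division rounds down, so the slabs never exceed the budget.
    std::size_t planes = budget_bytes / (resident * plane_bytes);
    if (planes == 0) {
        throw std::runtime_error("a single z-plane does not fit in the budget");
    }
    if (planes > nz) planes = nz;  // whole problem fits in one slab
    // Ceiling division: the last slab picks up the remainder.
    return {planes, (nz + planes - 1) / planes};
}
```

For example, with a 4 x 4 x 10 grid, 8 bytes per cell and a 1024-byte budget, each plane takes 128 bytes, so two resident slabs of 4 planes each fit and the domain is covered by 3 slabs. The budget passed in would need to leave room for whatever else lives on the device.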

My general question is whether there are any canonical procedures/practices to be aware of. More specifically, I am also wondering what is the simplest (programming-wise) and most efficient way to use managed memory in this implementation.

My last concern is the treatment of periodic boundary conditions for kernels which depend on neighboring data. Of course, for kernels which are local, i.e. data at a particular index in the global array only depends on other data at that same index, this is not a concern. What would be the best treatment of periodic boundaries for non-local kernels? The one idea I have is to pad each sub-array, which would be moderately messy in terms of array indexing - is there a better way?
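To illustrate the padding idea, here is a minimal host-side sketch in C++, assuming slabs are cut along z and stored plane by plane. The helper names `wrap` and `gather_slab` are hypothetical. The periodic wrap is applied once, while filling a staging buffer, so that a kernel working on the slab would not need any wrap logic in z.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Wrap a possibly negative plane index into [0, n) for periodic boundaries.
inline std::size_t wrap(long long i, std::size_t n) {
    const long long m = static_cast<long long>(n);
    return static_cast<std::size_t>(((i % m) + m) % m);
}

// Copy z-planes [z0 - halo, z0 + planes + halo) of the global array into a
// contiguous buffer, wrapping periodically at both ends of the z range.
// `plane_size` is the number of elements in one z-plane (nx * ny).
std::vector<double> gather_slab(const std::vector<double>& global,
                                std::size_t plane_size, std::size_t nz,
                                std::size_t z0, std::size_t planes,
                                std::size_t halo) {
    std::vector<double> slab((planes + 2 * halo) * plane_size);
    for (std::size_t k = 0; k < planes + 2 * halo; ++k) {
        // Global plane index of local plane k; may fall outside [0, nz).
        const long long gz = static_cast<long long>(z0) +
                             static_cast<long long>(k) -
                             static_cast<long long>(halo);
        const std::size_t src = wrap(gz, nz) * plane_size;
        std::copy(global.begin() + src, global.begin() + src + plane_size,
                  slab.begin() + k * plane_size);
    }
    return slab;
}
```

With this layout the interior of the slab is local planes `halo` through `halo + planes - 1`, and a stencil of radius `halo` in z can read its neighbors with plain offsets. Since whole x-y planes are resident in each slab, periodic wrapping in x and y can stay as it is in the existing kernels; only the z indexing shifts by `halo`.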

Alternatively, consider wavelet compression and a multiresolution approach in the data domain. SpaceX has done some amazing research here. It is definitely not the simplest approach (programming-wise), but it is very powerful.

It allows them to fit much more data on the GPU than they could otherwise.

Thanks, but I’m only looking to scale my current program, not change the methodology.