Structs and arrays (AoS and SoA) in CUDA, specifically for finite differencing

I have a finite difference code that works with several (usually two) NxNxN arrays, each holding m values (usually between 1 and 10) at every grid site. I find that storing a site's m values adjacently in memory is fastest, even though the kernel's memory accesses are then not contiguous: each thread of my 3D kernel loops over loading its m local values, so I imagine the global reads of consecutive threads are strided by m. However, integrating each of the m variables requires all of the other m-1 local values, which I think is why keeping them close together is optimal. I store the arrays as 1D arrays via an indexing function. Since this is finite differencing, I probably also generate a ton of bank conflicts when accessing shared memory to compute stencils.
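To make this concrete, here's roughly what my current layout looks like (N, M, and all the names here are made up for illustration; just a sketch, not my actual code):

```
#define N 128
#define M 4  // m values per grid site

// f is the fastest index: a site's M values are contiguous in memory
__host__ __device__ inline size_t idx(int i, int j, int k, int f)
{
    return (((size_t)i * N + j) * N + k) * M + f;
}

__global__ void load_local(const float* __restrict__ u, float* out)
{
    int i = blockIdx.z * blockDim.z + threadIdx.z;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N || j >= N || k >= N) return;

    float local[M];
    // Consecutive threads (consecutive k) read addresses M floats apart,
    // so each of these loads is strided by M rather than fully coalesced.
    for (int f = 0; f < M; ++f)
        local[f] = u[idx(i, j, k, f)];

    // ... the integration using all M local values would go here ...
    float s = 0.f;
    for (int f = 0; f < M; ++f) s += local[f];
    out[idx(i, j, k, 0)] = s;
}
```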

I've been annoyed that some of my device functions assume the arrays passed to them are shared memory arrays, while others assume they're the global arrays. I figure that passing structures of the m local values to these functions would resolve this, i.e. "abstract" the function's definition away from where the data lives. So instead of making the m local values the fastest index of my multi-dimensional array (which, again, is coded as a 1D array), I would have NxNxN arrays (again, usually two) of structures, each structure holding the m local values. I would lay out a kernel's shared memory identically.
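Something like this is what I have in mind for the struct version (continuing the made-up N and M from the sketch above, and assuming an 8x8x8 thread block; again just a sketch):

```
// One NxNxN array of Site structs; a site's M values are still adjacent
struct Site {
    float v[M];
};

// Works identically whether s lives in global or shared memory
__device__ inline float combine(const Site& s)
{
    float r = 0.f;
    for (int f = 0; f < M; ++f) r += s.v[f];
    return r;
}

__global__ void kernel(const Site* __restrict__ grid, float* out)
{
    __shared__ Site tile[8][8][8];  // shared memory structured identically

    int i = blockIdx.z * blockDim.z + threadIdx.z;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N || j >= N || k >= N) return;

    // One struct copy per thread (a real stencil would also load halo sites)
    tile[threadIdx.z][threadIdx.y][threadIdx.x] =
        grid[((size_t)i * N + j) * N + k];
    __syncthreads();

    out[((size_t)i * N + j) * N + k] =
        combine(tile[threadIdx.z][threadIdx.y][threadIdx.x]);
}
```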

I'm struggling to understand the effect this would have on memory transfers, i.e. on kernels loading data into shared memory. I'm also trying to figure out what the optimal approach is, since the stencil computations generate many bank conflicts. And I've been confused by comparisons of AoS and SoA programming, especially the claims that AoS is bad for HPC. Any guidance, references, or examples would be appreciated.
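For reference, my understanding of the SoA alternative that people compare against is below (same made-up names as above; if I have this wrong, corrections welcome). Each per-field load becomes contiguous across consecutive threads, at the cost of scattering a site's m values across m separate planes of memory:

```
// SoA-style layout: f is the slowest index
__host__ __device__ inline size_t idx_soa(int i, int j, int k, int f)
{
    return (((size_t)f * N + i) * N + j) * N + k;
}

__global__ void load_soa(const float* __restrict__ u, float* out)
{
    int i = blockIdx.z * blockDim.z + threadIdx.z;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N || j >= N || k >= N) return;

    float local[M];
    // Within each iteration, consecutive threads (consecutive k) hit
    // consecutive addresses, so every load is fully coalesced.
    for (int f = 0; f < M; ++f)
        local[f] = u[idx_soa(i, j, k, f)];

    float s = 0.f;
    for (int f = 0; f < M; ++f) s += local[f];
    out[idx_soa(i, j, k, 0)] = s;
}
```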
