Structs and arrays (AoS and SoA) in CUDA, specifically for finite differencing

I have a finite difference code that works with several (usually two) NxNxN arrays, each holding m values (usually between 1 and 10) at every grid site. I find that storing a site's m values adjacently in memory is fastest, even though the kernel's memory accesses are then not contiguous: each thread of my 3D kernel loops over loading its m local values, so I imagine the global reads of consecutive threads are strided by m. However, integrating each of the m variables requires all of the other m-1 local values, which I think is why keeping them close together is optimal. I store the arrays as 1D arrays via an indexing function. Since this is finite differencing, I probably also generate a ton of bank conflicts when accessing shared memory to compute stencils.
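To make this concrete, here's roughly what my current layout looks like (N, M, and all the names here are made up for illustration; just a sketch, not my actual code):

```
#define N 128
#define M 4  // m values per grid site

// f is the fastest index: a site's M values are contiguous in memory
__host__ __device__ inline size_t idx(int i, int j, int k, int f)
{
    return (((size_t)i * N + j) * N + k) * M + f;
}

__global__ void load_local(const float* __restrict__ u, float* out)
{
    int i = blockIdx.z * blockDim.z + threadIdx.z;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N || j >= N || k >= N) return;

    float local[M];
    // Consecutive threads (consecutive k) read addresses M floats apart,
    // so each of these loads is strided by M rather than fully coalesced.
    for (int f = 0; f < M; ++f)
        local[f] = u[idx(i, j, k, f)];

    // ... the integration using all M local values would go here ...
    float s = 0.f;
    for (int f = 0; f < M; ++f) s += local[f];
    out[idx(i, j, k, 0)] = s;
}
```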

I've been annoyed that some of my device functions assume the arrays passed to them are shared memory arrays, while others assume they're the global arrays. I figure that passing structures of the m local values to these functions would resolve this, i.e. "abstract" the function's definition away from where the data lives. So instead of making the m local values the fastest index of my multi-dimensional array (which, again, is coded as a 1D array), I would have NxNxN arrays (again, usually two) of structures, each structure holding the m local values. I would lay out a kernel's shared memory identically.
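Something like this is what I have in mind for the struct version (continuing the made-up N and M from the sketch above, and assuming an 8x8x8 thread block; again just a sketch):

```
// One NxNxN array of Site structs; a site's M values are still adjacent
struct Site {
    float v[M];
};

// Works identically whether s lives in global or shared memory
__device__ inline float combine(const Site& s)
{
    float r = 0.f;
    for (int f = 0; f < M; ++f) r += s.v[f];
    return r;
}

__global__ void kernel(const Site* __restrict__ grid, float* out)
{
    __shared__ Site tile[8][8][8];  // shared memory structured identically

    int i = blockIdx.z * blockDim.z + threadIdx.z;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N || j >= N || k >= N) return;

    // One struct copy per thread (a real stencil would also load halo sites)
    tile[threadIdx.z][threadIdx.y][threadIdx.x] =
        grid[((size_t)i * N + j) * N + k];
    __syncthreads();

    out[((size_t)i * N + j) * N + k] =
        combine(tile[threadIdx.z][threadIdx.y][threadIdx.x]);
}
```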

I'm struggling to understand the effect this would have on memory transfers, i.e. on kernels loading data into shared memory. I'm also trying to figure out what the optimal approach is, since the stencil computations generate many bank conflicts. And I've been confused by comparisons of AoS and SoA programming, especially the claims that AoS is bad for HPC. Any guidance, references, or examples would be appreciated.
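For reference, my understanding of the SoA alternative that people compare against is below (same made-up names as above; if I have this wrong, corrections welcome). Each per-field load becomes contiguous across consecutive threads, at the cost of scattering a site's m values across m separate planes of memory:

```
// SoA-style layout: f is the slowest index
__host__ __device__ inline size_t idx_soa(int i, int j, int k, int f)
{
    return (((size_t)f * N + i) * N + j) * N + k;
}

__global__ void load_soa(const float* __restrict__ u, float* out)
{
    int i = blockIdx.z * blockDim.z + threadIdx.z;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= N || j >= N || k >= N) return;

    float local[M];
    // Within each iteration, consecutive threads (consecutive k) hit
    // consecutive addresses, so every load is fully coalesced.
    for (int f = 0; f < M; ++f)
        local[f] = u[idx_soa(i, j, k, f)];

    float s = 0.f;
    for (int f = 0; f < M; ++f) s += local[f];
    out[idx_soa(i, j, k, 0)] = s;
}
```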
