array of matrices coded as structs

Dear All,

Sorry if this is a silly question, but the discussion of slowdowns caused by non-coalesced memory access has got me really worried.
I’m working on porting some C++ code to run on a GPU using CUDA. Essentially, my parallel task is to apply the same operation to each
element of a long array of matrices encoded as structs, i.e. (n, m and np are known at compile time, but otherwise general):

struct nbym_mat{
    float vals[n*m];
};

and the serial code looks something like

nbym_mat arr[np];
for(int i=0;i<np;i++){
    some_function(arr[i]);
}

I realize that I should write ‘some_function’ as a kernel and distribute the loop across threads on the GPU.
However, what would be the best way to lay out this data structure so that memory access is as fast as possible?
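
To make the layout question concrete, here is a minimal host-side C++ sketch (an illustration only, not a definitive answer) comparing the array-of-structs layout above with an element-major layout in which element k of every matrix is stored contiguously. The sizes and the names aos_index, soa_index and to_soa are hypothetical and not part of my actual code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sizes for illustration; in the real code n, m and np
// are whatever compile-time constants the application uses.
constexpr std::size_t n = 2, m = 3, np = 4;

// Array-of-structs (the layout in the question): matrix i occupies
// n*m consecutive floats, so element k of matrix i sits at i*(n*m) + k.
constexpr std::size_t aos_index(std::size_t i, std::size_t k) {
    return i * (n * m) + k;
}

// Element-major layout: element k of all np matrices is stored
// contiguously, so element k of matrix i sits at k*np + i.
constexpr std::size_t soa_index(std::size_t i, std::size_t k) {
    return k * np + i;
}

// Repack a flat array-of-structs buffer into the element-major layout.
// Assumes aos.size() == np * n * m.
std::vector<float> to_soa(const std::vector<float>& aos) {
    std::vector<float> soa(aos.size());
    for (std::size_t i = 0; i < np; ++i)
        for (std::size_t k = 0; k < n * m; ++k)
            soa[soa_index(i, k)] = aos[aos_index(i, k)];
    return soa;
}
```

If one thread handles matrix i, then neighbouring threads reading the same element k are n*m floats apart in the array-of-structs layout but only one float apart in the element-major layout, and adjacent addresses across neighbouring threads are what coalescing relies on.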

Best regards and thanks in advance,