Dear All,

Sorry if this is a silly question, but the discussion of how non-coalesced memory access slows things down has got me really worried.

I’m working on porting some C++ code to run on a GPU using CUDA. Essentially, my parallel task is to apply the same operation to each element of a long array of matrices encoded as structs, i.e. (with n, m and np known at compile time, but otherwise general):

struct nbym_mat {
    float vals[n*m];
};

and the serial code looks something like

nbym_mat arr[np];
for (int i = 0; i < np; i++) {
    some_function(arr[i]);
}

I realize that I should write ‘some_function’ as a kernel and distribute the loop across threads on the GPU.

However, what would be the best way to lay out this data structure in memory so that access from the kernel is as fast as possible?

Best regards and thanks in advance,

Tore