Best data structure to optimize memory

Hi everyone,

I have a question. Consider a grid where each node has a number of variables, say numVar. I allocate a vector vars where the variables of each node are stored contiguously, that is (to simplify, consider numVar = 3):

vars = [a1, b1, c1, a2, b2, c2, …, aNumNodes, bNumNodes, cNumNodes]

Then I have a loop in time, where I proceed in two steps:
Step 1: operate on every node of the grid (I assign one thread per node, i.e., each thread operates on the a, b, c of its node).
Step 2: operate only on the first variable of each node.

The problem is that after Step 1 I copy vars from device to host, where I copy a1, a2, …, aNumNodes into aux, upload aux to the GPU again, and proceed with Step 2. After Step 2 I copy the modified aux from device to host, copy it back into the right positions in vars, and upload vars to the device again for the next time step. So I have two copies of size numVar*numNodes between CPU and GPU at every time step. I would like to improve my code, and I thought this could be a good point to start. (I'm already using cudaMallocHost for vars and aux.)

Then I started thinking about storing vars as vars = [a1, a2, …, aNumNodes, b1, …]. With this layout I would avoid those two expensive CPU-GPU copies, but then I would have to operate on my data differently, that is, with a stride of numNodes. Although the accesses would still be coalesced, I don't know whether this would hurt the performance of the kernels in Step 1, since each thread would no longer access a contiguous piece of memory. Which approach is better?