Mystery Memory Transfer?

I am trying to port C code that uses a Runge-Kutta scheme to CUDA. I have created a theoretical micro-problem to guide me (a minimal sketch of these steps follows the list):

  1. Generate a linear array on the host, with the beginning of the array representing the initial conditions at t = 0 (and the rest is 0s).
  2. Transfer the linear array to the device.
  3. Identify the next time step in the array (say t = 1) and split this section in half; these halves represent the data that will be handled by blockIdx.x = 0 and blockIdx.x = 1.
  4. Simultaneously calculate both halves of the domain using the 2 blocks.
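Here is roughly what I mean, with hypothetical sizes and a placeholder copy standing in for the actual R-K update:

```
#include <cuda_runtime.h>

#define N 64              /* cells per time step (hypothetical size)   */
#define STEPS 8           /* time steps stored in the array (arbitrary) */

/* Hypothetical kernel: each of the 2 blocks updates its half of
   time step t from the data at time step t - 1. The real R-K
   computation is omitted; u holds STEPS rows of N cells. */
__global__ void step_kernel(float *u, int t)
{
    int half = blockDim.x;                     /* N / 2 threads per block */
    int i = blockIdx.x * half + threadIdx.x;   /* global cell index      */
    u[t * N + i] = u[(t - 1) * N + i];         /* placeholder update     */
}

int main(void)
{
    float h_u[STEPS * N] = {0};                /* rest of the array is 0s */
    for (int i = 0; i < N; ++i)
        h_u[i] = 1.0f;                         /* initial conditions, t = 0 */

    float *d_u;
    cudaMalloc(&d_u, sizeof(h_u));
    cudaMemcpy(d_u, h_u, sizeof(h_u), cudaMemcpyHostToDevice);

    /* 2 blocks, each covering half the domain, as in steps 3-4 */
    step_kernel<<<2, N / 2>>>(d_u, 1);
    cudaMemcpy(h_u, d_u, sizeof(h_u), cudaMemcpyDeviceToHost);
    cudaFree(d_u);
    return 0;
}
```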

Steps 3 and 4 can be repeated to generate as many time steps as needed, but this is where I am perplexed. For t = 2, the R-K scheme used by each block will have to refer to data that has been generated by the other block. This happens right at the halfway point, where the approximation method needs to refer to cells on the left and right (from the previous time step). For blockIdx.x = 0, at any t > 1, this requires referencing the data stored in the first cell of blockIdx.x = 1 (at the previous time step). I anticipated having to use a creative ghost-cell technique to account for the fact that the memory of the 2 blocks does not overlap, but so far my results have been good.
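To make the divider concrete, reusing N and the array layout from the sketch above, the neighbor reads look something like this (boundary handling at the domain edges omitted):

```
/* Hypothetical kernel fragment showing only the neighbor reads at
   time step t. The last thread of block 0 sits at global cell
   i = N/2 - 1, so its right neighbor i + 1 == N/2 is the first
   cell owned by block 1. */
__global__ void neighbor_reads(const float *u, int t)
{
    int half = blockDim.x;                     /* N / 2 */
    int i = blockIdx.x * half + threadIdx.x;   /* global cell index */
    if (i == 0 || i >= N - 1) return;          /* skip domain edges */
    float left  = u[(t - 1) * N + i - 1];
    /* for blockIdx.x == 0, threadIdx.x == half - 1, the read below
       crosses the divider into block 1's half of the array */
    float right = u[(t - 1) * N + i + 1];
    (void)left; (void)right;                   /* R-K update omitted */
}
```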

Am I wrong in thinking that each block has its own memory?

If these blocks write to global memory, it is visible to both of them, but you need to synchronize before reading a value.
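In classic CUDA (without cooperative groups) there is no safe way for one block to wait on another inside a single launch, but a kernel launch boundary is a grid-wide synchronization point. A minimal sketch of that, reusing the hypothetical step_kernel, d_u, N, and STEPS from the question's sketch:

```
/* One launch per time step: all writes from step t - 1 are complete
   and visible to every block before the launch for step t begins. */
for (int t = 1; t < STEPS; ++t)
    step_kernel<<<2, N / 2>>>(d_u, t);
cudaDeviceSynchronize();   /* host waits for the final step */
```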

So I’ve found that my results do blow up around the “dividers” between blocks; it just took more time steps than I would have guessed. Would it be worth it to write to global memory and sync, or would that cost me the time savings that the parallel blocks generate?

More than worth it; you need to do it. The time steps are synchronous for every point of your array. The most common solution is to use an input buffer and an output buffer to read/write the data, and to swap the buffers at each kernel call. Also try using texture memory for the input data and compare the results with the naive global-memory implementation. I would even try surface memory.
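A sketch of that ping-pong pattern, with hypothetical names and a simple averaging stencil standing in for the real R-K update: the kernel reads only from the previous step's buffer and writes only to the current step's buffer, so the reads that cross the divider can never race with a write.

```
#include <cuda_runtime.h>

#define N 64          /* cells per time step (hypothetical size) */
#define STEPS 100     /* number of time steps (hypothetical)     */

/* Reads only from in (previous step), writes only to out
   (current step), so cross-block neighbor reads are safe. */
__global__ void rk_step(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global cell  */
    if (i == 0 || i == N - 1)
        out[i] = in[i];                             /* domain edges */
    else
        out[i] = 0.5f * (in[i - 1] + in[i + 1]);    /* placeholder  */
}

/* Returns the device buffer that holds the final time step. */
float *integrate(float *d_a, float *d_b)
{
    float *in = d_a, *out = d_b;
    for (int t = 1; t < STEPS; ++t) {
        rk_step<<<2, N / 2>>>(in, out);
        float *tmp = in; in = out; out = tmp;       /* swap per step */
    }
    cudaDeviceSynchronize();
    return in;                                      /* latest data   */
}
```

A texture or surface path would change only how the in reads are performed; the buffer-swap structure stays the same.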