I am trying to port a C code which uses a Runge-Kutta scheme to CUDA. I have created a theoretical micro-problem to help guide me:
1) Generate a linear array on the host, with the beginning of the array representing the initial conditions at t = 0 (and the rest zeros).
2) Transfer the linear array to the device.
3) Identify the next timestep's section of the array (say t = 1) and split it in half (these halves represent the data that will be computed by blockIdx.x = 0 and blockIdx.x = 1).
4) Simultaneously calculate both halves of the domain using the 2 blocks.
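To make the micro-problem concrete, here is a minimal host-side C sketch of steps 1–4 that emulates the two blocks serially. The sizes `NX`/`NT` and the 3-point average stencil are placeholders I picked for illustration, not the real RK stages:

```c
#include <string.h>

#define NX 4   /* spatial cells per timestep (tiny, for illustration) */
#define NT 3   /* timesteps stored back-to-back in the linear array  */

/* Stand-in for one RK stage: a 3-point average stencil with clamped
   edges (the real scheme would apply the RK coefficients here). */
double stencil(const double *prev, int i)
{
    int l = (i > 0) ? i - 1 : 0;
    int r = (i < NX - 1) ? i + 1 : NX - 1;
    return (prev[l] + prev[i] + prev[r]) / 3.0;
}

/* Emulate one "block": update cells [lo, hi) of timestep t from t-1. */
void block_update(double *u, int t, int lo, int hi)
{
    const double *prev = u + (t - 1) * NX;  /* previous timestep's row */
    double *cur = u + t * NX;               /* row being written       */
    for (int i = lo; i < hi; ++i)
        cur[i] = stencil(prev, i);
}

/* Steps 1-4: initial conditions at t = 0, then two "blocks" each
   advance half the domain for every later timestep. */
void run(double *u /* NT * NX cells */)
{
    memset(u, 0, NT * NX * sizeof *u);
    u[0] = 1.0;                             /* initial condition      */
    for (int t = 1; t < NT; ++t) {
        block_update(u, t, 0, NX / 2);      /* "blockIdx.x = 0" */
        block_update(u, t, NX / 2, NX);     /* "blockIdx.x = 1" */
    }
}
```

Note that `block_update` for the second half reads `prev[NX/2 - 1]`, a cell written by the other "block" at the previous timestep, which is exactly the boundary dependency described below.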
Steps 3 and 4 can be repeated to generate as many timesteps as needed, but this is where I am perplexed. For t = 2, the R-K scheme used by each block has to refer to data generated by the other block. This happens right at the halfway point, where the approximation needs to read cells to the left and right (from the previous timestep). For blockIdx.x = 0, at any t > 1, this means referencing the data stored in the first cell of blockIdx.x = 1 (at the previous timestep). I anticipated having to use a creative ghost-cell technique to account for the fact that the memory of the 2 blocks does not overlap, yet so far my results have been correct without it.
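For reference, the ghost-cell workaround I had anticipated would look roughly like this host-side C sketch: each block works on a private buffer padded by one halo cell per side, filled from the neighbour's previous-timestep data before the update. The buffer size `H` and helper names are hypothetical, chosen just to show the idea:

```c
#include <string.h>

#define H 4   /* cells owned by one half-domain (hypothetical size) */

/* Copy a block's own H cells into a padded local buffer of H + 2
   cells, placing the neighbours' boundary values in the ghost slots. */
void fill_ghosts(double *local /* H + 2 cells */,
                 const double *mine, double left, double right)
{
    local[0] = left;                        /* ghost from left neighbour  */
    memcpy(local + 1, mine, H * sizeof *mine);
    local[H + 1] = right;                   /* ghost from right neighbour */
}

/* 3-point average over the padded buffer; valid for i in [1, H]. */
double stencil3(const double *local, int i)
{
    return (local[i - 1] + local[i] + local[i + 1]) / 3.0;
}
```

With the halos filled, each half-domain can be updated without touching the other half's storage, which is why I expected to need this when the blocks' memories don't overlap.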
Am I wrong in thinking that each block has its own memory?