I am writing a time-marching finite difference program. I have two arrays, AA and BB. The data in BB can be expressed in (roughly) the following form:
BB[ii] = AA[ii-1] + AA[ii] + AA[ii+1]
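As a kernel, what I do looks roughly like this (names and boundary handling are placeholders, not my exact code):

```cuda
// One thread per array element; interior points only in this sketch.
__global__ void step(const float *AA, float *BB, int n)
{
    int ii = blockIdx.x * blockDim.x + threadIdx.x;
    if (ii >= 1 && ii < n - 1)   // skip the boundary points
        BB[ii] = AA[ii - 1] + AA[ii] + AA[ii + 1];
}
```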
Let's say I do this once, by launching a grid of blocks of threads and using the thread id as an index into the arrays. Now I want to swap the arrays AA and BB and repeat the process. Before doing so, I need to be sure that all threads in the grid (not just in the block) have finished. Is there some sort of grid-level synchronization, or a hack to achieve it? I read somewhere that the CUDA model does not allow grid-level synchronization. Is this correct?
Currently, I execute one step of the above process on the GPU, return to the CPU, swap the pointers AA and BB there, and then relaunch the kernel on the GPU. I suspect this repeated stopping and restarting costs some efficiency.
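Roughly, my host-side loop looks like this (a sketch; variable names and launch configuration are placeholders):

```cuda
// d_a and d_b are device buffers allocated and initialized elsewhere.
float *AA = d_a, *BB = d_b;
for (int t = 0; t < nsteps; ++t) {
    step<<<nblocks, nthreads>>>(AA, BB, n);  // one kernel launch per time step
    float *tmp = AA; AA = BB; BB = tmp;      // swap pointer roles on the CPU
}
cudaDeviceSynchronize();                     // wait before reading results back
```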
Is there a better way to do this?