Assuming that this is an extended version of the program you posted at: OpenACC: Best way to parallelize nested DO loops with data dependency between loops?
It probably has some impact, but not much. With CUDA Unified Memory (-gpu=managed), data is only copied when it changes on either the host or the device, so in this case it only gets copied the first time through the timestep loop. You could hoist the data movement above the timestep loop with explicit data regions so it isn't included in your timer, but the overall time would be about the same.
More likely the problem is the same as in your other program: "nblocks" is only one, so there isn't enough work to keep the GPU busy. I used an nblocks size of 128 and see a significant speed-up, though I don't know if that's a reasonable size for your case.
-Mat