best way to copy slices of multi-dimensional array to host

Hi,

in my computations I use a linearized 3D array, and while all the number crunching is done entirely on the GPU, I need to copy subdomains, e.g. 2D slices or 3D volume subdomains of the field vectors, to the host for intermediate plots and analysis.
I suppose that “acc update host()” only allows specifying a contiguous sub-range of the array present on the device for copying to the host, e.g. a[4:10] but not a[4:10, 101:233, …]. Furthermore, the slices or subdomains are typically not contiguous due to the linearization of the 3D array.
What’s the most efficient way of copying such selected data to the host? Do I create some sort of buffer array into which I copy the needed field values from the device array, then copy the buffer array to the host, where I finally copy it into the host version of the 3D array? This is probably more efficient than doing “acc update host()” for every single array entry of interest. Or is there another way of handling such a scenario?
If the subdomains and slices cover sufficiently many array entries relative to the total size of the array, it is probably most efficient to copy over the whole array, but in my typical use case I am looking at just a handful of 2D slices of a very large 3D grid.

Thanks,
LS

Hi Lutz,

“Do I create some sort of buffer array in which I copy the needed field values from the device array before copying the buffer array to the host where I then need to copy it into the host version of the 3D array?”

Yes, this is what I’ve done in the past when doing halo passing. You waste a bit of memory and computation time, but it often improves overall performance since there are fewer memory transfers and the data moves in one large block.

Another thing to try is CUDA Unified Memory (-ta=tesla:managed). There are a lot of caveats (see: http://www.pgroup.com/lit/articles/insider/v6n2a4.htm), but as NVIDIA hardware improves and NVLink becomes available, most of the issues should go away.
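For reference, enabling it is just a compile-time switch, no source changes needed (the program name here is a placeholder):

```shell
# PGI spelling of the Unified Memory flag, as above; newer nvhpc
# compilers accept -gpu=managed instead. -Minfo=accel just prints
# what the compiler did with each accelerator region.
pgcc -acc -ta=tesla:managed -Minfo=accel myprog.c -o myprog
```

With managed memory the runtime migrates pages on demand, so your host-side plotting code can read the array directly without explicit update directives.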

Granted, a scatter/gather operation doesn’t take much effort to implement, but it’s worth giving CUDA Unified Memory a shot first before rewriting code.

  • Mat
