Coalesced copy, strong typing, and equivalence

I have a large structure of “particles”. Each particle is 48 bytes long:

  type :: particle
    real r(3)                                             ! fractional positions
    real e                                                ! energy
    real p(3)                                             ! momenta
    real w                                                ! weight
    integer(kind=2) q(3)                                  ! integer positions
    integer(kind=2) bits                                  ! extra bits
    integer*8 i                                           ! id
  end type
  type(particle), dimension(np_total) :: gp               ! global particle array
  type(particle), dimension(np_stick) :: gs               ! stick particle array
  integer, dimension(nx,ny,nz) :: gi

and I can easily have ~4GB of particles on my C1060. The particle array is sorted and organized in cells. I have an index array gi(nx,ny,nz) that gives the index at which the particles of each cell start.

I need to copy out a z-stick of particles from the GPU to the CPU, that is, all particles in cells with a certain (jx,jy) coordinate. To minimize host-GPU data transfers I have a routine which selects the particles sitting in cells with coordinate (jx,jy) in the global array, gp, and copies them over to a contiguous array, gs. That array can then be transferred to the CPU.

Apart from the index juggling, what I want to do is move particles from gp to gs in coalesced transfers.

Right now I have

jz = ... ! index in z-column
np = ... ! nr of particles in the (jx,jy,jz)-cell
offp = ... ! offset in gp array for cell (jx,jy,jz)
offs = ... ! offset in gs array to copy to
it = threadidx%x
if (it <= np) then
  ip = threadidx%x + offp ! index in global array of particle it in cell (jx,jy,jz)
  is = threadidx%x + offs ! index in stick array
  gs(is) = gp(ip)
endif

Essentially each thread copies one particle. That means each thread will transfer 48 bytes which is terrible for coalescing.

My question is how to do it in such a way that each thread transfers 4-byte blocks, and the CUDA hardware is kept happy?

I could only think of two ways, which unfortunately are not supported with the current standard.

If routines were not strongly typed, I would be much better off. Then instead of

type(particle), dimension(np_total) :: gp ! global particle array
type(particle), dimension(np_stick) :: gs ! stick particle array

I could do

integer, dimension(np_total*12) :: gp ! global particle array
integer, dimension(np_stick*12) :: gs ! stick particle array

and copy away.
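A minimal sketch of the word-wise copy I have in mind, assuming the kernel could see both arrays as plain 4-byte integers (12 words per 48-byte particle). The module name stick_copy, the kernel name copy_cell_words and the parameter nw are made up for illustration; offp, offs and np have the same meaning as in my snippet above, and one thread block handles one cell.

```fortran
module stick_copy
  use cudafor
  implicit none
  integer, parameter :: nw = 12        ! 4-byte words per 48-byte particle
contains
  ! Copy the np particles of one cell as np*nw consecutive 4-byte words.
  ! offp and offs are zero-based particle offsets into gp and gs.
  attributes(global) subroutine copy_cell_words(gp, gs, offp, offs, np)
    integer :: gp(*), gs(*)            ! particle arrays viewed as integers (assumption)
    integer, value :: offp, offs, np
    integer :: iw
    ! Neighbouring threads touch neighbouring words; the loop strides by
    ! the block size until all words of the cell have been copied.
    do iw = threadidx%x, np*nw, blockdim%x
      gs(offs*nw + iw) = gp(offp*nw + iw)
    end do
  end subroutine copy_cell_words
end module stick_copy
```

With this layout thread 1 copies word 1, thread 2 copies word 2, and so on, so the threads of a half-warp read and write adjacent 4-byte addresses instead of addresses 48 bytes apart. The word index stays within a default 32-bit integer as long as the array holds fewer than 2**31 words (8 GB).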

Alternatively with equivalence in my variable declaration I could do something like:

type(particle), dimension(np_total) :: gp ! global particle array
type(particle), dimension(np_stick) :: gs ! stick particle array
integer, dimension(np_total*12) :: igp ! global particle array
integer, dimension(np_stick*12) :: igs ! stick particle array
equivalence (igp, gp)
equivalence (igs, gs)

and the problem would be solved too. In general, an implementation of equivalence would help a lot with memory transfers.

Any suggestions on how to accomplish coalesced transfers on type’d variables?

thanks in advance,

Troels

Hi Troels,

Unfortunately, I can’t think of anything better. If I understand correctly, gp is scattered (i.e., offp is not sequential with respect to the thread ids), so nothing can be done to help. Maybe you could do something with the store to gs, but I’m not sure it would be worth it.

  • Mat