I have a large structure of “particles”. Each particle is 48 bytes long:

```
type :: particle
  real r(3)               ! fractional positions
  real e                  ! energy
  real p(3)               ! momenta
  real w                  ! weight
  integer(kind=2) q(3)    ! integer positions
  integer(kind=2) bits    ! extra bits
  integer*8 i             ! id
end type
type(particle), dimension(np_total) :: gp ! global particle array
type(particle), dimension(np_stick) :: gs ! stick particle array
integer, dimension(nx,ny,nz) :: gi
```
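The 48-byte figure (12 four-byte words) can be checked on the host with a small program. This is only a sketch: it assumes a compiler that provides the Fortran 2008 `storage_size` intrinsic and accepts `kind=2` and `kind=8` integers, and the program name `check_particle_size` is made up for illustration. If the compiler pads the type, the printed size will differ from 48.

```fortran
program check_particle_size
  implicit none

  ! Same components as the particle type above.
  type :: particle
    real            :: r(3)   ! fractional positions
    real            :: e      ! energy
    real            :: p(3)   ! momenta
    real            :: w      ! weight
    integer(kind=2) :: q(3)   ! integer positions
    integer(kind=2) :: bits   ! extra bits
    integer(kind=8) :: i      ! id
  end type

  type(particle) :: one
  integer :: nbytes

  ! storage_size returns the size in bits of one element.
  nbytes = storage_size(one) / 8

  print *, 'bytes per particle:        ', nbytes
  print *, '4-byte words per particle: ', nbytes / 4
end program check_particle_size
```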

I can easily have ~4 GB of particles on my C1060. The particle array is sorted and organized in cells. I have an index array gi(nx,ny,nz) that tells at which index in gp the particles of each cell sit.

I need to copy a z-stick of particles from the GPU to the CPU, that is, all particles in cells with a certain (jx,jy) coordinate. To minimize host-GPU data transfers, I have a routine that selects the particles sitting in cells with coordinate (jx,jy) in the global array gp and copies them to a contiguous array gs. That array can then be transferred to the CPU.

Apart from the index juggling, what I want is to copy particles from gp to gs in coalesced transfers.

Right now I have

```
jz = ... ! index in z-column
np = ... ! nr of particles in the (jx,jy,jz)-cell
offp = ... ! offset in gp array for cell (jx,jy,jz)
offs = ... ! offset in gs array to copy to
it = threadidx%x
if (it <= np) then
  ip = it + offp ! index in global array of particle it in cell (jx,jy,jz)
  is = it + offs ! index in stick array
  gs(is) = gp(ip)
endif
```

Essentially, each thread copies one particle. That means each thread transfers 48 bytes and consecutive threads access addresses 48 bytes apart, which is terrible for coalescing.

My question is how to do this in such a way that each thread transfers 4-byte blocks and the CUDA hardware is kept happy?

I could only think of two ways, neither of which is supported with the current standard.

If routines were not strongly typed, I would be much better off. Then instead of

type(particle), dimension(np_total) :: gp ! global particle array

type(particle), dimension(np_stick) :: gs ! stick particle array

I could do

integer, dimension(np_total*12) :: gp ! global particle array
integer, dimension(np_stick*12) :: gs ! stick particle array

and copy away.
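A minimal sketch of what that word-wise copy could look like in CUDA Fortran, assuming the flat integer arrays were available. The module name `stick_copy`, the kernel name `copy_cell_words`, the parameter `nw` and the strided loop are illustrative choices, not existing code; `offp`, `offs` and `np` are the per-cell values from the snippet above, and one thread block is assumed to handle one cell.

```fortran
module stick_copy
  use cudafor
  implicit none
  integer, parameter :: nw = 12   ! 4-byte words per 48-byte particle
contains
  ! Hypothetical kernel: igp/igs are gp/gs seen as flat arrays of 4-byte words.
  attributes(global) subroutine copy_cell_words(igp, igs, offp, offs, np)
    integer, device :: igp(*), igs(*)
    integer, value  :: offp, offs, np   ! particle offsets and particle count
    integer :: iw

    ! Particle offp+1 starts at word offp*nw + 1, so word iw of the cell
    ! sits at offp*nw + iw. Consecutive threads handle consecutive words,
    ! so each warp reads and writes one contiguous stretch of memory.
    do iw = threadidx%x, np*nw, blockdim%x
      igs(offs*nw + iw) = igp(offp*nw + iw)
    end do
  end subroutine copy_cell_words
end module stick_copy
```

The loop stride of `blockdim%x` lets a block of any size cover all `np*nw` words of the cell, so the block size does not have to match the number of particles.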

Alternatively, with equivalence in my variable declarations, I could do something like:

type(particle), dimension(np_total) :: gp ! global particle array

type(particle), dimension(np_stick) :: gs ! stick particle array

integer, dimension(np_total*12) :: igp ! global particle array
integer, dimension(np_stick*12) :: igs ! stick particle array

equivalence (igp, gp)

equivalence (igs, gs)

and the problem would be solved too. In general, an implementation of equivalence would help a lot with memory transfers.

Any suggestions on how to accomplish coalesced transfers on type’d variables?

Thanks in advance,

Troels