Coalesced structures possible?

I’m a CUDA beginner and have a question about coalesced global memory access and structures, because I think this is the bottleneck in my program…

I have a “Cell” struct with one 4-byte int, three floats and one float[9], so 13*4 = 52 bytes, right?

struct Cell{
  int solid;
  float ux;
  float uy;
  float rho;
  float f[9];
};
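
For reference, here is a quick compile-time check I could add to make sure there is no hidden padding (assuming a C++11 compiler):

// 1 int + 12 floats, all 4-byte aligned, so no padding is expected
static_assert(sizeof(Cell) == 13 * sizeof(float), "Cell should be 52 bytes");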

Every thread in my kernel reads one of these Cells, with:

Cell currentCell = d_cells[id];

As far as I have read in some forums and guides, it would be better to split this structure up into arrays, so instead of

Cell d_cells[numCells];

I should use something like:

int solid[numCells];
float ux[numCells];
float uy[numCells];
...
float f8[numCells];
float f9[numCells];

Is there a way to avoid this? It’s much more comfortable with the one Cell array instead of 13 int/float arrays, sigh. And what is the difference between the two ways? I tried to use something like padding to end up with a 64-byte struct… but that slowed down my computation…
And I’m using compute capability 5.2. Does that make a difference? Because everything I could find only mentions a difference between 1.x and 2.x…
And what about writing to global memory? Do the reads and writes have to be coalesced separately or do I have to consider them together?

thanks in advance
cat

  1. The CUDA manual describes the coalescing rules up to compute capability 3.x, but 2.x and 3.x didn’t change much, and 5.x is probably the same as 3.x.
  2. float f[9][numCells] should work too, if numCells is a compile-time constant (see the sketch after this list).
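
For example, a minimal SoA sketch with the same fields as your Cell struct (NUM_CELLS is a placeholder compile-time constant, everything else is just illustrative):

#define NUM_CELLS 65536            // placeholder; must be known at compile time for the 2D form

struct CellsSoA {                  // one struct that bundles all the arrays
    int   solid[NUM_CELLS];
    float ux[NUM_CELLS];
    float uy[NUM_CELLS];
    float rho[NUM_CELLS];
    float f[9][NUM_CELLS];         // f[k][id]: consecutive ids are adjacent in memory
};

__device__ CellsSoA d_cellsSoA;    // statically allocated; cudaMalloc'ing a CellsSoA works too

In a kernel, currentCell.f[k] = d_cellsSoA.f[k][id]; then reads consecutive addresses across a warp for a fixed k.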

Now, how it works: “Cell currentCell = d_cells[id]” is translated into a series of assignments:

currentCell.solid = d_cells[id].solid

Even if id is sequential in sequential threads, the 32 threads reading the “solid” field read from addresses x, x+52, x+104, …, which means each address has to be read separately - i.e. reading “solid” alone needs 32 memory operations, and reading the entire currentCell needs as many as 13*32 = 416 of them.
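
A tiny host-side illustration of those addresses (just a sketch; offsetof(Cell, solid) happens to be 0 here, but the idea is the same for every field):

#include <cstddef>
#include <cstdio>

struct Cell { int solid; float ux, uy, rho; float f[9]; };   // same layout as above

int main() {
    // byte offsets at which the first few threads of a warp read the "solid" field (AoS layout)
    for (int i = 0; i < 4; ++i)
        printf("thread %d reads at offset %zu\n", i, i * sizeof(Cell) + offsetof(Cell, solid));
    return 0;   // prints 0, 52, 104, 156 - a 52-byte stride, so the reads can't be merged
}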

OTOH, if you declare solid[numCells] and run currentCell.solid = solid[id], this statement reads from addresses x, x+4, x+8, …, so the entire 32 threads read 128 sequential bytes, which is just one memory operation (or two operations if your data aren’t 128-byte aligned).
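
A minimal kernel sketch of that SoA read pattern, reusing the Cell struct from the first post (parameter names are illustrative):

__global__ void readCells(const int* solid, const float* ux, const float* uy,
                          const float* rho, int numCells)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= numCells) return;

    Cell currentCell;
    currentCell.solid = solid[id];   // 32 consecutive ints -> 128 contiguous bytes per warp
    currentCell.ux    = ux[id];      // each of these lines is one (or two) coalesced transactions
    currentCell.uy    = uy[id];
    currentCell.rho   = rho[id];
    // f[0..8] would come from nine more arrays (or from float f[9][numCells])
    // ... work with currentCell ...
}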

Although the currentCell = d_cells[id] assignment essentially reads the same 13*128 bytes across the 32 threads of a warp, the GPU can’t coalesce these operations, since the first 52 bytes have to go to the first thread and so on. It’s possible to reorder the data through shared memory if you don’t want to use SoA, but of course it’s somewhat slower and needs shared memory. If currentCell should be kept in shared memory anyway, it costs just an extra __syncthreads() call.
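
A rough sketch of that shared-memory reordering, reusing the Cell struct from the first post and copying each block’s Cells as plain 4-byte words so the global loads coalesce (BLOCK is a placeholder block size; the tail block and error checking are left out):

#define BLOCK 128

__global__ void kernelStaged(const Cell* d_cells, int numCells)
{
    __shared__ Cell s_cells[BLOCK];   // 128 * 52 bytes = 6.5 KB of shared memory

    // Re-read the block's Cells as raw 4-byte words: consecutive threads load consecutive words.
    const unsigned int* src = reinterpret_cast<const unsigned int*>(d_cells + blockIdx.x * BLOCK);
    unsigned int*       dst = reinterpret_cast<unsigned int*>(s_cells);
    int wordsPerBlock = BLOCK * sizeof(Cell) / sizeof(unsigned int);   // BLOCK * 13

    for (int i = threadIdx.x; i < wordsPerBlock; i += blockDim.x)
        dst[i] = src[i];              // coalesced (assumes numCells is a multiple of BLOCK)

    __syncthreads();                  // the extra sync mentioned above

    Cell currentCell = s_cells[threadIdx.x];   // per-thread copy now comes from shared memory
    // ... work with currentCell ...
}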

thank you so far :)

So you propose that I could allocate the memory on the device as separate arrays and load them into a Cell struct in shared memory? Sounds like a good idea :)

If I want to use float[9][numCells], how would I allocate and copy that from host to device? With the cudaMallocPitch function? I was told that it is difficult to use double pointers (float**) on the device…
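
Or could I just use one flat allocation and index it as f[k * numCells + id]? Something like this (just a sketch of what I have in mind, not sure it’s the right way):

int    numCells = 10000;                                   // example size
size_t bytes    = 9 * numCells * sizeof(float);

float* h_f = (float*)malloc(bytes);                        // h_f[k * numCells + id] = f[k] of cell id
float* d_f = NULL;
cudaMalloc(&d_f, bytes);

// ... fill h_f ...

cudaMemcpy(d_f, h_f, bytes, cudaMemcpyHostToDevice);

// in the kernel: currentCell.f[k] = d_f[k * numCells + id];   // coalesced over id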

Maybe these two links are relevant to your problem

trove library, full speed access to array of structures

and a paper called “Abstraction for AoS and SoA Layout in C++”
http://asc.ziti.uni-heidelberg.de/sites/default/files/research/papers/info/St11ASX_CUDA.htm

can anyone propose better links?