Memory Coalescing

I have a variation of the CUDA N-body example: instead of one sub-particle type (one atom) per particle, I have two sub-particle types (two atoms) for each particle (molecule). I use the float3 type to store the x, y, z co-ordinates, and I have laid them out in memory as follows:

atom0_mol0 x, atom0_mol0 y, atom0_mol0 z
atom1_mol0 x, atom1_mol0 y, atom1_mol0 z

atom0_mol1 x, atom0_mol1 y, atom0_mol1 z
atom1_mol1 x, atom1_mol1 y, atom1_mol1 z

As in the N-body example, each thread uses its global thread ID to store the co-ordinates of the two atoms in a molecule, and the threads in a block share the task of copying a molecule's co-ordinates to shared memory (which all the threads access later). Despite trying a few different access patterns for the loads from global memory, I have not been able to achieve coalesced accesses (as reported by the Compute Visual Profiler). Any hints would be appreciated.

You will probably have to temporarily recast to a different type (float, float2, or float4) to achieve the necessary reshuffling.

I replaced float3 with float4, but that did not alleviate the coalescing problem. I also tried the following storage pattern:

atom0_mol0 x atom0_mol0 y atom0_mol0 z
atom0_mol1 x atom0_mol1 y atom0_mol1 z
…

followed by

atom1_mol0 x atom1_mol0 y atom1_mol0 z
atom1_mol1 x atom1_mol1 y atom1_mol1 z
…

I was thinking that in this case each thread would read contiguous locations when retrieving the same atom type. However, CVP still reports non-coalesced accesses.

Casting to float4 alone does not solve the issue. You also have to reassign the accesses to threads so that each thread reads 4 consecutive floats from memory, with the next thread reading the following 4 floats, and so on. The compiler will then usually be clever enough to combine these into a single 128-bit-per-thread memory transaction.
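A hedged sketch of what that reassignment might look like (kernel and variable names are illustrative, not from the original code): the coordinate array is viewed as a flat float4 array during the copy into shared memory, so consecutive threads read consecutive 16-byte chunks regardless of the molecule/atom structure.

```cuda
// Illustrative sketch (assumed names): stage the block's coordinate
// data into shared memory via float4 loads. Thread t reads float4
// element t, then t + blockDim.x, and so on, so a warp's 32 loads
// cover one contiguous 512-byte run of global memory.
__global__ void stageMolecules(const float *gCoords, int nFloats)
{
    extern __shared__ float sCoords[];

    // Reinterpret both buffers as float4 so each thread moves 16 bytes
    // per load. Assumes gCoords is 16-byte aligned and nFloats is a
    // multiple of 4 (pad the allocation otherwise).
    const float4 *gVec = reinterpret_cast<const float4 *>(gCoords);
    float4       *sVec = reinterpret_cast<float4 *>(sCoords);

    int nVec = nFloats / 4;
    for (int i = threadIdx.x; i < nVec; i += blockDim.x)
        sVec[i] = gVec[i];

    __syncthreads();
    // ... threads now index sCoords by molecule/atom as before ...
}
```

The key point is that the thread-to-data mapping used for the copy is decoupled from the thread-to-molecule mapping used for the computation; shared memory absorbs the reshuffle.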

If you are free to rearrange the data in memory, then the simplest thing is probably to change the layout to this one:

atom0_mol0 x atom0_mol1 x atom0_mol2 x …
atom1_mol0 x atom1_mol1 x atom1_mol2 x …

atom0_mol0 y atom0_mol1 y atom0_mol2 y …
atom1_mol0 y atom1_mol1 y atom1_mol2 y …

atom0_mol0 z atom0_mol1 z atom0_mol2 z …
atom1_mol0 z atom1_mol1 z atom1_mol2 z …

It will not automatically give you 64-bit or 128-bit per-thread transactions, but it does give coalesced and bank-conflict-free memory accesses without further thought.
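A hedged sketch of that structure-of-arrays layout in use (array and kernel names are illustrative): with separate x/y/z arrays per atom type, thread t's load of element t sits directly next to thread t+1's load of element t+1, which is the textbook coalesced pattern.

```cuda
// Illustrative SoA layout (assumed names). Each array holds one
// coordinate of one atom type for all molecules.
__global__ void loadAtom0(const float *atom0X, const float *atom0Y,
                          const float *atom0Z, float3 *out, int nMol)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < nMol) {
        // Three coalesced 32-bit loads per thread: a warp's 32 loads
        // of atom0X cover one contiguous 128-byte segment, and the
        // same holds for atom0Y and atom0Z.
        out[t] = make_float3(atom0X[t], atom0Y[t], atom0Z[t]);
    }
}
```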

Thanks for the reply. Just to make sure I understand it right: on the NVIDIA Fermi architecture a warp accesses memory concurrently, so for the row

atom0_mol0 x atom0_mol1 x atom0_mol2 x … atom0_mol31 x

if I have 32 or more threads in a block, then within a warp thread 0 accesses atom0_mol0 x, thread 1 accesses atom0_mol1 x, and so on, and these accesses will be grouped into a single (32 × 32-bit)/8 = 128 B transaction?

Similarly for the other atoms and co-ordinates?