I have a variation of the CUDA N-body example: instead of one sub-particle type (one atom) per particle, I have two sub-particle types (two atoms) per particle (molecule). I use the float3 type to store the x, y, z coordinates. I have stored the coordinates in memory as follows:
atom0_mol0 x, atom0_mol0 y, atom0_mol0 z
atom1_mol0 x, atom1_mol0 y, atom1_mol0 z
atom0_mol1 x, atom0_mol1 y, atom0_mol1 z
atom1_mol1 x, atom1_mol1 y, atom1_mol1 z
Similar to the N-body example, each thread uses its global thread ID to load the coordinates of the two atoms in its molecule, and the threads in a block then share the task of copying molecule coordinates into shared memory (which all the threads will access later), much like the N-body example. Despite trying a few different access patterns for the global-memory loads that retrieve the coordinates, I have not been able to achieve coalesced accesses (as reported by the Compute Visual Profiler). Any hints on this will be appreciated.
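For concreteness, here is a minimal sketch of the access pattern described above (the kernel shape and names are my own, not the poster's actual code). With two float3s per molecule, adjacent threads start their loads 24 bytes apart, which is why the warp's accesses cannot be serviced as a single transaction:

```cuda
#include <cuda_runtime.h>

// Interleaved (AoS) layout: atom0_mol0, atom1_mol0, atom0_mol1, atom1_mol1, ...
// Each molecule occupies two consecutive float3 slots.
__global__ void loadMolecules(const float3 *pos, int nMol)
{
    // Launched with 2 * blockDim.x * sizeof(float3) dynamic shared memory.
    extern __shared__ float3 shPos[];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < nMol) {
        // Thread tid reads elements 2*tid and 2*tid + 1. Adjacent threads
        // start 24 bytes apart, and float3 is itself an unaligned 12-byte
        // type, so a warp's 32 loads scatter over 32 * 24 = 768 bytes
        // rather than one contiguous 128-byte segment.
        shPos[2 * threadIdx.x]     = pos[2 * tid];      // atom 0
        shPos[2 * threadIdx.x + 1] = pos[2 * tid + 1];  // atom 1
    }
    __syncthreads();
    // ... force computation using shPos would follow here ...
}
```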
I replaced float3 with float4, but that did not alleviate the coalescing problem. I also tried the following storage pattern:
atom0_mol0 x atom0_mol0 y atom0_mol0 z
atom0_mol1 x atom0_mol1 y atom0_mol1 z
and so on
followed by
atom1_mol0 x atom1_mol0 y atom1_mol0 z
atom1_mol1 x atom1_mol1 y atom1_mol1 z
and so on
My thinking was that with this layout, consecutive threads would read contiguous locations when retrieving the same atom type. CVP still reports non-coalesced accesses, though.
Casting to float4 alone does not solve the issue. You also have to reassign the accesses to threads so that each thread reads 4 consecutive floats from memory, with the next thread reading the following 4 floats, and so on. The compiler will then usually be clever enough to replace these with a single 128-bit-per-thread memory transaction.
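A sketch of what that reassignment could look like (names and tile scheme are my own assumptions, not from the original code): treat the float4-padded molecule array as a flat array of float4s and have consecutive threads copy consecutive float4s into shared memory, decoupling the copy pattern from the one-thread-per-molecule compute pattern:

```cuda
#include <cuda_runtime.h>

// Layout assumed: one float4 per atom, two float4s per molecule, molecules
// stored back to back. pos has 2 * nMol float4 elements.
__global__ void loadTileCoalesced(const float4 *pos, int nMol)
{
    // Launched with 2 * blockDim.x * sizeof(float4) dynamic shared memory.
    extern __shared__ float4 tile[];
    int nElems = 2 * blockDim.x;               // float4s in this block's tile
    int base   = 2 * blockIdx.x * blockDim.x;  // first float4 of the tile
    // Consecutive threads read consecutive 16-byte float4s, so each warp
    // touches one contiguous run of memory and every 128-byte segment it
    // requests is fully used.
    for (int i = threadIdx.x; i < nElems && base + i < 2 * nMol; i += blockDim.x)
        tile[i] = pos[base + i];
    __syncthreads();
    // Compute phase: thread t's molecule is tile[2*t] (atom 0) and
    // tile[2*t + 1] (atom 1) -- shared memory is fast regardless of stride,
    // as long as bank conflicts are avoided.
}
```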
If you are free to rearrange the data in memory, then the simplest thing probably is to just change the layout to this one:
atom0_mol0 x atom0_mol1 x atom0_mol2 x …
atom1_mol0 x atom1_mol1 x atom1_mol2 x …
…
atom0_mol0 y atom0_mol1 y atom0_mol2 y …
atom1_mol0 y atom1_mol1 y atom1_mol2 y …
…
atom0_mol0 z atom0_mol1 z atom0_mol2 z …
atom1_mol0 z atom1_mol1 z atom1_mol2 z …
…
It will not automatically get you 64-bit or 128-bit per-thread transactions, but it gives coalesced and bank-conflict-free memory accesses without further thought.
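In code, the transposed (structure-of-arrays) layout above could be read like this (a sketch under my own naming; the six arrays could equally be slices of one big allocation with stride nMol):

```cuda
#include <cuda_runtime.h>

// SoA layout matching the listing above: one array of nMol floats per
// (atom, component) pair.
struct MoleculeSoA {
    float *a0x, *a0y, *a0z;   // atom 0 components
    float *a1x, *a1y, *a1z;   // atom 1 components
};

__global__ void forceKernel(MoleculeSoA pos, int nMol)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < nMol) {
        // Thread t reads element t of each array, so a warp touches 32
        // consecutive floats per load: exactly one 128-byte segment.
        // A stride-1 copy of these into shared memory is also free of
        // bank conflicts.
        float3 atom0 = make_float3(pos.a0x[tid], pos.a0y[tid], pos.a0z[tid]);
        float3 atom1 = make_float3(pos.a1x[tid], pos.a1y[tid], pos.a1z[tid]);
        // ... force computation using atom0 and atom1 ...
        (void)atom0; (void)atom1;
    }
}
```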
Thanks for the reply. Just to make sure I understand it right: on the NVIDIA Fermi architecture, a warp accesses memory concurrently, so with the layout
atom0_mol0 x atom0_mol1 x atom0_mol2 x … atom0_mol31 x
and 32 or more threads per block, thread 0 of a warp accesses atom0_mol0 x, thread 1 accesses atom0_mol1 x, and so on. Will these accesses then be grouped into a single (32 × 32-bit) / 8 = 128-byte transaction?