I have a variation of the CUDA N-body example: instead of one sub-particle type (one atom) per particle, I have two sub-particle types (two atoms) per particle (molecule). I use the float3 type to store the x, y, z coordinates. I have stored the coordinates in memory as follows:
atom0_mol0 x, atom0_mol0 y, atom0_mol0 z
atom1_mol0 x, atom1_mol0 y, atom1_mol0 z
atom0_mol1 x, atom0_mol1 y, atom0_mol1 z
atom1_mol1 x, atom1_mol1 y, atom1_mol1 z
Similar to the N-body example, each thread uses its global thread ID to load the coordinates of the two atoms in its molecule, and the threads in a block then share the task of copying molecule coordinates into shared memory (which all the threads will access later), much like the N-body example. Despite trying a few different access patterns for the global-memory loads that retrieve the coordinates, I have not been able to achieve coalesced accesses (as reported by the Compute Visual Profiler). Any hints on this will be appreciated.
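For concreteness, here is a minimal sketch of the access pattern described above (the kernel shape and names are my own, not the poster's actual code). With two float3s per molecule, adjacent threads start their loads 24 bytes apart, which is why the warp's accesses cannot be serviced as a single transaction:

```cuda
#include <cuda_runtime.h>

// Interleaved (AoS) layout: atom0_mol0, atom1_mol0, atom0_mol1, atom1_mol1, ...
// Each molecule occupies two consecutive float3 slots.
__global__ void loadMolecules(const float3 *pos, int nMol)
{
    // Launched with 2 * blockDim.x * sizeof(float3) dynamic shared memory.
    extern __shared__ float3 shPos[];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < nMol) {
        // Thread tid reads elements 2*tid and 2*tid + 1. Adjacent threads
        // start 24 bytes apart, and float3 is itself an unaligned 12-byte
        // type, so a warp's 32 loads scatter over 32 * 24 = 768 bytes
        // rather than one contiguous 128-byte segment.
        shPos[2 * threadIdx.x]     = pos[2 * tid];      // atom 0
        shPos[2 * threadIdx.x + 1] = pos[2 * tid + 1];  // atom 1
    }
    __syncthreads();
    // ... force computation using shPos would follow here ...
}
```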
I replaced float3 with float4, but that did not alleviate the coalescing problem. I also tried the following storage pattern:
atom0_mol0 x atom0_mol0 y atom0_mol0 z
atom0_mol1 x atom0_mol1 y atom0_mol1 z
and so on
followed by
atom1_mol0 x atom1_mol0 y atom1_mol0 z
atom1_mol1 x atom1_mol1 y atom1_mol1 z
and so on
My thinking was that with this layout, consecutive threads would read contiguous locations when retrieving the same atom type. CVP still reports non-coalesced accesses, though.
Casting to float4 alone does not solve the issue. You also have to reassign the accesses to threads so that each thread reads 4 consecutive floats from memory, with the next thread reading the following 4 floats, and so on. The compiler will then usually be clever enough to replace these with a single 128-bit-per-thread memory transaction.
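A sketch of what that reassignment could look like (names and tile scheme are my own assumptions, not from the original code): treat the float4-padded molecule array as a flat array of float4s and have consecutive threads copy consecutive float4s into shared memory, decoupling the copy pattern from the one-thread-per-molecule compute pattern:

```cuda
#include <cuda_runtime.h>

// Layout assumed: one float4 per atom, two float4s per molecule, molecules
// stored back to back. pos has 2 * nMol float4 elements.
__global__ void loadTileCoalesced(const float4 *pos, int nMol)
{
    // Launched with 2 * blockDim.x * sizeof(float4) dynamic shared memory.
    extern __shared__ float4 tile[];
    int nElems = 2 * blockDim.x;               // float4s in this block's tile
    int base   = 2 * blockIdx.x * blockDim.x;  // first float4 of the tile
    // Consecutive threads read consecutive 16-byte float4s, so each warp
    // touches one contiguous run of memory and every 128-byte segment it
    // requests is fully used.
    for (int i = threadIdx.x; i < nElems && base + i < 2 * nMol; i += blockDim.x)
        tile[i] = pos[base + i];
    __syncthreads();
    // Compute phase: thread t's molecule is tile[2*t] (atom 0) and
    // tile[2*t + 1] (atom 1) -- shared memory is fast regardless of stride,
    // as long as bank conflicts are avoided.
}
```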
If you are free to rearrange the data in memory, then the simplest thing probably is to just change the layout to this one:
atom0_mol0 x atom0_mol1 x atom0_mol2 x …
atom1_mol0 x atom1_mol1 x atom1_mol2 x …
…
atom0_mol0 y atom0_mol1 y atom0_mol2 y …
atom1_mol0 y atom1_mol1 y atom1_mol2 y …
…
atom0_mol0 z atom0_mol1 z atom0_mol2 z …
atom1_mol0 z atom1_mol1 z atom1_mol2 z …
…
It will not automatically get you 64-bit or 128-bit per-thread transactions, but it gives coalesced and bank-conflict-free memory accesses without further thought.
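In code, the transposed (structure-of-arrays) layout above could be read like this (a sketch under my own naming; the six arrays could equally be slices of one big allocation with stride nMol):

```cuda
#include <cuda_runtime.h>

// SoA layout matching the listing above: one array of nMol floats per
// (atom, component) pair.
struct MoleculeSoA {
    float *a0x, *a0y, *a0z;   // atom 0 components
    float *a1x, *a1y, *a1z;   // atom 1 components
};

__global__ void forceKernel(MoleculeSoA pos, int nMol)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < nMol) {
        // Thread t reads element t of each array, so a warp touches 32
        // consecutive floats per load: exactly one 128-byte segment.
        // A stride-1 copy of these into shared memory is also free of
        // bank conflicts.
        float3 atom0 = make_float3(pos.a0x[tid], pos.a0y[tid], pos.a0z[tid]);
        float3 atom1 = make_float3(pos.a1x[tid], pos.a1y[tid], pos.a1z[tid]);
        // ... force computation using atom0 and atom1 ...
        (void)atom0; (void)atom1;
    }
}
```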
Thanks for the reply. Just to make sure I understand it right: on the NVIDIA Fermi architecture, a warp accesses memory concurrently, so with the layout
atom0_mol0 x atom0_mol1 x atom0_mol2 x … atom0_mol31 x
and 32 or more threads per block, thread 0 of a warp accesses atom0_mol0 x, thread 1 accesses atom0_mol1 x, and so on. Will these accesses then be grouped into a single (32 × 32-bit) / 8 = 128-byte transaction?