Practical rules for coalesced memory access ?

ebfe · September 12, 2008, 5:43pm

Hi,

I’ve read the CUDA programming guide about coalesced memory access but I don’t quite get it. Maybe someone can help me with this practical problem:

My kernel threads access a struct as input consisting of four elements, each of which is made up of five 32-bit long integers: 20 longs == 80 bytes of input per thread. You can think of this struct as…

    typedef struct {
        unsigned long v0, v1, v2, v3, v4;
    } BUFFER_STRUCT;

    typedef struct {
        BUFFER_STRUCT top_buffer;
        BUFFER_STRUCT bot_buffer;
        BUFFER_STRUCT left_buffer;
        BUFFER_STRUCT right_buffer;
    } gpu_inbuffer;

There is exactly one gpu_inbuffer per thread and every thread will access the members of gpu_inbuffer (and BUFFER_STRUCT) in the same way - there is no divergence.

The Visual Profiler tells me that I get tons of uncoalesced memory accesses, and this is the only global memory struct to blame. Could someone be so kind as to guide me on how to modify this struct so the GPU can access its members with less latency? I don’t get how to apply the rules of the programming guide to my case :-(

tmurray · September 12, 2008, 6:16pm

Make arrays of longs instead of arrays of structs. Let’s see if I can make a really dumb ASCII diagram…

Right now, if 1 are the regions of memory you’re reading at any given cycle and 0 are regions you’re not reading, your access pattern is probably something like

100000000000000000001000000000…

It might actually be 3 zeroes for every one (in case you can read the entire structure as one operation)–I’d have to double check how structures are accessed. Regardless, it’s very uncoalesced right now. You could do shared memory trickery to try to deal with some of this, but it’s silly.
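To make the stride in that diagram concrete, here is a minimal C sketch of the address arithmetic for the array-of-structs layout. It assumes `uint32_t` as a stand-in for the 32-bit `unsigned long` described in the first post; the helper `aos_offset_top_v0` is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: uint32_t stands in for the 32-bit "unsigned long" of the post. */
typedef struct {
    uint32_t v0, v1, v2, v3, v4;
} BUFFER_STRUCT;

typedef struct {
    BUFFER_STRUCT top_buffer;
    BUFFER_STRUCT bot_buffer;
    BUFFER_STRUCT left_buffer;
    BUFFER_STRUCT right_buffer;
} gpu_inbuffer;

/* Hypothetical helper: byte offset of top_buffer.v0 for thread `tid`
 * when the input is one gpu_inbuffer per thread, stored back to back. */
static size_t aos_offset_top_v0(size_t tid)
{
    return tid * sizeof(gpu_inbuffer)
         + offsetof(gpu_inbuffer, top_buffer)
         + offsetof(BUFFER_STRUCT, v0);
}
```

With this layout, neighbouring threads that read the same member touch addresses one whole struct apart (80 bytes under the assumption above) instead of adjacent 4-byte words.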

If you make arrays of longs so that you have v0_top for all threads contiguous, etc., your access patterns will be perfectly coalesced (assuming you start on a correctly aligned boundary, etc).

ebfe · September 12, 2008, 6:37pm

Well, the struct is only there for convenience (named addresses). It’s made up of a single data type, so it should compile to the same code as arrays.

The layout in memory now is something like this:

[Thread1.bottom;Thread1.top;Thread1.left;Thread1.right][Thread2.bottom;Thread2.top…]

Do I get you right that the layout should be

[Thread1.bottom;Thread2.bottom;Thread3.bottom…][Thread1.top;Thread2.top;Thread3.top]…

tmurray · September 12, 2008, 6:44pm

Sure, it will compile as an array, but you’re not reading contiguous memory in a given cycle; you’re reading contiguous structs, sure, but they’re relatively far apart in memory and won’t be coalesced as a result.

I’m fairly sure Thread1.bottom, Thread2.bottom, Thread3.bottom won’t be coalesced either since they’re above 16 bytes per struct. Organizing your data as Thread1.bottom.v0, Thread2.bottom.v0, … Thread1.bottom.v1, Thread2.bottom.v1, etc will get you perfect coalescing, though.
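The layout described above can be sketched as a single index function. This is only an illustration in plain C: the name `soa_index`, the constant `WORDS_PER_THREAD`, and the numbering of the 20 words as fields 0..19 are assumptions, not something from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* 4 buffers x 5 words per thread, as described in the first post. */
enum { WORDS_PER_THREAD = 20 };

/* Hypothetical helper: index of word `field` for thread `tid` when all
 * threads' copies of one field are stored contiguously, i.e.
 * [T0.f0, T1.f0, ..., T(n-1).f0][T0.f1, T1.f1, ...] and so on. */
static size_t soa_index(size_t field, size_t tid, size_t n_threads)
{
    assert(field < WORDS_PER_THREAD);
    assert(tid < n_threads);
    return field * n_threads + tid;
}
```

For a fixed `field`, consecutive `tid` values map to consecutive indices, which is the contiguous per-field pattern the post asks for. A kernel would then read something like `in[soa_index(f, tid, n_threads)]` rather than a struct member.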

AndreiB · September 13, 2008, 7:15am

ebfe
Just my 2 cents. Given the nature of the kernel you’re optimizing (WPA-PSK, if I get it right) I would not pay much attention to uncoalesced access. You can gain a few microseconds but you won’t even notice it compared to the overall running time.

As tmurray said, [Thread1.bottom;Thread2.bottom;Thread3.bottom…][Thread1.top;Thread2.top;Thread3.top]… won’t be coalesced. Reading Thread.bottom will most likely be split into two reads: v0-v3 and v4. Re-organizing the data layout to [Thread1.bottom.v0;Thread2.bottom.v0;Thread3.bottom.v0…][Thread1.bottom.v1;Thread2.bottom.v1;Thread3.bottom.v1]… will ensure coalescing. But again, I’m pretty sure that for your particular kernel the cost of re-organizing the data on the host side will be higher than the penalty for uncoalesced access.
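For reference, the host-side re-organization mentioned here amounts to a transpose of the input. The following is a rough C sketch, assuming `uint32_t` for the 32-bit longs and a struct with no padding; `aos_to_soa` and `WORDS_PER_THREAD` are hypothetical names for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { WORDS_PER_THREAD = 20 };  /* 4 buffers x 5 words */

/* Assumption: uint32_t stands in for the 32-bit "unsigned long" of the post. */
typedef struct {
    uint32_t v0, v1, v2, v3, v4;
} BUFFER_STRUCT;

typedef struct {
    BUFFER_STRUCT top_buffer;
    BUFFER_STRUCT bot_buffer;
    BUFFER_STRUCT left_buffer;
    BUFFER_STRUCT right_buffer;
} gpu_inbuffer;

/* Hypothetical helper: copy n per-thread structs into a field-major array.
 * dst must have room for n * WORDS_PER_THREAD words. Field f of thread t
 * ends up at dst[f * n + t], with fields numbered in declaration order
 * (top.v0 .. top.v4, bot.v0 .. bot.v4, left..., right...). */
static void aos_to_soa(const gpu_inbuffer *src, uint32_t *dst, size_t n)
{
    uint32_t words[WORDS_PER_THREAD];

    /* Guard the no-padding assumption before treating a struct as 20 words. */
    assert(sizeof(gpu_inbuffer) == sizeof words);

    for (size_t t = 0; t < n; ++t) {
        memcpy(words, &src[t], sizeof words);
        for (size_t f = 0; f < WORDS_PER_THREAD; ++f)
            dst[f * n + t] = words[f];
    }
}
```

This is one extra pass over the input (20 word copies per thread) before the transfer to the device. Whether that outweighs the uncoalesced reads depends on how much work the kernel does per input, which is the trade-off being pointed out.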
