Question about coalesced memory access

Hi!

I’m new to CUDA and need your help. I’m trying to understand what coalesced memory access means and how to use it. My kernel works with 16-bit unsigned short arrays. The programming guide says that a per-thread access of 2 bytes is coalesced if all words lie in a 64-byte segment. But why a 64-byte segment? Wouldn’t a 32-byte segment be big enough?

In my case they lie in a 32-byte segment; a 64-byte segment would contain data for 32 threads. Would that still be coalesced?

Thanks in advance!

If you want to read 2-byte data with 32 threads (that’s the warp size, and coalescing is done per warp anyway), and every thread accesses a different 2-byte element, then for full coalescing those elements would need to lie within one 64-byte (32 * 2) region of memory. If several threads access the same element (say threads 0 and 1 access element 0, threads 2 and 3 access element 1, and so on), it would still be done with one read.
If one half of your threads read their data from one 64-byte segment and the other half from another 64-byte segment, it would be done in 2 memory fetch operations. Generally: if a warp reads 2-byte data from n different 64-byte segments (no matter which thread accesses which element), n memory fetch operations are issued.
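To make the fully coalesced case concrete, here is a minimal sketch (the kernel and names are my own, just for illustration): thread i of a warp reads bytes 2*i and 2*i+1, so the whole warp covers exactly one 64-byte segment, assuming the array is aligned (which cudaMalloc guarantees).

__global__ void copyShorts(const unsigned short *in, unsigned short *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // consecutive threads read consecutive 2-byte words
}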

Best regards
Ceearem

Thanks for this answer!

Are you sure? I thought only a HALF warp accesses a segment!?

regards!

What compute capability are you on? I don’t think CC < 1.2 will coalesce this at all (though I’ve only ever worked on CC 1.3, so I’m not very familiar with it).

For CC >= 1.2, Ceearem’s explanation is correct, except that in most places where it reads “warp” it should read “half warp”.

EDIT: Further info (CC >= 1.2) - if you are aligned and accessing a 32-byte segment with a half warp, the memory transaction will be only 32 bytes rather than 64, as there is a mechanism to reduce the transaction size. Allowing 2-byte data to sit anywhere in a 64-byte segment lets you do better if you are unaligned or have a stride.
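If you want to see the alignment effect yourself, a simple experiment (my own sketch, not from the guide) is to shift the starting index by an offset and compare bandwidth: with offset = 0 the half warp’s reads fall into the minimum number of segments, while a nonzero offset makes them straddle a segment boundary and costs an extra transaction.

__global__ void offsetRead(const float *in, float *out, int offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + offset < n)
        out[i] = in[i + offset];  // offset != 0 breaks segment alignment
}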

You are right, so far it’s a half warp. There is a warning, though, that this will change in future architectures (and since they probably don’t mean years from now, I think it’s likely that it might already change with the next generation of GPUs due at the end of this year).

Also, all of this is only true (as Tigga wrote) if you have compute capability 1.2 or higher. For compute capabilities below that, there are more requirements (i.e. threads have to access words in strict order [I’m not sure whether that even means 4-byte order, that is, the first thread accesses the first 4 bytes, the second the next 4, and so on]). If any of these requirements is not fulfilled, there will be a single memory fetch operation for each and every thread.

Best regards

Ceearem

Sorry, but I don’t understand.

If I have an array of 16 floats (16 * 4 bytes), a HALF warp can transfer all the data in one memory transaction because it’s in one segment (the programming guide says a 128-byte segment). But 16 * 4 = 64 bytes, so a 64-byte segment would be big enough!

A 128-byte segment for 4-byte data elements only makes sense if there is one memory transaction per FULL warp. Then I have 4 * 32 = 128 bytes.

Please, help!

Yes, a segment of 64 bytes would be large enough, but as far as I know you COULD have the data spread out over the full 128 bytes. That means you could, for example, read only every second float and still have a coalesced access (with CC 1.3).
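As an illustration (again, a sketch of my own): in the kernel below, a half warp (threads 0 - 15) reads floats 0, 2, 4, …, 30, which all lie within one 128-byte segment, so on CC 1.3 it should still be serviced by a single 128-byte transaction, even though half the fetched bytes go unused.

__global__ void strideTwoRead(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i < n)
        out[i] = in[2 * i];  // stride-2: a half warp still fits one 128-byte segment
}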

Ah, ok! I understand now. Thanks a lot!

best regards

A last question:

Are memory accesses coalesced if threads of a half warp access the same data element? For example, I have something like this:

threads 0 - 15 → int x = globalMemArray[threadIdx.x / 2];

So threads 0 and 1 access globalMemArray[0], threads 2 and 3 access globalMemArray[1], …
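For reference, here is the pattern wrapped into a complete kernel (my own illustrative version, assuming an int array):

__global__ void pairedRead(const int *globalMemArray, int *out)
{
    // threads 2k and 2k+1 both read element k, so a half warp
    // touches elements 0..7, i.e. 32 bytes of data
    int x = globalMemArray[threadIdx.x / 2];
    out[threadIdx.x] = x;
}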

How many data transfers are required in this case?

Thanks in advance!

Is there nobody who can help?

IIRC, on compute capability < 1.2 devices it’s uncoalesced and requires 16 separate fetches. Compute 1.2+ devices are smarter than that and coalesce it (one transaction, since all the data fits in a single 32-byte segment).