Question about coalesced memory access

Hi!

I’m new to CUDA and need your help. I’m trying to understand what coalesced memory access means and how to use it. My kernel works with 16-bit unsigned short arrays. The programming guide says that a per-thread access of 2 bytes is coalesced if all words lie in a 64-byte segment. But why a 64-byte segment? Wouldn’t a 32-byte segment be big enough?

In my case they lie in a 32-byte segment; a 64-byte segment would contain data for 32 threads. Would that still be coalesced?

Thanks in advance!

If you want to read 2-byte data with 32 threads (that’s the warp size, and coalescing is done per warp anyway), and every thread accesses a different 2-byte element, then for full coalescing those elements would need to lie within one 64-byte (32 * 2) region of memory. If several threads access the same element (say threads 0 and 1 access element 0, threads 2 and 3 access element 1, and so on), it would still be done with one read.
If one half of your threads read their data from one 64-byte segment and the other half from another 64-byte segment, it would be done in 2 memory fetch operations. Generally: if a warp reads 2-byte data from n different 64-byte segments (no matter which thread accesses which element), n memory fetch operations are issued.
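To make the fully coalesced case concrete, here is a minimal sketch (the kernel and names are my own, just for illustration): thread i of a warp reads bytes 2*i and 2*i+1, so the whole warp covers exactly one 64-byte segment, assuming the array is aligned (which cudaMalloc guarantees).

__global__ void copyShorts(const unsigned short *in, unsigned short *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // consecutive threads read consecutive 2-byte words
}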

Best regards
Ceearem

Thanks for this answer!

Are you sure? I thought only a HALF warp accesses a segment!?

regards!

What compute capability are you on? I don’t think CC < 1.2 will coalesce this at all (though I’ve only ever worked on CC 1.3, so I’m not very familiar with it).

For CC >= 1.2, Ceearem’s explanation is correct, except that in most places where it reads “warp” it should read “half warp”.

EDIT: Further info (CC >= 1.2) - if you are aligned and accessing a 32-byte segment with a half warp, the memory transaction will be only 32 bytes rather than 64, as there is a mechanism to reduce the transaction size. Allowing 2-byte data to sit anywhere in a 64-byte segment lets you do better if you are unaligned or have a stride.
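If you want to see the alignment effect yourself, a simple experiment (my own sketch, not from the guide) is to shift the starting index by an offset and compare bandwidth: with offset = 0 the half warp’s reads fall into the minimum number of segments, while a nonzero offset makes them straddle a segment boundary and costs an extra transaction.

__global__ void offsetRead(const float *in, float *out, int offset, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i + offset < n)
        out[i] = in[i + offset];  // offset != 0 breaks segment alignment
}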

You are right, so far it’s a half warp. There is a warning, though, that this will change in future architectures (and since they probably don’t mean years from now, I think it’s likely that it might already change with the next generation of GPUs due at the end of this year).

Also, all of this is only true (as Tigga wrote) if you have compute capability 1.2 or higher. For compute capabilities below that, there are more requirements (i.e. threads have to access words in strict order [I’m not sure whether that even means 4-byte order, that is, the first thread accesses the first 4 bytes, the second the next 4, and so on]). If any of these requirements is not fulfilled, there will be a single memory fetch operation for each and every thread.

Best regards

Ceearem

Sorry, but I don’t understand.

If I have an array of 16 floats (16 * 4 bytes), a HALF warp can transfer all the data in one memory transaction because it’s in one segment (the programming guide says a 128-byte segment). But 16 * 4 = 64 bytes, so a 64-byte segment would be big enough!

A 128-byte segment for 4-byte data elements only makes sense if there is one memory transaction per FULL warp. Then I have 4 * 32 = 128 bytes.

Please, help!

Yes, a segment of 64 bytes would be large enough, but as far as I know you COULD have the data spread out over the full 128 bytes. That means you could, for example, read only every second float and still have a coalesced access (with CC 1.3).
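As an illustration (again, a sketch of my own): in the kernel below, a half warp (threads 0 - 15) reads floats 0, 2, 4, …, 30, which all lie within one 128-byte segment, so on CC 1.3 it should still be serviced by a single 128-byte transaction, even though half the fetched bytes go unused.

__global__ void strideTwoRead(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i < n)
        out[i] = in[2 * i];  // stride-2: a half warp still fits one 128-byte segment
}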

Ah, ok! I understand now. Thanks a lot!

best regards

A last question:

Are memory accesses coalesced if threads of a half warp access the same data element? For example, I have something like this:

threads 0 - 15 → int x = globalMemArray[threadIdx.x / 2];

So threads 0 and 1 access globalMemArray[0], threads 2 and 3 access globalMemArray[1], …
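For reference, here is the pattern wrapped into a complete kernel (my own illustrative version, assuming an int array):

__global__ void pairedRead(const int *globalMemArray, int *out)
{
    // threads 2k and 2k+1 both read element k, so a half warp
    // touches elements 0..7, i.e. 32 bytes of data
    int x = globalMemArray[threadIdx.x / 2];
    out[threadIdx.x] = x;
}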

How many data transfers are required in this case?

Thanks in advance!

Is there nobody who can help?

IIRC, on compute capability < 1.2 devices it’s uncoalesced and requires 16 separate fetches. Compute 1.2+ devices are smarter than that and coalesce it (one transaction, since all the data fits in a single 32-byte segment).