coalescing problem

bit_mapper · August 2, 2011, 5:00pm

I have a question about coalescing access the global memory. Theoretically according to the program guide, coalesced accessing the consecutive 32 data address by the consecutive 32 threads would greatly reduce the memory transactions and latency, like below

T0 T1 T2…T31
| | | |
D0 D1 D2…D31

Then all 32 access can be coalesced into just one memory transaction. I have two questions generally:

What if consecutive N threads in a group read the same data address, like N=4 below

Will this introduce more overhead and do harm to the efficiency?

What if I assign all the thread blocks to do the same thing, i.e. let consecutive N threads read consecutive N address. In this case, D0 will be read simultaneously by all the T0 from different thread blocks. Will this be a conflict and reduce the access efficiency or just like a broadcasting?
I’m coalescing all threads of a block to read from the array “Data”. Data[threadIdx.x] is the data that threadIdx.x fetched. If I’m also want this thread to access Data[threadIdx.x + blockDim.x] in the program, is the fetch of Data[threadIdx.x + blockDim.x] also automatically coalesced? What I’m worrying about is that it is true for the later access, consecutive threads are reading consecutive addresses, but they might not be at the same time due to the execution of previous part of the program.

Thanks

Dude1205 · August 3, 2011, 9:53am

I have a question about coalescing access the global memory. Theoretically according to the program guide, coalesced accessing the consecutive 32 data address by the consecutive 32 threads would greatly reduce the memory transactions and latency, like below

T0 T1 T2…T31

| | | |

D0 D1 D2…D31

Then all 32 access can be coalesced into just one memory transaction. I have two questions generally:

What if consecutive N threads in a group read the same data address, like N=4 below

T0 T1 T2 T3 T4 T5 T6 T7 T8…T28 T29 T30 T31

| | | | | | | | | | | | |

D0 D0 D0 D0 D0 D1 D1 D1 D1…D7 D7 D7 D7

Will this introduce more overhead and do harm to the efficiency?

What if I assign all the thread blocks to do the same thing, i.e. let consecutive N threads read consecutive N address. In this case, D0 will be read simultaneously by all the T0 from different thread blocks. Will this be a conflict and reduce the access efficiency or just like a broadcasting?

I’m coalescing all threads of a block to read from the array “Data”. Data[threadIdx.x] is the data that threadIdx.x fetched. If I’m also want this thread to access Data[threadIdx.x + blockDim.x] in the program, is the fetch of Data[threadIdx.x + blockDim.x] also automatically coalesced? What I’m worrying about is that it is true for the later access, consecutive threads are reading consecutive addresses, but they might not be at the same time due to the execution of previous part of the program.

Thanks

Hi,

I can’t answer question 1. I’ve never tested such access patterns and I’d rather let someone else answer the question.

Concerning question 2, different blocks do not necessarily run simultaneously ; coalescing happens at warp level. These accesses should neither cause conflicts nor will they be broadcast (note : you might benefit from the cache, if you have one on your card, but that would need to be confirmed by someone else).

I’m not sure I understand question 3. If you’re worried about coalescing when different threads of the same block have taken different paths in your code, then you might want to read some documentation about “diverging branches”, because it is really something you want to avoid as much as possible.

As far as coalescing is concerned, this will NOT be a problem. Again, coalescing is a warp-level issue and threads within a warp execute instructions in a lock step fashion, meaning that all the threads within a warp always execute the same instruction at any given time.

The way divergent branches within a warp are handled, is that, if at a given if statement two threads of the same warp take different paths, both paths will be executed sequentially by all the threads within this warp and instructions executed by threads that are not concerned with a given path will have no effect.

Hope this helps.

bit_mapper · August 3, 2011, 7:49pm

Thanks very much Dude:)

For question 2, I know that different blocks do not necessarily run simultaneously, because there might be multiple blocks handled and kept swapping on just one multiprocessor. But what if a data address is to be access simultaneously by all the threads with the same index, e.g. for all the threadInx.x=1, let them access data[1] in global memory at the same time, whether they are in different blocks of one multiprocessor, or even on different multiprocessors.

For question 3, I got your point that as long as there is no branching within a warp, all the operation, like read access, calculation, and write back should all be finished in lock-step fashion. Then if all threads within a warp are working like this, is my understanding correct that no synchronization is needed within a warp after execution? Synchronizations are possibly only needed between warps when you assign all warps of a block do the same thing.

BTW: what do you mean by “instructions executed by threads that are not concerned with a given path will have no effect.”

Thank you.

Hi,

I can’t answer question 1. I’ve never tested such access patterns and I’d rather let someone else answer the question.

Concerning question 2, different blocks do not necessarily run simultaneously ; coalescing happens at warp level. These accesses should neither cause conflicts nor will they be broadcast (note : you might benefit from the cache, if you have one on your card, but that would need to be confirmed by someone else).

I’m not sure I understand question 3. If you’re worried about coalescing when different threads of the same block have taken different paths in your code, then you might want to read some documentation about “diverging branches”, because it is really something you want to avoid as much as possible.

As far as coalescing is concerned, this will NOT be a problem. Again, coalescing is a warp-level issue and threads within a warp execute instructions in a lock step fashion, meaning that all the threads within a warp always execute the same instruction at any given time.

The way divergent branches within a warp are handled, is that, if at a given if statement two threads of the same warp take different paths, both paths will be executed sequentially by all the threads within this warp and instructions executed by threads that are not concerned with a given path will have no effect.

Hope this helps.

Dude1205 · August 4, 2011, 9:05am

Thanks very much Dude:)

For question 2, I know that different blocks do not necessarily run simultaneously, because there might be multiple blocks handled and kept swapping on just one multiprocessor. But what if a data address is to be access simultaneously by all the threads with the same index, e.g. for all the threadInx.x=1, let them access data[1] in global memory at the same time, whether they are in different blocks of one multiprocessor, or even on different multiprocessors.

For question 3, I got your point that as long as there is no branching within a warp, all the operation, like read access, calculation, and write back should all be finished in lock-step fashion. Then if all threads within a warp are working like this, is my understanding correct that no synchronization is needed within a warp after execution? Synchronizations are possibly only needed between warps when you assign all warps of a block do the same thing.

BTW: what do you mean by “instructions executed by threads that are not concerned with a given path will have no effect.”

Thank you.

As far as I know, having different blocks read from the same memory location simultaneously should not be a problem.

Concerning question 3 bis, threads within a warp will always execute instructions in a lock-step fashion, even if there are divergent branches. Synchronization within a warp is not necessary with one exception. I read yesterday on this forum (I can’t remember the name of the post) that if one thread writes to a memory location and later on reads from that same memory location, recent versions of the compiler might optimize the code and keep the value in register to avoid a global memory transaction. In this particular case, only that thread will have the correct value for that memory location. This behavior can be prevented by declaring the variable to be “volatile”.

What I meant by “instructions executed by threads that are not concerned with a given path will have no effect.” is the following :

If you are doing something like this (awful example) :

if(threadIdx.x < WARP_SIZE/2)

{

    data[offset + threadIdx.x] = some_value;

}

then even though only the first half of the first warp is concerned with the instruction within the if statement, all the threads within that first warp will execute the instruction but only the first half will actually write to the memory location. The only consequence is that within a warp, threads with “shorter” paths cannot be ahead of other threads. You can’t expect either to have half of you warp execute one branch while the other one is executing another branch. This is why divergent branches are to be avoided as much as possible, because you are basically wasting cycles.

bit_mapper · August 8, 2011, 6:44pm

I see. Then I shall always try to group a warp of 32 threads to do the same thing.

Then the question left is what if consecutive N threads in a group read the same data address, like N=4 below

T0 T1 T2 T3 T4 T5 T6 T7 T8…T28 T29 T30 T31

| | | | | | | | | | | | |

D0 D0 D0 D0 D0 D1 D1 D1 D1…D7 D7 D7 D7

As far as I know, having different blocks read from the same memory location simultaneously should not be a problem.

Concerning question 3 bis, threads within a warp will always execute instructions in a lock-step fashion, even if there are divergent branches. Synchronization within a warp is not necessary with one exception. I read yesterday on this forum (I can’t remember the name of the post) that if one thread writes to a memory location and later on reads from that same memory location, recent versions of the compiler might optimize the code and keep the value in register to avoid a global memory transaction. In this particular case, only that thread will have the correct value for that memory location. This behavior can be prevented by declaring the variable to be “volatile”.

What I meant by “instructions executed by threads that are not concerned with a given path will have no effect.” is the following :

If you are doing something like this (awful example) :
if(threadIdx.x < WARP_SIZE/2)

{

    data[offset + threadIdx.x] = some_value;

}
then even though only the first half of the first warp is concerned with the instruction within the if statement, all the threads within that first warp will execute the instruction but only the first half will actually write to the memory location. The only consequence is that within a warp, threads with “shorter” paths cannot be ahead of other threads. You can’t expect either to have half of you warp execute one branch while the other one is executing another branch. This is why divergent branches are to be avoided as much as possible, because you are basically wasting cycles.

Topic		Replies	Views
There must be a reason, but I can not figure out:( CUDA Programming and Performance	8	8090	January 3, 2008
Whether this is coalescing access several cases to decide CUDA Programming and Performance	0	1562	August 2, 2011
questions about coalescing access coalescing access CUDA Programming and Performance	8	1971	November 13, 2009
N threads read N+1 elements: Coalesced possible? CUDA Programming and Performance	10	4089	March 11, 2008
Shared memory optimization for algorithm accessing neighbors CUDA Programming and Performance cuda , kernel	5	884	December 15, 2021
Memory access should be coalesced but is not CUDA Programming and Performance	6	1062	May 16, 2019
Some Performance Consideration Questions warp divergence, coalescing and shared mem then and now CUDA Programming and Performance	1	2009	March 8, 2012
Registers and Shared Memory question CUDA Programming and Performance	7	5452	September 10, 2007
char global memory access optimization CUDA Programming and Performance	17	11871	May 31, 2010
Accessing same global memory address within warps CUDA Programming and Performance	4	4121	October 24, 2018

coalescing problem

Related topics