global to shared mem loads and sync

Hi,

I’m currently unsuccessfully debugging my CUDA application, so I would appreciate a verification of the following piece of code:

if( tid < 5 )
    triMem[tid] = cptrTriData[idx*5+tid];

triMem is a pointer to shared memory, cptrTriData to global memory. Both are of type float2*. What I want to do is load a 40-byte structure with the first 5 threads into shared memory and subsequently process the data with all 32 threads of the block. I am relying on implicit warp synchronisation; however, I am not sure if this also works for reads from global to shared memory because of the latency. Should I add a __syncthreads(), or does this code work correctly?

Thanks, quak

Yes, you should. “The issue order of the warps within a block is undefined […]” (Programming Guide 1.1 page 15 section 3.2).

Even if you know the first warp will execute first, it will be held back while waiting for the global memory fetch, allowing the second warp (which is not fetching from global memory) not only to execute, but to continue on into code dependent on triMem, and if the data is not loaded yet, you lose. ;-)
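In other words, a minimal sketch of the safe version (only the load-and-sync pattern is from your post; the kernel shell and the idx computation are assumptions):

__global__ void processTris( const float2* cptrTriData )
{
    __shared__ float2 triMem[5];           // 5 * sizeof(float2) = 40 bytes

    const unsigned int tid = threadIdx.x;
    const unsigned int idx = blockIdx.x;   // one structure per block (assumed)

    if( tid < 5 )
        triMem[tid] = cptrTriData[idx*5+tid];

    __syncthreads();   // make the loaded data visible to all 32 threads

    // ... all 32 threads of the block now process triMem ...
}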

I don’t think you fully understood my question; I am not concerned with multiple warps but with multiple threads within a warp. The data loaded into triMem will only be processed by threads within the same warp, so warp issue order does not matter.

Okay, but your warp will diverge at the if-condition, making the scheduler serialize it. Without __syncthreads() it will not converge again, and I don’t see why the scheduler should not execute threads 5-31 during the memory latency of threads 0-4.

kuisma:

There is no divergence to expect, as the if-condition is correlated with the thread id, so there is no serialization. The synchronizing mechanism is needed for preventing data hazards, not for divergence issues.

quak:

I am not really sure if syncs are needed. In 4.4.2 it says that it is used for preventing data hazards, i.e. “when some threads within a block access the same addresses in shared or global memory”. Although your block consists of 32 threads only, i.e. one warp, you might need a sync depending on your code after reading from global mem.

Cem

The if-condition must be a multiple of the warp size (i.e. 32) to avoid a divergent warp and serialization. 5 is not an even multiple of 32, so serialization will occur here.

I don’t really know what you mean by: the “if-condition must be a multiple of the warp size”.

In PTX 7.5, “Divergence of threads in control constructs”, it says that threads executing different control flow paths are called divergent. Nothing is said about multiples of the warp size.

However, the if-condition itself does not make threads diverge. Or can you tell me which of the 32 threads will execute a different control flow path?

One warp executes as a unit. If different threads of one and the same warp choose different control flow paths, it can no longer execute as one unit: the warp is divergent. This is solved transparently by the scheduler, but you lose performance due to serialization.

This is documented in the programming guide v1.1 heading 5.1.1.2.

The warp scheduler will have to break up the threads in a warp into different warps if there is divergence within the warp. This reduces parallelism, since the now-divided warps will have empty slots. If the branching is such that all the threads in the warp take the same branch, then there will be no divergence. This is often achieved in practice by making the if-condition depend on (threadIdx.x/32), which will be the same number for all threads in a warp if the warp size is 32 (as it is for all current chips), as in the sketch below.
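For illustration, a minimal sketch of such a warp-uniform branch (the kernel and buffer names are made up):

__global__ void warpUniformBranch( const float* in, float* out )
{
    // threadIdx.x/32 is identical for every thread of a warp, so the
    // whole warp takes the same path and never diverges
    const unsigned int warpId = threadIdx.x / 32;

    if( warpId == 0 )
        out[threadIdx.x] = in[threadIdx.x] * 2.0f;   // all of warp 0
    else
        out[threadIdx.x] = in[threadIdx.x] + 1.0f;   // all of warps 1..n
}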

All threads in a warp must follow the same branch. Otherwise you’ll get a performance penalty.

UPD: Too slow =)

Aha, I understand. But in the case of setting a condition like

if (threadIdx.x < N){
    // Code
}

don’t I guarantee that the branch point is non-divergent? I thought that in this case some of the threads within this warp will be inactive (threadIdx.x >= N) and others active, but there will be no divergence.

That depends on what comes after the block. If you don’t have a __syncthreads() after the if statement, the threadIdx.x >= N threads can keep going, as in the sketch below.
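A sketch of the hazard being discussed (all names here are hypothetical): without the barrier, the threadIdx.x >= N threads can race ahead and read shared memory before threads 0..N-1 have written it.

#define N 5

__global__ void hazardExample( const float* gData, float* out )
{
    __shared__ float sData[N];

    if( threadIdx.x < N )
        sData[threadIdx.x] = gData[threadIdx.x];   // only N threads write

    __syncthreads();   // without this, the readers below may see stale data

    out[threadIdx.x] = sData[threadIdx.x % N];     // every thread reads
}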

Okayyyy, things are still not 100 percent clear to me.

Well, if I don’t have any data dependencies through which data hazards might occur, it might not bother me if threads with thx >= N go on.

  1. Does divergence occur or not? To my mind this must not depend on the following code.

In case warp divergence occurs:

2.1. Does synchronizing help warps to reconverge, or does this happen automatically? Must I put in a __syncthreads() for a reconvergence?

2.2. Can we know which warps are diverged? I would assume that this might depend on N. Say N is 10 and we have 4 warps within a block; then it should be that the last three warps do not diverge.

Thanks for the answer.

My understanding of warp divergence is the following:
When different threads inside a warp follow two different execution paths, they will be serialized into two execution “blocks”. These “blocks” can be executed in any order; however, as soon as they both return to the same execution path, the threads that reach this “convergence point” first must wait for all threads to reach it. Otherwise it could not be guaranteed that divergent threads ever converge again. They could even execute the same execution path with a stride and thus stay divergent. In such a scenario one would always need to insert a __syncthreads() after each branch to make it possible for divergent threads to converge again.

Example:

//code block 0
if( condition ){
    //code block 1
}
//code block 2

Threads that do not execute code block 1 have to wait for all other threads at the beginning of code block 2, because that is where the execution paths converge again.
That also means that code block 1 has to be executed before code block 2 if any of the threads evaluate the condition to be true.

So if “code block 1” performs a “return”, “goto”, “break” or something else making it not proceed to “code block 2”, this would deadlock?

If you have warp divergence because of the condition, I am not sure that threads which do not execute code block 1 wait for the other threads of the same warp. I also think that they follow a different execution path until a convergence point is set.

In PTX 1.1 it is said that “for divergent control flow, the optimizing code generator automatically determines points of reconvergence”. So I am still not sure whether a __syncthreads() helps to reconverge?! I always thought, because of the description in the programming guide, that __syncthreads() is made for avoiding data hazards. But of course the guide also says that, once all threads have reached this point, execution resumes normally.

Besides the point of reconvergence, I am still not sure when divergence of a warp happens. In other words, when or whether threads follow two different execution paths (especially in your case: if(thx < N)). Can’t it also be that there is no divergence?

Example:

//code block 0
if( condition ){
    //code block 1
    if( another_condition ) return;
}
//point of reconvergence 0
//code block 2
//point of reconvergence 1
return;

All threads execute code block 0. Then the threads diverge: some will execute code block 1 and the rest will directly reach por0. por0 now acts like a barrier for code execution, so the conditional code, i.e. code block 1, will be executed first. Then another conditional is met and threads diverge again: some will reach por1 and the rest will reach por0, where the other threads are waiting to continue execution. So at this point all threads converge again, except those that have already reached por1. Code block 2 is now executed, and finally at por1 all threads converge again.

Why should CUDA not exploit this perfectly good parallel execution opportunity by forcing a convergence point? Depending on timing, latency etc., the scenario you describe MAY occur, but I sure don’t think it MUST occur. It would be a waste of resources.

If you need a synchronization point, use __syncthreads(). All other forms of magic are doomed to make your code unstable and dependent on compilers, driver versions, subtle timing issues and such.

Because it isn’t a perfectly good execution opportunity, as code block 2 also needs to be executed by the threads that execute code block 1, so you would end up executing code block 2 (and all subsequent code) twice until an explicit __syncthreads() is reached. That would mean that every if(){} construct should be followed by a __syncthreads(), right?

No, that would be as inefficient as if every if-condition was followed by an implicit forced synchronization point, as you imply.

if (threadIdx.x & 0x01)      // odd threads
    x = devicememory[42];

if (!(threadIdx.x & 0x01))   // even threads
    x = devicememory[4711];

__syncthreads();

I’d really like this code to diverge so that both paths can be suspended simultaneously while waiting for memory latency, compared to having two consecutive memory waits.

Of course no code will be executed twice, but execution slots in the SIMD may go wasted. Still, this can be a cheap price to pay compared to memory latency.