global to shared mem loads and sync

Hi,

I’m currently unsuccessfully debugging my CUDA application, so I would appreciate a verification of the following piece of code:

if( tid < 5 )
    triMem[tid] = cptrTriData[idx*5+tid];

triMem is a pointer to shared memory, cptrTriData to global memory. Both are of type float2*. What I want to do is load a 40-byte structure with the first 5 threads into shared memory and subsequently process the data with all 32 threads of the block. I am relying on implicit warp synchronisation; however, I am not sure if this also works for reads from global to shared memory because of the latency. Should I add a __syncthreads(), or does this code work correctly?

Thanks, quak

Yes, you should. “The issue order of the warps within a block is undefined […]” (Programming Guide 1.1 page 15 section 3.2).

Even if you know the first warp will execute first, it will be held back while waiting for the global memory fetch, allowing the second warp (which is not fetching from global memory) not only to execute, but to continue on into code dependent on triMem, and if the data is not loaded yet, you lose. ;-)
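In other words, a minimal sketch of the safe version (only the load-and-sync pattern is from your post; the kernel shell and the idx computation are assumptions):

__global__ void processTris( const float2* cptrTriData )
{
    __shared__ float2 triMem[5];           // 5 * sizeof(float2) = 40 bytes

    const unsigned int tid = threadIdx.x;
    const unsigned int idx = blockIdx.x;   // one structure per block (assumed)

    if( tid < 5 )
        triMem[tid] = cptrTriData[idx*5+tid];

    __syncthreads();   // make the loaded data visible to all 32 threads

    // ... all 32 threads of the block now process triMem ...
}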

I don’t think you fully understood my question; I am not concerned with multiple warps but with multiple threads within a warp. The data loaded into triMem will only be processed by threads within the same warp, so warp issue order does not matter.

Okay, but your warp will diverge at the if-condition, making the scheduler serialize it. Without __syncthreads() it will not converge again, and I don’t see why the scheduler should not execute threads 5-31 during the memory latency of threads 0-4.

kuisma:

There is no divergence to expect, as the if-condition is correlated with the thread id, so there is no serialization. The synchronizing mechanism is needed for preventing data hazards, not for divergence issues.

quak:

I am not really sure if syncs are needed. In 4.4.2 it says that it is used for preventing data hazards, i.e. “when some threads within a block access the same addresses in shared or global memory”. Although your block consists of 32 threads only, i.e. one warp, you might need a sync depending on your code after reading from global mem.

Cem

The if-condition must be a multiple of the warp size (i.e. 32) to avoid a divergent warp and serialization. 5 is not an even multiple of 32, so serialization will occur here.

I don’t really know what you mean by: the “if-condition must be a multiple of the warp size”.

In PTX 7.5, “Divergence of threads in control constructs”, it says that threads executing different control flow paths are called divergent. Nothing is said about multiples of the warp size.

However, the if-condition itself does not make threads diverge. Or can you tell me which of the 32 threads will execute a different control flow path?

One warp executes as a unit. If different threads of one and the same warp choose different control flow paths, it can no longer execute as one unit: the warp is divergent. This is solved transparently by the scheduler, but you lose performance due to serialization.

This is documented in the programming guide v1.1 heading 5.1.1.2.

The warp scheduler will have to break up the threads in a warp into different warps if there is divergence within the warp. This reduces parallelism, since the now-divided warps will have empty slots. If the branching is such that all the threads in the warp take the same branch, then there will be no divergence. This is often achieved in practice by making the if-condition depend on (threadIdx.x/32), which will be the same number for all threads in a warp if the warp size is 32 (as it is for all current chips), as in the sketch below.
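For illustration, a minimal sketch of such a warp-uniform branch (the kernel and buffer names are made up):

__global__ void warpUniformBranch( const float* in, float* out )
{
    // threadIdx.x/32 is identical for every thread of a warp, so the
    // whole warp takes the same path and never diverges
    const unsigned int warpId = threadIdx.x / 32;

    if( warpId == 0 )
        out[threadIdx.x] = in[threadIdx.x] * 2.0f;   // all of warp 0
    else
        out[threadIdx.x] = in[threadIdx.x] + 1.0f;   // all of warps 1..n
}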

All threads in a warp must follow the same branch. Otherwise you’ll get a performance penalty.

UPD: Too slow =)

Aha, I understand. But in the case of setting a condition like

if (threadIdx.x < N){
    // Code
}

don’t I guarantee that the branch point is non-divergent? I thought that in this case some of the threads within this warp will be inactive (threadIdx.x >= N) and others active, but there will be no divergence.

That depends on what comes after the block. If you don’t have a __syncthreads() after the if statement, the threadIdx.x >= N threads can keep going, as in the sketch below.
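A sketch of the hazard being discussed (all names here are hypothetical): without the barrier, the threadIdx.x >= N threads can race ahead and read shared memory before threads 0..N-1 have written it.

#define N 5

__global__ void hazardExample( const float* gData, float* out )
{
    __shared__ float sData[N];

    if( threadIdx.x < N )
        sData[threadIdx.x] = gData[threadIdx.x];   // only N threads write

    __syncthreads();   // without this, the readers below may see stale data

    out[threadIdx.x] = sData[threadIdx.x % N];     // every thread reads
}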

Okayyyy, things are still not 100 percent clear to me.

Well, if I don’t have any data dependencies through which data hazards might occur, it might not bother me if threads with thx >= N go on.

  1. Does divergence occur or not? To my mind this must not depend on the following code.

In case warp divergence occurs:

2.1. Does synchronizing help warps to reconverge, or does this happen automatically? Must I put in a __syncthreads() for a reconvergence?

2.2. Can we know which warps are diverged? I would assume that this might depend on N. Say N is 10 and we have 4 warps within a block; then it should be that the last three warps do not diverge.

Thanks for the answer.

My understanding of warp divergence is the following:
When different threads inside a warp follow two different execution paths, they will be serialized into two execution “blocks”. These “blocks” can be executed in any order; however, as soon as they both return to the same execution path, the threads that reach this “convergence point” first must wait for all threads to reach it. Otherwise it could not be guaranteed that divergent threads ever converge again. They could even execute the same execution path with a stride and thus stay divergent. In such a scenario one would always need to insert a __syncthreads() after each branch to make it possible for divergent threads to converge again.

Example:

//code block 0
if( condition ){
    //code block 1
}
//code block 2

Threads that do not execute code block 1 have to wait for all other threads at the beginning of code block 2, because that is where the execution paths converge again.
That also means that code block 1 has to be executed before code block 2 if any of the threads evaluate the condition to be true.

So if “code block 1” performs a “return”, “goto”, “break” or something else making it not proceed to “code block 2”, this would deadlock?

If you have warp divergence because of the condition, I am not sure that threads which do not execute code block 1 wait for the other threads of the same warp. I also think that they follow a different execution path until a convergence point is set.

In PTX 1.1 it is said that “for divergent control flow, the optimizing code generator automatically determines points of reconvergence”. So I am still not sure whether a __syncthreads() helps to reconverge?! I always thought, because of the description in the programming guide, that __syncthreads() is made for avoiding data hazards. But of course the guide also says that, once all threads have reached this point, execution resumes normally.

Besides the point of reconvergence, I am still not sure when divergence of a warp happens. In other words, when or whether threads follow two different execution paths (especially in your case: if(thx < N)). Can’t it also be that there is no divergence?

Example:

//code block 0
if( condition ){
    //code block 1
    if( another_condition ) return;
}
//point of reconvergence 0
//code block 2
//point of reconvergence 1
return;

All threads execute code block 0. Then the threads diverge: some will execute code block 1 and the rest will directly reach por0. por0 now acts like a barrier for code execution, so the conditional code, i.e. code block 1, will be executed first. Then another conditional is met and threads diverge again: some will reach por1 and the rest will reach por0, where the other threads are waiting to continue execution. So at this point all threads converge again, except those that have already reached por1. Code block 2 is now executed, and finally at por1 all threads converge again.

Why should CUDA not exploit this perfectly good parallel execution opportunity by forcing a convergence point? Depending on timing, latency etc., the scenario you describe MAY occur, but I sure don’t think it MUST occur. It would be a waste of resources.

If you need a synchronization point, use __syncthreads(). All other forms of magic are doomed to make your code unstable and dependent on compilers, driver versions, subtle timing issues and such.

Because it isn’t a perfectly good execution opportunity, as code block 2 also needs to be executed by the threads that execute code block 1, so you would end up executing code block 2 (and all subsequent code) twice until an explicit __syncthreads() is reached. That would mean that every if(){} construct should be followed by a __syncthreads(), right?

No, that would be as inefficient as if every if-condition was followed by an implicit forced synchronization point, as you imply.

if (threadIdx.x & 0x01)      // odd threads
    x = devicememory[42];

if (!(threadIdx.x & 0x01))   // even threads
    x = devicememory[4711];

__syncthreads();

I’d really like this code to diverge so that both paths can be suspended simultaneously while waiting for memory latency, compared to having two consecutive memory waits.

Of course no code will be executed twice, but execution slots in the SIMD may go wasted. Still, this can be a cheap price to pay compared to memory latency.