if-else WARP divergence

Let's say I have an if-else construct like this:

if ((threadIdx.x & (WSIZE-1)) == (WSIZE-1))
{
    // ... the last thread of the warp does this ...
} else {
    // ... the remaining threads do this ...
}

where WSIZE is the WARP SIZE (32).
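For concreteness, here is a minimal self-contained kernel built around that construct (a sketch of my own, with placeholder body statements; WSIZE is assumed to be the warp size, 32):

#define WSIZE 32

__global__ void diverge(int *out)
{
    // Each warp diverges: its last thread takes the IF, the other 31 take the ELSE.
    if ((threadIdx.x & (WSIZE - 1)) == (WSIZE - 1))
        out[threadIdx.x] = 1;   // placeholder work for the last thread of the warp
    else
        out[threadIdx.x] = 2;   // placeholder work for the remaining threads
}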

Now, the IF-ELSE construct basically says that the last thread in the WARP executes something and the remaining threads execute something else.

So, there's a divergence.

Now, my question is: can I assume that the statements under the "IF" will be executed before the "ELSE"? Or will the WARP scheduler execute the ELSE statement first, since the majority of threads take that path? Or does this depend on how the compiler compiles it?

Thanks

Can an NVIDIA person assist me with this question here? Thanks a lot.

You cannot rely on any particular order, just as you wouldn't rely on the order of the then- and else-clauses on CPUs with multiple threads.

Paulius

First of all, thanks for answering. No one else wanted to reply to this thread; finally you did. Thanks.

I don't think you can compare this with a CPU. A CPU does NOT have simultaneous execution; multi-tasking might make it look concurrent, but it is NOT parallel the way threads in a WARP are.

I was talking about WARP divergence in my post, i.e. divergence among the threads executing in the WARP.

Consider a WARP executing the statement above. All 32 threads evaluate the IF condition. The LAST thread in the WARP satisfies the IF; all other threads in the WARP take the ELSE part. Now, in the next instruction cycle, the WARP scheduler has two options. It can schedule the LAST thread first, finish it off, then schedule the remaining threads, finish them off, and finally have all threads converge back together. Or it could schedule the remaining 31 threads first, followed by the last thread, and then converge.

So, I was wondering which one will be given precedence. I would guess that this has something to do with how the COMPILER generates code. But if someone from NVIDIA could confirm it, that would be great!

If you want to guarantee order, why not simply do the following:

if (condition)

  stuff to do first

if (!condition)

  stuff to do second

And who tells you that the compiler does not optimize that in such a way that the order gets reversed?

This is a bit abstract without a specific code example. I guess I was actually assuming some data dependency between the two, and probably a __syncthreads(). Otherwise, why should anyone care what order they are performed in?

Perhaps the CPU analogy wasn't ideal, but the idea is the same. Threads within a warp execute the same instruction. Thus, in the case of a conditional statement, they all execute either the "then" or the "else" clause together; threads that don't meet the criteria for the clause being executed don't fetch or write operands. So, if you have two divergent paths within a warp, the two will be serialized, with the entire warp executing both. That's where the performance penalty comes from if you diverge within a warp.

The CUDA language does not give precedence to either of the divergent paths. For a true if-else statement, execution order should not matter. If one clause is dependent on a result from the other, then it's not really an if-else situation. You should then follow Jonathan's advice above and use two conditional statements, with a __syncthreads() in between to avoid shared-memory dependency hazards.
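In other words, something like this sketch (placeholder names and work of my own, assuming a single 32-thread block):

__global__ void ordered(int *out)
{
    __shared__ int flag;             // value produced by the "first" clause

    if (threadIdx.x == 31)           // stuff to do first: the last thread of the warp writes
        flag = 1;

    __syncthreads();                 // makes the write above visible before anyone reads it

    if (threadIdx.x != 31)           // stuff to do second: the other threads consume the value
        out[threadIdx.x] = flag;
}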

Paulius

Thanks to all you guys for your replies.

Yes, true. Predicated execution. Thanks. Eventually I did it Jonathan's way. So, it all depends on how the compiler has generated code.

If you are talking about a single-processor system, then it's fine. But on a parallel computing system without a locking facility, a shared-memory dependency between the IF and the ELSE could matter. No?

See this (assuming only 1 WARP):

if (threadIdx.x == 15)
    shared_mem_1 = shared_mem_0++;
else
    shared_mem_1 = shared_mem_2++;

Now, what ends up in "shared_mem_1" depends on whether the IF is executed first or the ELSE is.

And regarding shared-memory hazards, I would like to add the following (my own take on this issue):

"

The CUDA 1.0 manual says that there is no sequential ordering between threads of the same WARP with respect to shared memory until '__syncthreads()' is executed. That is, if thread "I" refers to a shared memory location which is updated by thread "J", then there is no guarantee that thread "I" will get the latest data until __syncthreads() is executed.

At the outset, this looks like a hardware limitation. But it is NOT. It is basically a compiler optimization: the compiler caches "memory" locations in registers to avoid unnecessary "loads" in the generated code. The compiler does NOT optimize across a "__syncthreads()" call. That is why it looks like "__syncthreads()" solves the problem; the actual run-time implementation of "__syncthreads()" has nothing to do with it.

This is solvable by declaring the shared-memory variable as "volatile". The CUDA 1.0 manual does NOT come clean on this, but the CUDA 1.1 manual states it clearly. Using the "volatile" specifier is a guaranteed way of enforcing ordering amongst threads of the same WARP. Even the PTX 1.1 ISA manual talks about the "ld.volatile" and "st.volatile" specifiers in the assembly language.

NOTE: It would be better to rely on the "volatile" keyword to enforce sequential ordering amongst threads of the same warp only when you work with CUDA 1.1. The reason is that the PTX instruction set itself is virtual and is translated again for a "target" architecture, so the introduction of the "ld.volatile" and "st.volatile" instructions in PTX 1.1 could mean that these instructions are NOT optimized away during that translation. PTX 1.0 does NOT mention these variants.

"

The case you have is not really an if-else case, as discussed in previous posts.

I'm not sure the compiler "caches" smem values in registers. Have you observed this? What is meant by unspecified ordering when writing to smem is that if threads x and y of the same warp write different values to the same smem location, you can't count on knowing which one of the values will be in smem after the write completes.

Paulius

What are you talking about? __syncthreads() is a barrier; the manual says so. If warps 1-100 load data and then start processing that data, it could be that only warps 1-10 have finished loading data when warp 8 starts processing. Warp 8 could then try to read data that warp 100 should have loaded, but hasn't yet. __syncthreads() synchronizes them so that the processing step cannot occur until all threads have loaded their data. The volatile keyword will not solve the same problem as a barrier.

Right. Across warps – you need __syncthreads.

But I was talking about threads within a WARP. Check out section 4.2.2.3 in CUDA manual 1.1 and the corresponding section in CUDA manual 1.0. See the introduction of “volatile” there.

Consider this code:

#include <stdio.h>

// Assuming only 1 WARP is scheduled for a block.
__global__ void mykernel(int *globe)
{
    __shared__ int i;

    i = 25;
    i = i + threadIdx.x;

    if (threadIdx.x == 0)
        *globe = i;

    return;
}

int main()
{
    void *devptr;
    dim3 grid, block;
    int i;

    cudaMalloc(&devptr, sizeof(int));
    block.x = 32;

    mykernel<<< grid, block >>>((int*)devptr);
    cudaThreadSynchronize();

    cudaMemcpy(&i, devptr, sizeof(int), cudaMemcpyDeviceToHost);
    printf("I = %d\n", i);
}

What would be the value that one can expect to be printed from “main” ?

It is 25, because the compiler caches the value of the shared variable i in a register. Since thread 0 is the one that performs the global write, the value cached in thread 0's register is what gets written, and that is 25.

But is 25 the value that was actually present in the shared memory location? Maybe not; it depends on how the writes to the shared memory location were ordered.

Now, in this case, with only one warp, we do NOT have any ordering guarantee with respect to shared memory. Instead of "__syncthreads()", you can use the "volatile" keyword to get the ordering within one WARP.

There might be applications that need INTRA-WARP ordering. They could benefit from using "volatile" instead of __syncthreads(), which has a much larger overhead.
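For example, here is a sketch of the kernel above with the shared variable declared volatile (my own variant, assuming one 32-thread block, meant to illustrate the point rather than be definitive):

__global__ void mykernel_volatile(int *globe)
{
    __shared__ volatile int i;    // volatile: every read/write of i goes to shared memory

    i = 25;
    i = i + threadIdx.x;          // each thread reloads i from shared memory and rewrites it

    if (threadIdx.x == 0)
        *globe = i;               // thread 0 reloads i, so it writes whatever value the
                                  // intra-warp write ordering left in shared memory
}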

Not sure if I was clear. Hope I was.

Here's the sample PTX code that is generated for the "global memory write" in the code above.

	mov.s32       $r3, 0;
	setp.ne.u32   $p1, $r1, $r3;
	@$p1 bra      $Lt_0_4;
	.loc 13 12 0
	ld.param.u32  $r4, [__cudaparm__Z8mykernelPi_globe];  // id:13 __cudaparm__Z8mykernelPi_globe+0x0
	st.global.s32 [$r4+0], $r2;                           // id:14
$Lt_0_4:
	.loc 13 14 0
	exit;

The predicated branch is for "if (threadIdx.x == 0)". The code after it loads the global memory address into r4 and writes the value of "r2" there. If you use volatile, "r2" will be reloaded from shared memory before being written to the global space.

How is it possible to define an "ordering" within a warp, when all threads in a warp are defined to execute simultaneously? Your example of writing to the same shared variable from within a warp (or among warps, it doesn't matter) is completely nonsensical, introducing race conditions. It doesn't matter what the hardware/compiler does in such a case, since the output is undefined.

Well, the example I quoted is definitely nonsensical. But I wrote it only to illustrate the problem.

But consider a case where there's an array of data and each thread in the WARP updates one element of this array.

Then, subsequently, threads read adjacent data items (in this case, the data from an adjoining thread), perform some computations, and write back results.

And so on. In such cases, you don't need __syncthreads() to enforce ordering among threads inside a WARP.
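Something like this sketch (my own illustration, assuming a single 32-thread warp per block and a volatile shared array, per the manual section I quoted):

__global__ void neighbour_pass(const int *in, int *out)
{
    __shared__ volatile int s[32];        // one element per thread of the warp

    s[threadIdx.x] = in[threadIdx.x];     // each thread updates its own element

    // Read the adjoining thread's element: with volatile, and within a single
    // warp, no __syncthreads() is needed for the write above to be observed.
    int left = (threadIdx.x == 0) ? 31 : threadIdx.x - 1;
    out[threadIdx.x] = s[threadIdx.x] + s[left];
}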

That was my point. Kindly look into the CUDA manual section that I had quoted.


The situation you describe will STILL need a __syncthreads(), because neighboring warps will be reading data from each other. Future hardware could potentially have smaller warp sizes, so you wouldn't want to write code that depends on the warp size being 32.

Is the __syncthreads() really that high an overhead in this situation anyway? Maybe I'll benchmark this little example later today to find out.

Really, the only point I'm trying to make is that it is easier, safer, and more natural to think of a block as a bunch of threads and develop algorithms and programs against that model, leaving the warps for the hardware to figure out. I understand completely what you are doing with volatile; it's just that I don't see it being applicable except in extremely odd circumstances.

I tested a simple kernel with and without __syncthreads() after every shared memory write:

__global__ void test_without_sync(int* g_idata, int* g_odata)
    {
    const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;
    extern __shared__ volatile int shared_v[];

    shared_v[threadIdx.x] = g_idata[idx];
    __syncthreads();

    int neigh = threadIdx.x - 1;
    if (neigh < 0)
        neigh = blockDim.x - 1;

    for (int i = 0; i < 10000; i++)
        {
        shared_v[threadIdx.x] += shared_v[neigh]*2;
        shared_v[threadIdx.x] += shared_v[neigh]*3;
        shared_v[threadIdx.x] += shared_v[neigh]*4;
        shared_v[threadIdx.x] += shared_v[neigh]*5;
        shared_v[threadIdx.x] += shared_v[neigh]*6;
        shared_v[threadIdx.x] += shared_v[neigh]*7;
        shared_v[threadIdx.x] += shared_v[neigh]*8;
        shared_v[threadIdx.x] += shared_v[neigh]*9;
        shared_v[threadIdx.x] += shared_v[neigh]*10;
        shared_v[threadIdx.x] += shared_v[neigh]*11;
        }

    g_odata[idx] = shared_v[threadIdx.x];
    }

The __syncthreads() calls add a ~20% overhead in my tests (code available if anyone really wants it). Of course, the results will be incorrect with any block size except the warp size. So there is a substantial performance benefit available in a contrived example. What real kernel would perform so many "sync" operations only in a 32-wide window, I don't know.

Any kernel that does something a little more substantial than a single add and multiply between shared-memory reads will see this overhead shrink significantly, obviously.

Yeah, true! Between WARPs you need __syncthreads() anyway. I agree with all that you say.

Thanks for your experiments!

Also, I have seen some posts that talk about kernels in which the last thread of the WARP alone does something different. Probably those apps could find a use for this.

Financial derivatives codes (binomial trees, trinomial trees, etc.) use this kind of adjacent-element processing and would benefit from running only one WARP (and of course many blocks to hide latencies), and so on.

And, you can have applications where you do a SWITCH based on the WARP_ID and perform totally different processing depending on it. Those apps can use this too.
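For instance, a rough sketch of that idea (the function name and the per-case work are placeholders of my own):

__global__ void per_warp_dispatch(float *a, float *b)
{
    int warp_id = threadIdx.x / warpSize;      // which warp of the block this thread is in

    switch (warp_id)
    {
    case 0:
        a[threadIdx.x] *= 2.0f;                // placeholder: one kind of work for warp 0
        break;
    case 1:
        b[threadIdx.x - warpSize] += 1.0f;     // placeholder: different work for warp 1
        break;
    default:
        break;                                 // remaining warps do something else, or nothing
    }
}

Since every thread of a warp takes the same case, there is no divergence within a warp here; the different paths are taken by different warps.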

Just that you can find a way to use it, if you want to use it. :-)