order of execution in a divergent warp

I’ve encountered a serious issue, and since all my attempts to solve it have failed, I have to ask for external assistance :) Basically, I have a warp where the first thread atomically decrements a variable and saves the result to shared memory, and according to this result all the other threads do some computation. The code looks pretty much like this:

__device__ void taskFinishSortSORT(const int& tid, const int& taskIdx, volatile Task* task, CUdeviceptr ppsBuf, const int start, const int end, volatile int& left, volatile int& right, const TaskType type)
{
	__shared__ volatile int sharedFinished[MaxBlockHeight]; // We need shared memory to distribute data from thread with tid == 0 to all the others in the same warp

	if(tid == 0)
	{
		int finished = atomicSub(&g_taskStack.tasks[taskIdx].unfinished, 1);
		sharedFinished[threadIdx.y] = finished;
	}

	if(sharedFinished[threadIdx.y] == 0) // Finish the whole sort
	{
		taskFinishSort(tid, taskIdx, task);
	}
}

What’s happening is that all threads from 1 to 31 execute the second if block before the first one (with tid == 0) is done, and I have no idea how to avoid it. Of course, when this happens the threads decide to do taskFinishSort when finished is not actually 0, and everything crashes and burns… taskFinishSort is a large function; I guess that’s important, since when I replace it with some dummy function (with just a return or a few lines of code) it all works fine. Also, it works OK in a debug build with debug info, but goes to hell in release. We also looked into the PTX, and it seemed to be compiled in the correct order — besides, the decision about which branch runs first should be made by the scheduler on the GPU, right?

So, what am I doing wrong? Thanks in advance for any advice :)

What about __syncthreads();? It will make all the threads wait until every thread reaches that point. Or maybe one of the threadfence functions (there is one for shared memory) will put up a barrier, so that all threads have to wait until the memory writes are finished.
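For illustration, here is a minimal sketch of how a barrier could be inserted into the function from the original post (identifiers like g_taskStack and MaxBlockHeight are taken from the question; this assumes every thread of the block reaches this code unconditionally, since __syncthreads() must not be executed in divergent branches):

```cuda
// Sketch only, based on the code in the question; not a verified fix.
__shared__ volatile int sharedFinished[MaxBlockHeight];

if(tid == 0)
{
	// atomicSub returns the value *before* the subtraction
	sharedFinished[threadIdx.y] = atomicSub(&g_taskStack.tasks[taskIdx].unfinished, 1);
}

__syncthreads(); // barrier: the write above is visible to all threads past this point

if(sharedFinished[threadIdx.y] == 0) // Finish the whole sort
{
	taskFinishSort(tid, taskIdx, task);
}
```

Note that this synchronizes the whole block, not just the warp, so it only helps if all threads of the block can legally reach the barrier.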

Is tid the same as threadIdx.x, or is it threadIdx.y*blockDim.x+threadIdx.x?

In the latter case it looks like your code wouldn’t work with more than one warp per block.

BTW, on compute capability 1.2 and higher you don’t need to go through shared memory:

__device__ void taskFinishSortSORT(const int& tid, const int& taskIdx, volatile Task* task, CUdeviceptr ppsBuf, const int start, const int end, volatile int& left, volatile int& right, const TaskType type)
{
	int finished = 0;

	if(threadIdx.x == 0)
		finished = (atomicSub(&g_taskStack.tasks[taskIdx].unfinished, 1) == 0);

	if(__any(finished)) // Finish the whole sort
	{
		taskFinishSort(tid, taskIdx, task);
	}
}

__syncthreads will, AFAIK, sync warps in a block, not threads within a warp… and yes, I tried it, it did not work. There’s __threadfence for making sure that your global and shared memory changes are visible to the device; that is not the problem here.

It’s the same; we have persistent threads (and if a block is used, it’s always [32, y]). As for your example, I’ll try whether it makes the compiler execute the finished = (atomicSub(&g_taskStack.tasks[taskIdx].unfinished, 1) == 0) before the if(__any(finished)), since the ordering of exactly that kind of pair is what’s going wrong now, and that is the real issue.

Is taskFinishSortSORT() ever called within conditional code? Or from within a loop where the number of iterations differs between the threads? Do threads return early from the kernel or from any of the (device) functions called from it?

No, no, and… no. I know it’s not very precise, but I’ve put a printf at the start of the function and then in some other places, and it shows that at the start all threads are together; then threads 1…31 do the second if, then thread 0 does the first if and then the second.

Well, apparently the problem was inlining: putting __noinline__ in front of taskFinishSort solved the problem. We’re not sure yet that everything is OK, but the simple test we had in place now works as expected.
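For reference, a sketch of where the qualifier goes (the signature is abbreviated from the code in the original post):

```cuda
// Sketch: __noinline__ asks nvcc to keep taskFinishSort as a real call
// instead of inlining its large body into the divergent branch, which in
// this thread changed where the compiler placed the reconvergence point.
__device__ __noinline__ void taskFinishSort(const int& tid, const int& taskIdx,
                                            volatile Task* task)
{
	// ... large function body from the original code ...
}
```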

Seems like the compiler isn’t putting the reconvergence point into the optimal position, and deinlining helps it find a better one. We’ve already checked the most common causes for this, though.

Which compute capability is your device? Is g_taskStack in shared memory?

Are there any (perhaps hidden) return or exit statements in your code at all, even if never taken? That would influence the choice of the reconvergence point as well.

CC 2.0, and g_taskStack is in global memory.

Well, we think there are no hidden returns, and there’s definitely no exit… What exactly do you mean by hidden — code that will never execute (in some run of the program, or ever) but exists nonetheless?

When saying ‘hidden’ I was thinking of occurrences inside other functions, like assert() or so. However, code that will never execute but exists nonetheless would be just as interesting, as it influences the compiler’s static analysis just like any other code.

I’m running out of ideas though. As I assume you don’t want to post your complete code here, we might have to leave it as it is, if nobody else comes up with a suggestion.

It would be possible to confirm the placement of the reconvergence point in the disassembled cubin (with cuobjdump -sass, then look for the operand of the SSY instruction); however, that would tell you nothing about why the compiler places it there.

Thanks for all your pointers and advice; we will try to check for the reconvergence point in the disassembled cubin. I’ve tried comparing the PTX output for a variant with a large function call against one with a small function call, and they were the same — but since both included an actual function call (and not an inlined function), I guess the PTX output does not include all optimizations… We tried that back when we did not yet know inlining was at fault.

The order of execution of sub-warps after a warp divergence is UNDEFINED.
Any assumptions about that order will result in programs that are not future-compatible.

This is what NVIDIA has been saying for a long time. HTH

If this were true then it would be illegal to ever use __syncthreads() again after a warp-divergence.

There have to be some reasonable guarantees about reconvergence, or the CUDA execution model wouldn’t work. Unfortunately, the PTX ISA manual contradicts itself on this subject: Section 8.5 (in both versions 2.3 and 3.0) says, just as you pointed out

Later, it states (emphasis mine)

The description of the bar.sync instruction of course clearly indicates that it behaves differently whether a warp is divergent or not.

So, to the letter, you are right. But the bold statement above (pun intended) would be wrong then.

The C Programming Guide on the other hand does not mention such a restriction of __syncthreads(). Appendix B.6 states

It does not mention that the use of __syncthreads() is limited even after the conditional code section. So one would assume that Nvidia has made sure their algorithm for inserting reconvergence points works well enough to always insert one after any conditional construct emitted by the compiler. No?

Any clarification by Nvidia employees?

Tera,

I never said anything about re-convergence.
I only said NVIDIA has stated earlier that “The ORDER of execution of sub-warps is UNDEFINED”.
I was just providing this info as “FYI” – if that would be of some help.

HTH,
Best Regards,
Sarnath

I hadn’t realized you were replying just to the thread title, not to the initial post.

What hasn’t been said in the original post is that inside the function taskFinishSort() there is more single-threaded code guarded by if(tid == 0) {…}. We didn’t think it important, but as it turns out, the compiler probably decides it would be best to merge these two blocks.
We have solved this issue with mikee111 by tricking the compiler into not optimizing the code. The gimmick is to use a different id in each single-thread condition across the entire call stack. This way the compiler cannot merge the blocks, and the code works as intended.

By the way, if you take this problem to the limit, there is no clean way to initialize shared memory with fewer than 32 threads. So I suppose that for the compiler the __syncthreads() marks the point of reconvergence.

for(int i=threadIdx.x; i<SHARED_MEM_CAPACITY; i += blockDim.x)
{
	sharedArray[i] = INIT_VALUE;
}

__syncthreads();

Won’t this initialize a shared array with fewer than 32 threads per block?