order of execution in a divergent warp

I’ve encountered a serious issue, and since all my attempts to solve it have failed, I have to ask for external assistance :) Basically, I have a warp where the first thread atomically decrements a variable and saves the result to shared memory, and according to this result all the other threads do some computation. The code looks pretty much like this:

__device__ void taskFinishSortSORT(const int& tid, const int& taskIdx, volatile Task* task, CUdeviceptr ppsBuf, const int start, const int end, volatile int& left, volatile int& right, const TaskType type)
{
	__shared__ volatile int sharedFinished[MaxBlockHeight]; // We need shared memory to distribute data from thread with tid == 0 to all the others in the same warp

	if(tid == 0)
	{
		int finished = atomicSub(&g_taskStack.tasks[taskIdx].unfinished, 1);
		sharedFinished[threadIdx.y] = finished;
	}

	if(sharedFinished[threadIdx.y] == 0) // Finish the whole sort
	{
		taskFinishSort(tid, taskIdx, task);
	}
}

What’s happening is that all threads from 1 to 31 execute the second if block before the first one (with tid == 0) is done, and I have no idea how to avoid it. Of course, when this happens the threads decide to do taskFinishSort when finished is not actually 0, and everything crashes and burns… taskFinishSort is a large function; I guess that’s important, since when I replace it with some dummy function (with just a return or a few lines of code) it all works fine. Also, it works OK in a debug build with debug info, but goes to hell in release. We also looked into the PTX, and it seemed to be compiled in the correct order — besides, the decision about which branch runs first should be made by the scheduler on the GPU, right?

So, what am I doing wrong? Thanks in advance for any advice :)

What about __syncthreads();? It will make all the threads wait until every thread reaches that point. Or maybe one of the threadfence functions (there is one for shared memory) will put up a barrier, so that all threads have to wait until the memory writes are finished.
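For illustration, here is a minimal sketch of how a barrier could be inserted into the function from the original post (identifiers like g_taskStack and MaxBlockHeight are taken from the question; this assumes every thread of the block reaches this code unconditionally, since __syncthreads() must not be executed in divergent branches):

```cuda
// Sketch only, based on the code in the question; not a verified fix.
__shared__ volatile int sharedFinished[MaxBlockHeight];

if(tid == 0)
{
	// atomicSub returns the value *before* the subtraction
	sharedFinished[threadIdx.y] = atomicSub(&g_taskStack.tasks[taskIdx].unfinished, 1);
}

__syncthreads(); // barrier: the write above is visible to all threads past this point

if(sharedFinished[threadIdx.y] == 0) // Finish the whole sort
{
	taskFinishSort(tid, taskIdx, task);
}
```

Note that this synchronizes the whole block, not just the warp, so it only helps if all threads of the block can legally reach the barrier.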

Is tid the same as threadIdx.x, or is it threadIdx.y*blockDim.x+threadIdx.x?

In the latter case it looks like your code wouldn’t work with more than one warp per block.

BTW, on compute capability 1.2 and higher you don’t need to go through shared memory:

__device__ void taskFinishSortSORT(const int& tid, const int& taskIdx, volatile Task* task, CUdeviceptr ppsBuf, const int start, const int end, volatile int& left, volatile int& right, const TaskType type)
{
	int finished = 0;

	if(threadIdx.x == 0)
		finished = (atomicSub(&g_taskStack.tasks[taskIdx].unfinished, 1) == 0);

	if(__any(finished)) // Finish the whole sort
	{
		taskFinishSort(tid, taskIdx, task);
	}
}

__syncthreads will, AFAIK, sync warps in a block, not threads within a warp… and yes, I tried it, it did not work. There’s __threadfence for making sure that your global and shared memory changes are visible to the device; that is not the problem here.

It’s the same; we have persistent threads (and if a block is used, it’s always [32, y]). As for your example, I’ll try whether it makes the compiler execute the finished = (atomicSub(&g_taskStack.tasks[taskIdx].unfinished, 1) == 0) before the if(__any(finished)), since the ordering of exactly that kind of pair is what’s going wrong now, and that is the real issue.

Is taskFinishSortSORT() ever called within conditional code? Or from within a loop where the number of iterations differs between the threads? Do threads return early from the kernel or from any of the (device) functions called from it?

No, no, and… no. I know it’s not very precise, but I’ve put a printf at the start of the function and then in some other places, and it shows that at the start all threads are together; then threads 1…31 do the second if, then thread 0 does the first if and then the second.

Well, apparently the problem was inlining: putting __noinline__ in front of taskFinishSort solved the problem. We’re not sure yet that everything is OK, but the simple test we had in place now works as expected.
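For reference, a sketch of where the qualifier goes (the signature is abbreviated from the code in the original post):

```cuda
// Sketch: __noinline__ asks nvcc to keep taskFinishSort as a real call
// instead of inlining its large body into the divergent branch, which in
// this thread changed where the compiler placed the reconvergence point.
__device__ __noinline__ void taskFinishSort(const int& tid, const int& taskIdx,
                                            volatile Task* task)
{
	// ... large function body from the original code ...
}
```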

Seems like the compiler isn’t putting the reconvergence point into the optimal position, and deinlining helps it find a better one. We’ve already checked the most common causes for this, though.

Which compute capability is your device? Is g_taskStack in shared memory?

Are there any (perhaps hidden) return or exit statements in your code at all, even if never taken? That would influence the choice of the reconvergence point as well.

CC 2.0, and g_taskStack is in global memory.

Well, we think there are no hidden returns, and there’s definitely no exit… What exactly do you mean by hidden — code that will never execute (in some run of the program, or ever) but exists nonetheless?

When saying ‘hidden’ I was thinking of occurrences inside other functions, like assert() or so. However, code that will never execute but exists nonetheless would be just as interesting, as it influences the compiler’s static analysis just like any other code.

I’m running out of ideas though. As I assume you don’t want to post your complete code here, we might have to leave it as it is, if nobody else comes up with a suggestion.

It would be possible to confirm the placement of the reconvergence point in the disassembled cubin (with cuobjdump -sass, then look for the operand of the SSY instruction); however, that would tell you nothing about why the compiler places it there.

Thanks for all your pointers and advice; we will try to check for the reconvergence point in the disassembled cubin. I’ve tried comparing the PTX output for a variant with a large function call against one with a small function call, and they were the same — but since both included an actual function call (and not an inlined function), I guess the PTX output does not include all optimizations… We tried that back when we did not yet know inlining was at fault.

The order of execution of sub-warps after a warp divergence is UNDEFINED.
Any assumptions about that order will result in programs that are not future-compatible.

This is what NVIDIA has been saying for a long time. HTH

If this were true then it would be illegal to ever use __syncthreads() again after a warp-divergence.

There have to be some reasonable guarantees about reconvergence, or the CUDA execution model wouldn’t work. Unfortunately, the PTX ISA manual contradicts itself on this subject: Section 8.5 (in both versions 2.3 and 3.0) says, just as you pointed out

Later, it states (emphasis mine)

The description of the bar.sync instruction of course clearly indicates that it behaves differently whether a warp is divergent or not.

So, to the letter, you are right. But the bold statement above (pun intended) would be wrong then.

The C Programming Guide on the other hand does not mention such a restriction of __syncthreads(). Appendix B.6 states

It does not mention that the use of __syncthreads() is limited even after the conditional code section. So one would assume that Nvidia has made sure their algorithm for inserting reconvergence points works well enough to always insert one after any conditional construct emitted by the compiler. No?

Any clarification by Nvidia employees?

Tera,

I never said anything about re-convergence.
I only said NVIDIA has stated earlier that “The ORDER of execution of sub-warps is UNDEFINED”.
I was just providing this info as “FYI” – if that would be of some help.

HTH,
Best Regards,
Sarnath

I hadn’t realized you were replying just to the thread title, not to the initial post.

What hasn’t been said in the original post is that inside the function taskFinishSort() there is more single-threaded code guarded by if(tid == 0) {…}. We didn’t think it important, but as it turns out, the compiler probably decides it would be best to merge these two blocks.
We have solved this issue with mikee111 by tricking the compiler into not optimizing the code. The gimmick is to use a different id in each single-thread condition across the entire call stack. This way the compiler cannot merge the blocks, and the code works as intended.

By the way, if you take this problem to the limit, there is no clean way to initialize shared memory with fewer than 32 threads. So I suppose that for the compiler the __syncthreads() marks the point of reconvergence.

for(int i=threadIdx.x; i<SHARED_MEM_CAPACITY; i += blockDim.x)
{
	sharedArray[i] = INIT_VALUE;
}

__syncthreads();

Won’t this initialize a shared array with fewer than 32 threads per block?