Failure to sync threads on Maxwell in a kernel that has thread branching

We have a CUDA application with a complicated kernel structure that involves thread branching to achieve MapReduce-like operations (we know this is not ideal in CUDA, but it is required by our application).

The issue is that the application runs fine on Fermi and Kepler GPUs, but fails to run on Maxwell. As far as we can tell, the failure is caused by threads not synchronizing.

The kernel has a structure like this:

kernel {
    tid = threadID;

    while (1) {
        if (tid == 0) {
            // do something
        } else {
            // do another thing
        }

        __syncthreads();

        if (some condition) {
            if (tid == 0) {
                // do something
            } else {
                // do another thing
            }

            __syncthreads();
        }
    } // end-while
}

Do all threads of a warp execute exactly the same number of while-loop iterations? You cannot have some threads execute the __syncthreads() while others don’t (i.e. threads that have already returned).
As to why it used to work before, I do not know.
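
For illustration, here is a minimal made-up kernel (not your code) showing the hazard I mean: thread 0 leaves the loop early, the remaining threads then wait at __syncthreads() for a thread that will never arrive, which is undefined behavior and may hang or silently misbehave depending on GPU and compiler:

__global__ void broken_early_exit(int *out, int n)
{
    __shared__ int buf[256];        // assumes blockDim.x <= 256
    int tid = threadIdx.x;
    buf[tid] = tid;

    for (int iter = 0; iter < n; ++iter) {
        if (tid == 0 && iter == 1)
            return;                 // thread 0 exits early ...

        buf[tid] += 1;
        __syncthreads();            // ... so the others wait here forever: undefined behavior
    }
    out[tid] = buf[tid];
}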

Thanks, Allleur,

A certain flag variable (in shared memory) is set, followed by a __syncthreads(), to indicate whether the kernel should return.

The same code runs fine on Fermi and Kepler with CUDA 4.2 and 5.0. However, it does not work on Maxwell with CUDA 6.5, and that is what we do not understand.

Some updates:

We have found something odd in our implementation: the code works when compiled for debugging with the ‘-g -G’ flags, but the same issue appears when it is compiled normally (release).

Any ideas?

A __syncthreads() in a divergent branch invokes undefined behavior, that is, anything can happen. On occasion, this means such code gives the appearance of working as intended, only to fail with a different CUDA version, different input data, or a different GPU. I am not saying that is what is happening here, as only skeleton code was posted, but it seems something like this could be going on.

It is of course entirely possible that you are running into a compiler bug of sorts, but given the maturity of the CUDA tool chain at this point I would say the likelihood of a bug in the kernel code is greater.

I am not sure whether recent versions of cuda-memcheck can detect incorrectly used barriers. In any event, as a minimum sanity check, I would suggest running the code with both the race-checker and memory-checker options for cuda-memcheck to see if it complains about anything. You may also want to run a cross-check, e.g., compile with CUDA 6.5 and run the resulting executable on Fermi and Kepler.

That is a big clue if you are using shared memory.

If you are using shared memory (your code sample is extremely vague and you did not use code blocks), try qualifying all of it as volatile, compile as release, and run.

It seems Shenjun posted an update while I was typing my previous reply.

I agree with CudaaduC that one plausible scenario could be a reduction involving shared memory that is written in the warp-synchronous idiom where the compiler re-orders shared memory loads because the shared memory operands involved do not use the volatile qualifier.

While it is fine to apply volatile to all shared memory operands temporarily for debugging purposes as suggested by CudaaduC, for production code the volatile modifier should be used sparingly and only where absolutely necessary as it can inhibit various compiler optimizations, in particular the scheduling of loads.
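
To make the scenario concrete, here is a sketch of the classic warp-synchronous reduction tail (hypothetical code, not necessarily what Shenjun has): without the volatile qualifier on the shared array, the compiler is free to keep intermediate sums in registers and reorder the loads, which breaks the implicit assumption that each store is immediately visible to the next statement.

__global__ void warp_sync_sum(const float *in, float *out)
{
    __shared__ volatile float sdata[64];              // volatile is the point here
    int tid = threadIdx.x;                            // assumes blockDim.x == 64
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    if (tid < 32) {                 // final warp reduces without further barriers
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}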

I encountered a similar problem, which was discussed in this thread:
https://devtalk.nvidia.com/default/topic/779068/possible-synchronization-bug-on-maxwell-

My problem was that conditional predicates like a && (b || c) caused the compiler to generate an irregular control flow graph (because if a is false, execution can jump directly to the else block). Irregular CFGs may cause threads in the same warp to execute the if/else blocks in a different order, even if the whole predicate evaluates to the same value for all threads. The __syncthreads() then causes problems because of the divergence.

I examined the SASS and confirmed that this situation (different execution order within the same warp) really occurs. The debug build evaluates this kind of conditional predicate in a more synchronized way, so it does not have this problem.
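
As an illustration of the workaround (a made-up kernel, not my real code), evaluating the compound condition into a plain value with non-short-circuit operators gives the branch around __syncthreads() a single, regular test:

__global__ void uniform_branch(const int *a, const int *b, const int *c, int *out)
{
    __shared__ int tally[256];                    // assumes blockDim.x <= 256
    int tid = threadIdx.x;

    // Evaluate a && (b || c) up front with bitwise operators so the
    // compiler cannot turn it into short-circuit control flow.
    int cond = (a[blockIdx.x] != 0) & ((b[blockIdx.x] != 0) | (c[blockIdx.x] != 0));

    if (cond) {                                   // uniform across the block
        tally[tid] = tid;
        __syncthreads();                          // safe: all threads take the branch or none do
        out[blockIdx.x * blockDim.x + tid] = tally[(tid + 1) % blockDim.x];
    }
}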

Thanks for your replies.

I have tested what you suggested, such as declaring the shared flag variable as volatile. However, the kernel still hangs with a normal (release) compilation, and still works when compiled in debug mode.

@mnnuahg: I don’t have conditional predicates like a && (b || c) in our code.

Thanks

what block dimensions are you using?

and are ‘something’ and ‘another thing’ in:

while (1) {
    if (tid == 0) {
        // do something
    } else {
        // do another thing
    }

functions or simply code sections (code blocks)?

depending on the time it takes to execute ‘something’ in the above, i really do not see why the above can not become:

while (condition)
{
    if (tid == 0)
    {
        // do something
    }

    __syncthreads();

    if (tid != 0)   // i.e. all the remaining threads
    {
        // do other thing
    }

    __syncthreads();
}

this should likely help to prevent using variables before their values are set properly

Look at the “if (some condition)” line.

From your loop structure it’s likely you’re doing work based on that condition until the condition becomes true. But that means there can be a race between warps where one warp has updated data and other warps may not have read it yet inside their if() test. This all depends on your actual code of course… maybe the if() statement is simple and uniform across the block, but I mention it because that single if() is a very dangerous code smell.

If the if test does depend on data you’re going to modify, you could remove the race by computing the test and storing it in a boolean, syncthreads again, then branch based on the stored boolean.
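
A minimal sketch of what I mean (hypothetical names, assuming the condition lives in shared memory): thread 0 writes the flag, a barrier publishes it, every thread snapshots it into a local, and a second barrier protects the flag from being overwritten before everyone has read it.

__global__ void snapshot_condition(int *data, int limit)
{
    __shared__ int flag;

    while (true) {
        if (threadIdx.x == 0)
            flag = (data[blockIdx.x] < limit);   // placeholder for the real test
        __syncthreads();                         // publish the freshly written flag

        bool take = (flag != 0);                 // every thread snapshots the same value
        __syncthreads();                         // nobody may overwrite flag before all have read it

        if (!take)
            break;                               // uniform decision: the block leaves together

        if (threadIdx.x == 0)
            data[blockIdx.x] += 1;               // placeholder for the real per-iteration work
        __syncthreads();                         // work results visible before the next test
    }
}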

Yes, just like this; I have updated my code to separate the thread branching using __syncthreads().

In more detail, the code structure is like:

shared variable flag;
shared variable shouldQuit;

while (1)
{
    if (threadIdx.x == 0) {
        // code block
        if (condition) {
            // update flag to true
        }
    }

    __syncthreads();

    if (flag) {

        while (i from 1 to 10) {

            // all threads do some thing

            if (threadIdx.x == 0) {
                // reduction calculation

                if (condition) {
                    shouldQuit = true;
                }
            }

            if (shouldQuit) { return; }
        }
    }
}

I am using 16 or 32 threads per block; could that also lead to the race issue?

i am rather confident that your problem lies in the while loop you have outlined just now

more specifically, between

if (condition) {
    shouldQuit = true;
}

by threadIdx.x == 0

and

if (shouldQuit) { return; }
}

}

if i were you, i would put a __syncthreads() in between shouldQuit being updated, and shouldQuit being read

with a debug build this should work, but with a release build likely not - one of the things __syncthreads() does, if i am not mistaken, is mark the point by which shared/global memory writes must become visible to other threads
under a debug build the compiler can hardly reorder anything, so that the code remains traceable; under a release build there is little (no) such directive, hence the compiler may be of the opinion that postponing the write to shouldQuit is more optimal, such that threadIdx.x == 0 essentially ‘holds on to’ shouldQuit

something like that
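
a minimal sketch of what i mean (placeholder work, not your actual code): the barrier between writing shouldQuit and reading it both publishes the write and keeps the early return uniform across the block

__global__ void quit_with_barrier(int *work, int limit)
{
    __shared__ int shouldQuit;

    if (threadIdx.x == 0)
        shouldQuit = 0;
    __syncthreads();

    while (true) {
        // ... all threads do some work here (placeholder) ...

        if (threadIdx.x == 0) {
            work[blockIdx.x] += 1;               // placeholder "reduction"
            if (work[blockIdx.x] >= limit)       // placeholder termination test
                shouldQuit = 1;
        }

        __syncthreads();                         // publish shouldQuit before anyone reads it

        if (shouldQuit)
            return;                              // uniform decision: safe early return
    }
}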

Hi, Jimmy,

Yes. I do put __syncthreads() between the code section you mentioned. Technically, all the threads are sync’ed after the 0th thread in a block updates some variables.

to the extent that "while(i from 1 to 10){" implies that "all threads do some thing" is looped 10 times, it is debatable whether "technically, all the threads are sync'ed after the 0th thread in a block updates some variables" holds without a __syncthreads()

come to think of it, if "all threads do some thing" writes to (shared, etc.) variables that are subsequently referenced by "if(threadIdx.x == 0){ //reduction calculation" to set 'condition', you may need a __syncthreads() between "all threads do some thing" and "if(threadIdx.x == 0){ //reduction calculation" as well, for the same reasons mentioned before

you need to guide the compiler as to when shared/ global writes must be visible to other threads (at the very latest), or it is free to assume “whenever”

in my mind, a write to a variable by a thread and a subsequent read of the same variable by another thread, should be separated by a __syncthreads() or similar mechanism that can ensure visibility, especially with release builds that yield certain rights, liberties, privileges to the compiler, for the sake of utilitarian optimization
you are of course welcome to prove that i am wrong to hold such a belief
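
to make the placement concrete, a hedged reconstruction of your loop with the barriers i have in mind (all work, sizes and conditions are placeholders, not your actual code):

__global__ void map_reduce_loop(float *partial, int *counter, int limit)
{
    __shared__ int flag;
    __shared__ int shouldQuit;
    __shared__ float scratch[32];                // assumes 16 or 32 threads per block

    if (threadIdx.x == 0) { flag = 0; shouldQuit = 0; }
    __syncthreads();

    while (true) {
        if (threadIdx.x == 0) {
            counter[blockIdx.x] += 1;            // placeholder bookkeeping
            flag = (counter[blockIdx.x] < limit);// placeholder condition
        }
        __syncthreads();                         // publish flag

        if (!flag)
            return;                              // uniform exit for the whole block

        for (int i = 0; i < 10; ++i) {
            scratch[threadIdx.x] = threadIdx.x + i;   // "map": all threads write
            __syncthreads();                     // map results visible before the reduction

            if (threadIdx.x == 0) {
                float sum = 0.0f;
                for (int t = 0; t < blockDim.x; ++t)  // serial "reduction" by thread 0
                    sum += scratch[t];
                partial[blockIdx.x] = sum;
                if (sum > 1.0e6f)                // placeholder termination test
                    shouldQuit = 1;
            }
            __syncthreads();                     // publish shouldQuit, protect scratch

            if (shouldQuit)
                return;                          // again uniform across the block
        }
    }
}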

Thanks, Jimmy,

I do understand what you mentioned about the requirement of using __syncthreads() between the map and reduction code blocks, and we do put __syncthreads() at those positions.

I believe that our issue is not simply caused by a missing __syncthreads(), but by some deeper reason related to compiler optimization.

Let me know if you have further thoughts.

what does “but fails to run on Maxwell” mean exactly?

does the kernel crash/ run forever, or does it simply output wrong values?

run memcheck and racecheck on the debug build, and then the release build

see if the release build works when its level of optimization is set the same as the debug build (level 0, i believe), or whether it fails even then

dump the assembly of the release build for each level of optimization increment, and compare to that of the debug build

does the kernel make use of functions?

i equally presume your driver actually supports maxwell…?