cuda-memcheck synccheck tool detects unexpected barrier errors on Volta GPU

Hi there,

Running the example code below under the cuda-memcheck synccheck tool on a Volta GPU (TITAN V) reports the barrier error shown below (for all threads of the first warp). Running it on a Kepler GPU (TITAN Black) does not. The code obviously has divergent threads, but __syncthreads() is called outside the conditional. Is there something relevant to be aware of here? I am having trouble with some critical-section code that stopped working on the Volta GPU (I haven’t been able to reproduce the bug in a minimal example yet), and I’m wondering whether I am missing something.

#include <cstdio>
#include <cassert>

#define N 1024

__device__ int* array;

__global__ void sync_kernel(){
    int tid = threadIdx.x;
    if (tid == 0){
        array = new int[N];   // allocation only in thread 0
    }
    __syncthreads();  // <- barrier error detected here
    array[tid] = tid;
}

__global__ void check_kernel(){
    int tid = threadIdx.x;
    assert(array[tid] == tid);
}

int main(){
    sync_kernel<<<1, N>>>();
    check_kernel<<<1, N>>>();
    cudaDeviceSynchronize();  // wait for the kernels so a failed assert is reported
}
========= Barrier error detected. Divergent thread(s) in block
=========     at 0x00000130 in .../test_synccheck.cu:13:sync_kernel(void)
=========     by thread (1,0,0) in block (0,0,0)
=========     Device Frame: .../test_synccheck.cu:13:sync_kernel(void) (sync_kernel(void) : 0x130)

Do you still get the barrier error when you leave out the dynamic memory allocation with new int[N]?

Maybe the heap allocation introduces an implicit synchronization inside the conditional block.
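
For example, something along these lines (an untested sketch) keeps the divergent branch but allocates the buffer from the host, so that no device-side allocation happens before the barrier:

#define N 1024

__device__ int* array;

__global__ void sync_kernel(){
    int tid = threadIdx.x;
    if (tid == 0){
        array[0] = -1;       // divergent branch kept, but no allocation in it
    }
    __syncthreads();         // does synccheck still flag this barrier?
    array[tid] = tid;
}

int main(){
    int* buf = nullptr;
    cudaMalloc(&buf, N * sizeof(int));              // allocate from the host instead
    cudaMemcpyToSymbol(array, &buf, sizeof(int*));  // publish the pointer to the __device__ symbol
    sync_kernel<<<1, N>>>();
    cudaDeviceSynchronize();
    cudaFree(buf);
}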

Christian

The code looks so straightforward I doubt it’s wrong.

Just brainstorming for your bug report to NVIDIA: does it still report an error if you use malloc() instead of C++ new?
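
Something like this, for instance (just a sketch of the variant I mean; it reuses array and N from your example):

// Same repro, but with device-side malloc() instead of C++ new,
// to see whether synccheck still reports the barrier error.
__global__ void sync_kernel_malloc(){
    int tid = threadIdx.x;
    if (tid == 0){
        array = (int*)malloc(N * sizeof(int));   // device heap allocation via malloc
    }
    __syncthreads();
    array[tid] = tid;
}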

How about trying the (very very similar) example code right from the Programming Guide?
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#per-thread-block-allocation
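
From memory, that example is roughly along these lines (paraphrased, so details may differ; the link above has the authoritative version):

// Per-thread-block allocation: one thread allocates for the whole block and
// shares the pointer through shared memory.
__global__ void mallocTest(){
    __shared__ int* data;

    if (threadIdx.x == 0)
        data = (int*)malloc(blockDim.x * 64 * sizeof(int));
    __syncthreads();

    // All threads bail out if the allocation failed.
    if (data == NULL)
        return;

    // Each thread writes its own slice of the buffer.
    for (int i = 0; i < 64; ++i)
        data[i * blockDim.x + threadIdx.x] = threadIdx.x;

    // Make sure everyone is done before a single thread frees the buffer.
    __syncthreads();
    if (threadIdx.x == 0)
        free(data);
}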

Technically cuda-memcheck might be correct here.

The result of an exception from new() is undefined. What it actually does is terminate the current thread.

I don’t think NVIDIA has ever specified whether using __syncthreads() when some threads have already exited is allowed (correct me if I am wrong here!). It has just always worked so far (bar maybe problems in early CUDA releases). So cuda-memcheck flagging this might be the right thing to do.

What we probably want is for NVIDIA to be a bit more specific in its specifications.
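
To be concrete, this is the kind of pattern I mean (a hypothetical, minimal sketch, nothing more): part of the block exits before the barrier, and only the surviving threads call __syncthreads().

__global__ void early_exit_kernel(int* out){
    __shared__ int buf[64];
    int tid = threadIdx.x;
    if (tid >= 64){
        return;                    // these threads exit before the barrier
    }
    buf[tid] = tid;
    __syncthreads();               // only the 64 surviving threads arrive here
    out[tid] = buf[63 - tid];      // read a value another surviving thread wrote
}

int main(){
    int* out = nullptr;
    cudaMalloc(&out, 64 * sizeof(int));
    early_exit_kernel<<<1, 128>>>(out);   // 128 threads launched, 64 exit early
    cudaDeviceSynchronize();
    cudaFree(out);
}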

Thanks for the replies.

@SPWorley
In the example from the Programming Guide, the same barrier error is detected.

@cbuchner1
I constructed a similar example without any memory allocation in the diverging path (below), and the synccheck tool detects no error.

And by the way: I found the bug in my critical-section code. It had nothing to do with synchronization problems, just a missing volatile qualifier on a data pointer… it looks like the Kepler GPU didn’t cache the pointer value, but the Volta GPU did.

#include <cstdio>
#include <cassert>

#define N 1024

__device__ int* array;

__global__ void init_kernel(){
    int tid = threadIdx.x;
    if (tid == 0){
        array = new int[N];
    }
}

__global__ void sync_kernel(){
    int tid = threadIdx.x;
    __shared__ volatile int value;
    if (tid == 0){
        value = -1;           // divergent path, but no memory allocation
    }
    __syncthreads();          // no barrier error reported here
    array[tid] = value;
}

__global__ void check_kernel(){
    int tid = threadIdx.x;
    assert(array[tid] == -1);
}

int main(){
    init_kernel<<<1, 1>>>();
    sync_kernel<<<1, N>>>();
    check_kernel<<<1, N>>>();
    cudaDeviceSynchronize();  // wait for the kernels so a failed assert is reported
}

I’ve filed an internal bug for the barrier report from synccheck.

The development team has taken a look at the issue. It appears to be a tools issue and not reflective of any problem or actual defect in the code itself. As SPWorley said, that seems fairly evident just from looking at the code.

I think for the time being it is safe to say this is a spurious report from synccheck for this case.

The indication at this time is that this will be fixed in a future CUDA release but I don’t have any further details beyond that.

Thanks @Robert_Crovella for clarifying.

In tera’s post from 10/25/2018, there was the remark that NVIDIA never really specified whether it is legal to synchronize only those threads that have not yet exited. That remark appears to have gone unanswered.
I have a related problem, in which the last two stages of the algorithm need only half of the threads assigned to the kernel. So I have:

Version A:

if( threadIdx.x < blockDim.x / 2 )
{
    … // codeblock 1
    __syncthreads();
    … // codeblock 2
    __syncthreads();
}

The results are always correct, but the synccheck tool complains. Of course, I can cut this snippet in two:

Version B:

if( threadIdx.x < blockDim.x / 2 )
{
    … // codeblock 1
}
__syncthreads();
if( threadIdx.x < blockDim.x / 2 )
{
    … // codeblock 2
}
__syncthreads();

Now synccheck is happy, but I am less so. The loss of efficiency is only very minor, but readability suffers as well.
So the question is: can I safely stay with Version A?