Running the example code below under the cuda-memcheck synccheck tool on a Volta GPU (TITAN V) detects the barrier error shown further down (for all threads of the first warp). Running it on a Kepler GPU (TITAN Black) does not. Obviously the code has divergent threads, but __syncthreads() is called outside the conditional. Is there something relevant to be aware of here? I am having issues with some critical-section code that stopped working on the Volta GPU (I haven't been able to reproduce that bug in a minimal example yet), and I'm wondering whether I am missing something.
#include <cstdio>
#include <cassert>

#define N 1024

__device__ int* array;

__global__ void sync_kernel() {
    int tid = threadIdx.x;
    if (tid == 0) {
        array = new int[N];
    }
    __syncthreads();   // <- barrier error detected here
    array[tid] = tid;
}

__global__ void check_kernel() {
    int tid = threadIdx.x;
    assert(array[tid] == tid);
}

int main() {
    sync_kernel<<<1, N>>>();
    check_kernel<<<1, N>>>();
    cudaDeviceSynchronize();   // wait for the kernels so any assert failure is reported
}
========= Barrier error detected. Divergent thread(s) in block
========= at 0x00000130 in .../test_synccheck.cu:13:sync_kernel(void)
========= by thread (1,0,0) in block (0,0,0)
========= Device Frame: .../test_synccheck.cu:13:sync_kernel(void) (sync_kernel(void) : 0x130)
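For reference, a report like this is produced by running the binary under the synccheck tool with a command along these lines (the executable name is just an assumption):

cuda-memcheck --tool synccheck ./test_synccheck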
The result of an exception from new() (i.e., a failed allocation) is undefined. What it actually does in practice is terminate the current thread.
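Since the allocation in sync_kernel is never checked for failure, a defensive variant (just a sketch, not the code from the original post) would test the pointer before any thread dereferences it:

// Sketch: relies on the same globals (__device__ int* array, N) as the example above.
__global__ void sync_kernel_checked() {
    int tid = threadIdx.x;
    if (tid == 0) {
        // Device-side new returns a null pointer if the device heap is
        // exhausted; the heap can be enlarged from the host with
        // cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes).
        array = new int[N];
    }
    __syncthreads();               // publish the pointer to the whole block
    if (array == nullptr) return;  // every thread sees the same global pointer
    array[tid] = tid;
}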
I don’t think NVIDIA has ever specified whether using __syncthreads() when some threads have already exited is allowed (correct me if I am wrong here!). It has simply worked so far (barring perhaps problems in early CUDA releases), so cuda-memcheck flagging it might be the right thing to do.
What we probably want is for NVIDIA to be a bit more specific in the documentation.
@SPWorley
In the example from the Programming Guide the same Barrier error is detected.
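Presumably that refers to the per-thread-block allocation example in the Programming Guide's section on dynamic global memory allocation, which uses the same pattern (one thread allocates, the whole block waits at a barrier). Roughly:

__global__ void mallocTest() {
    __shared__ int* data;

    // The first thread in the block allocates and publishes the pointer
    // through shared memory so the whole block can use it.
    if (threadIdx.x == 0)
        data = (int*)malloc(blockDim.x * 64 * sizeof(int));  // 64 ints per thread
    __syncthreads();

    if (data == NULL)   // the whole block sees the same pointer
        return;

    // Each thread writes into its slice of the per-block buffer.
    int* ptr = data;
    for (int i = 0; i < 64; ++i)
        ptr[i * blockDim.x + threadIdx.x] = threadIdx.x;

    // Make sure all threads are done before one thread frees the buffer.
    __syncthreads();
    if (threadIdx.x == 0)
        free(data);
}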
@cbuchner1
I constructed a similar example without any memory allocation in the diverging path (below), and no error is detected by the synccheck tool.
And by the way: I found the bug in my critical-section code. It had nothing to do with sync problems, just a missing volatile qualifier on a data pointer… it looks like the Kepler GPU didn’t cache the pointer value, but the Volta GPU did.
#include <cstdio>
#include <cassert>

#define N 1024

__device__ int* array;

__global__ void init_kernel() {
    int tid = threadIdx.x;
    if (tid == 0) {
        array = new int[N];
    }
}

__global__ void sync_kernel() {
    int tid = threadIdx.x;
    __shared__ volatile int value;
    if (tid == 0) {
        value = -1;
    }
    __syncthreads();   // no error reported here
    array[tid] = value;
}

__global__ void check_kernel() {
    int tid = threadIdx.x;
    assert(array[tid] == -1);
}

int main() {
    init_kernel<<<1, 1>>>();
    sync_kernel<<<1, N>>>();
    check_kernel<<<1, N>>>();
    cudaDeviceSynchronize();   // wait for the kernels so any assert failure is reported
}
The development team has taken a look at the issue. It appears to be a tools issue, not a reflection of any problem or actual defect in the code. As SPWorley said, that seems fairly evident just from looking at the code.
I think for the time being it is safe to say this is a spurious report from synccheck for this case.
The indication at this time is that this will be fixed in a future CUDA release but I don’t have any further details beyond that.
In tera’s post from 10/25/2018, there was the remark that NVIDIA did not really specify whether it is legal to synchronize only those threads which have not yet exited. That remark appears not to have been answered.
I have a related problem, in which the last two stages of the algorithm need only half of the threads assigned to the kernel. So I have two versions, sketched below.
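Purely as an assumed reconstruction (the names, block size M and the reduction-style stages are illustrative, not the original code), the two variants presumably look something like this: Version A lets the threads that are no longer needed exit early, while Version B keeps every thread alive behind if guards so that all threads reach every barrier.

#define M 256   // assumed block size

// Version A: the threads not needed for the last two stages exit early;
// the remaining threads still synchronize among themselves.
__global__ void version_a(float* out) {
    __shared__ float buf[M];
    int tid = threadIdx.x;
    buf[tid] = tid;                      // stand-in for the earlier stages
    __syncthreads();

    if (tid >= M / 2) return;            // half the threads leave the kernel

    buf[tid] += buf[tid + M / 2];        // stage n-1 (half the threads)
    __syncthreads();                     // <- reported by synccheck
    out[tid] = buf[tid] + buf[(tid + 1) % (M / 2)];   // stage n
}

// Version B: all threads stay until the end; the last two stages are
// guarded by if statements so every thread reaches every barrier.
__global__ void version_b(float* out) {
    __shared__ float buf[M];
    int tid = threadIdx.x;
    buf[tid] = tid;
    __syncthreads();

    if (tid < M / 2)
        buf[tid] += buf[tid + M / 2];
    __syncthreads();                     // reached by the full block
    if (tid < M / 2)
        out[tid] = buf[tid] + buf[(tid + 1) % (M / 2)];
}

int main() {
    float* out;
    cudaMalloc(&out, (M / 2) * sizeof(float));
    version_a<<<1, M>>>(out);
    version_b<<<1, M>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}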
With Version B, synccheck is happy, but I am not so much: the loss of efficiency is only very minor, but readability suffers as well.
So the question is: Can I safely stay with Version A?