__syncthreads() not a subset of cudaDeviceSynchronize()?

I run a single, default-stream kernel that ends:

	int pass;

	if(((uint64_t*)end)[0] < 0x7FFFEFEEB58EULL){
		pass = binary_fuse8_contain(((uint64_t*)end)[0], &filt0);
	}
	if(pass){
		reinterpret_cast<uint2*>(ends)[atomicAdd(ctr, 1)] = reinterpret_cast<uint2*>(end)[0];
	}
	__syncthreads();
}

This kernel is immediately followed by:

    cudaDeviceSynchronize();
    int e6count;
    cudaMemcpy(&e6count, ctr, 4, cudaMemcpyDeviceToHost);
    printf("%08d\n", e6count);

I understood that cudaDeviceSynchronize() waits for kernel completion, and so the __syncthreads() at the end of the kernel would not be required.

However, without it the value of e6count varies between runs, by a small amount.
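
For reference, here is the host-side sequence above with basic error checking added. This is only a minimal sketch, not my actual code: checkCuda() and reportCount() are hypothetical helpers, and ctr is the device counter from the kernel.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    static void checkCuda(cudaError_t err, const char *what)
    {
        if (err != cudaSuccess) {
            fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

    // Called immediately after the kernel launch on the default stream.
    static void reportCount(const int *ctr)      // ctr: device pointer to the counter
    {
        checkCuda(cudaGetLastError(), "kernel launch");          // launch-time errors
        checkCuda(cudaDeviceSynchronize(), "kernel execution");  // asynchronous errors
        int e6count = 0;
        // cudaMemcpy on the default stream is ordered after the kernel and blocks
        // the host, so it also waits for kernel completion before copying.
        checkCuda(cudaMemcpy(&e6count, ctr, sizeof(int), cudaMemcpyDeviceToHost),
                  "counter copy");
        printf("%08d\n", e6count);
    }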

Edit: SM 6.1 and CUDA 10.2.

__syncthreads() is a block-wide execution barrier for device threads (i.e., for each threadblock).

I wouldn’t ordinarily connect it in any way with cudaDeviceSynchronize(), which has different semantics.

It’s not obvious to me that they are connected at all based on what I see here.

I am generally skeptical of the need for __syncthreads() as the last line of kernel code (outside of any loop, etc.). I’m not sure what purpose that could serve, semantically. I’ve seen it numerous times, and it makes no sense to me.
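
For contrast, here is the kind of pattern where a __syncthreads() is actually required: threads subsequently read shared-memory data written by other threads in the same block. This is just a sketch (a simple block reduction, assuming blockDim.x == 256), not anything taken from your code:

    __global__ void blockSum(const int *in, int *out)
    {
        __shared__ int s[256];
        int tid = threadIdx.x;
        s[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                         // needed: the reads below cross threads
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) s[tid] += s[tid + stride];
            __syncthreads();                     // needed on every iteration
        }
        if (tid == 0) out[blockIdx.x] = s[0];
        // A __syncthreads() placed here, after the last shared-memory access,
        // orders nothing useful and could simply be removed.
    }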

I guess the crux of your question is why does e6count vary from run to run, iff the __syncthreads() is removed. I can’t explain that and probably wouldn’t be able to without a reproducer.

Bugs are always possible; you might want to try a newer CUDA version. And perhaps your code has a race condition of some sort. The presence of __syncthreads() can certainly change the behavior of broken code.
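
To illustrate that last point, here is a sketch (not your code, and assuming blockDim.x == 256) of a block-level race that a trailing __syncthreads() does not fix; the barrier can still shift the timing of the warps and so make the broken results vary differently from run to run:

    __global__ void broken(int *out)
    {
        __shared__ int s[256];
        s[threadIdx.x] = threadIdx.x;
        // A __syncthreads() is missing here: for blockDim.x > 32, the read below
        // can race with the write performed by a thread in another warp.
        out[threadIdx.x] = s[(threadIdx.x + 1) % blockDim.x];
        __syncthreads();   // too late to order the accesses above
    }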

Thanks Robert. The CUDA Runtime API documentation for cudaDeviceSynchronize() states:
“Blocks until the device has completed all preceding requested tasks.” My belief was therefore that all work would be complete before e6count is queried.

The kernel prior to the code shown is fairly straightforward, if somewhat warp-divergent at this point (early-days testing). There are no dependencies between threads. 64k threads each take the same input and perform a series of hashing and reduction operations, applied recursively based on the thread ID: tid 0 passes the result straight through, tid 65535 has looped 65535 times. It then reaches the filter stage above.

Within a warp, this is obviously not friendly for the last 32 loops, as each thread completes one loop apart, but I'm working on that.
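
Roughly, the per-thread structure is the sketch below (just an illustration: hash_reduce() and search() are placeholder names, not my actual functions, and the filter/atomicAdd stage from the first snippet follows the loop):

    #include <cstdint>

    // Placeholder for the real hashing/reduction step (purely illustrative).
    __device__ uint64_t hash_reduce(uint64_t v)
    {
        return (v * 0x9E3779B97F4A7C15ULL) ^ (v >> 29);
    }

    __global__ void search(uint64_t seed)        // ends/ctr arguments omitted here
    {
        unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. 65535
        uint64_t v = seed;                       // every thread starts from the same input
        for (unsigned i = 0; i < tid; ++i)       // tid 0 passes straight through,
            v = hash_reduce(v);                  // tid 65535 loops 65535 times
        // ... the filter / atomicAdd stage from the first snippet follows here ...
    }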

I take your points regarding bugs. Nsight Compute's SM 6.1 compatibility, and to some degree performance drops on later CUDA versions, keep me on 10.2 for now.

I concur with @Robert_Crovella that a call to __syncthreads() at top level at the end of a kernel serves no useful purpose, that is, it is both superfluous and confusing. I cannot explain the observation based on the information that has been provided so far.