Cooperative Group sync causing cuda-memcheck error (Debug mode ONLY)

I’ve been working proactively to write all my CUDA kernels using the Cooperative Groups in CUDA 9.0. Everything has been working great and the code executes smoothly. As I begin to use cuda-memcheck to check for memory issues I have found some odd behavior. Whenever I run the following

cuda-memcheck --tool racecheck ./my_program

on code compiled with the Debug flag it errors out on the following cudaDeviceSynchronize. Error below

CUDA error at ../src/Particle_BPF_GPU_impl.h:347 code=77(cudaErrorIllegalAddress) "cudaDeviceSynchronize( )" 
MC:1 CUDA error at ../src/Particle_BPF_GPU_impl.h:347 code=77(cudaErrorIllegalAddress) "cudaDeviceSynchronize( )"

When I run cuda-memcheck on my Release code I have no issue. Below is a snippet of code

template<typename T>
__global__ void __launch_bounds__(kTPB) addInitialState( const int n, T * __restrict__ m_devParticleState,
		const T * m_devRandomNumbers ) {

	const auto block = cg::this_thread_block( );

	typedef cub::CacheModifiedInputIterator<cub::LOAD_LDG, T> InputItr;
	typedef cub::BlockLoad<T, kTPB, kSysDim, cub::BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;
	typedef cub::BlockStore<T, kTPB, kSysDim, cub::BLOCK_STORE_WARP_TRANSPOSE> BlockStore;

	__shared__ union TempStorage {
		typename BlockLoad::TempStorage load;
		typename BlockStore::TempStorage store;
	} temp_storage;

	auto tid = blockIdx.x * blockDim.x + threadIdx.x;
	auto loop = blockIdx.x * blockDim.x * kSysDim;
	const auto gridSize = blockDim.x * gridDim.x;
	T threadData[kSysDim] { };
	T randomNumbers[kSysDim] { };

	while ( tid < n ) {

		BlockLoad( temp_storage.load ).Load( InputItr( m_devRandomNumbers + loop ), randomNumbers );
		block.sync( );  //<- No issues with replaced with __syncthreads() (Debug ONLY);

#pragma unroll kSysDim
		for ( auto i = 0; i < kSysDim; i++ ) {
			threadData[i] = 0.0f;
#pragma unroll kSysDim
			for ( auto j = 0; j < kSysDim; j++ ) {
				threadData[i] += cProcessNoiseCov[i * kSysDim + j] * randomNumbers[j];
			threadData[i] += cInitialState[i];

		BlockStore( ).Store( m_devParticleState + loop, threadData );
		block.sync( );  //<- No issues with replaced with __syncthreads() (Debug ONLY);

		tid += gridSize;
		loop += gridSize * kSysDim;

The error received when executed on Debug mode goes away when


is replaced with



It would nice to know if what I’m seeing is normal or am I implementing something wrong.