Cooperative Group sync causing cuda-memcheck error (Debug mode ONLY)

manicely6005 · January 25, 2018, 12:15am

I’ve been working proactively to write all my CUDA kernels using the Cooperative Groups in CUDA 9.0. Everything has been working great and the code executes smoothly. As I begin to use cuda-memcheck to check for memory issues I have found some odd behavior. Whenever I run the following

cuda-memcheck --tool racecheck ./my_program

on code compiled with the Debug flag it errors out on the following cudaDeviceSynchronize. Error below

CUDA error at ../src/Particle_BPF_GPU_impl.h:347 code=77(cudaErrorIllegalAddress) "cudaDeviceSynchronize( )" 
MC:1 CUDA error at ../src/Particle_BPF_GPU_impl.h:347 code=77(cudaErrorIllegalAddress) "cudaDeviceSynchronize( )"

When I run cuda-memcheck on my Release code I have no issue. Below is a snippet of code

template<typename T>
__global__ void __launch_bounds__(kTPB) addInitialState( const int n, T * __restrict__ m_devParticleState,
		const T * m_devRandomNumbers ) {

	const auto block = cg::this_thread_block( );

	typedef cub::CacheModifiedInputIterator<cub::LOAD_LDG, T> InputItr;
	typedef cub::BlockLoad<T, kTPB, kSysDim, cub::BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;
	typedef cub::BlockStore<T, kTPB, kSysDim, cub::BLOCK_STORE_WARP_TRANSPOSE> BlockStore;

	__shared__ union TempStorage {
		typename BlockLoad::TempStorage load;
		typename BlockStore::TempStorage store;
	} temp_storage;

	auto tid = blockIdx.x * blockDim.x + threadIdx.x;
	auto loop = blockIdx.x * blockDim.x * kSysDim;
	const auto gridSize = blockDim.x * gridDim.x;
	T threadData[kSysDim] { };
	T randomNumbers[kSysDim] { };

	while ( tid < n ) {

		BlockLoad( temp_storage.load ).Load( InputItr( m_devRandomNumbers + loop ), randomNumbers );
		block.sync( );  //<- No issues with replaced with __syncthreads() (Debug ONLY);

#pragma unroll kSysDim
		for ( auto i = 0; i < kSysDim; i++ ) {
			threadData[i] = 0.0f;
#pragma unroll kSysDim
			for ( auto j = 0; j < kSysDim; j++ ) {
				threadData[i] += cProcessNoiseCov[i * kSysDim + j] * randomNumbers[j];
			}
			threadData[i] += cInitialState[i];
		}

		BlockStore( temp_storage.store ).Store( m_devParticleState + loop, threadData );
		block.sync( );  //<- No issues with replaced with __syncthreads() (Debug ONLY);

		tid += gridSize;
		loop += gridSize * kSysDim;
	}
}

The error received when executed on Debug mode goes away when

block.sync()

is replaced with

__syncthreads()

.

It would nice to know if what I’m seeing is normal or am I implementing something wrong.

Topic		Replies	Views
CUDA racecheck, shared memory array and cudaDeviceSynchronize() CUDA Programming and Performance	10	2884	February 5, 2013
cudaDeviceSynchronize() returns cudaErrorMemoryAllocation CUDA Programming and Performance	1	549	February 2, 2018
cuda-memcheck synccheck tool detects unexpected barrier erros on Volta GPU CUDA Programming and Performance	8	1523	February 9, 2020
cuda crash without any message CUDA Programming and Performance	12	1827	January 22, 2015
cuda-memcheck (with cc 2.x) and synchronous execution CUDA-MEMCHECK	2	3814	March 28, 2014
Synchronization synchronizing a n body problem. CUDA Programming and Performance	8	4398	September 22, 2009
Program hit cudaErrorIllegalAddress (error 700) [...] on CUDA API call to cudaDeviceSynchronize CUDA-MEMCHECK	4	5098	September 29, 2021
cuda-memcheck CUDA Programming and Performance	1	853	November 16, 2015
cudaThreadSynchronize() error CUDA Programming and Performance	1	2991	October 5, 2009
incomprehensible behaviour limitations on kernel calls for host function? CUDA Programming and Performance	12	7160	April 28, 2011

Cooperative Group sync causing cuda-memcheck error (Debug mode ONLY)

Related topics