I’ve been working proactively to write all my CUDA kernels using the Cooperative Groups in CUDA 9.0. Everything has been working great and the code executes smoothly. As I begin to use cuda-memcheck to check for memory issues I have found some odd behavior. Whenever I run the following
cuda-memcheck --tool racecheck ./my_program
on code compiled with the Debug flag it errors out on the following cudaDeviceSynchronize. Error below
CUDA error at ../src/Particle_BPF_GPU_impl.h:347 code=77(cudaErrorIllegalAddress) "cudaDeviceSynchronize( )"
MC:1 CUDA error at ../src/Particle_BPF_GPU_impl.h:347 code=77(cudaErrorIllegalAddress) "cudaDeviceSynchronize( )"
When I run cuda-memcheck on my Release code I have no issue. Below is a snippet of code
template<typename T>
__global__ void __launch_bounds__(kTPB) addInitialState( const int n, T * __restrict__ m_devParticleState,
const T * m_devRandomNumbers ) {
const auto block = cg::this_thread_block( );
typedef cub::CacheModifiedInputIterator<cub::LOAD_LDG, T> InputItr;
typedef cub::BlockLoad<T, kTPB, kSysDim, cub::BLOCK_LOAD_WARP_TRANSPOSE> BlockLoad;
typedef cub::BlockStore<T, kTPB, kSysDim, cub::BLOCK_STORE_WARP_TRANSPOSE> BlockStore;
__shared__ union TempStorage {
typename BlockLoad::TempStorage load;
typename BlockStore::TempStorage store;
} temp_storage;
auto tid = blockIdx.x * blockDim.x + threadIdx.x;
auto loop = blockIdx.x * blockDim.x * kSysDim;
const auto gridSize = blockDim.x * gridDim.x;
T threadData[kSysDim] { };
T randomNumbers[kSysDim] { };
while ( tid < n ) {
BlockLoad( temp_storage.load ).Load( InputItr( m_devRandomNumbers + loop ), randomNumbers );
block.sync( ); //<- No issues with replaced with __syncthreads() (Debug ONLY);
#pragma unroll kSysDim
for ( auto i = 0; i < kSysDim; i++ ) {
threadData[i] = 0.0f;
#pragma unroll kSysDim
for ( auto j = 0; j < kSysDim; j++ ) {
threadData[i] += cProcessNoiseCov[i * kSysDim + j] * randomNumbers[j];
}
threadData[i] += cInitialState[i];
}
BlockStore( temp_storage.store ).Store( m_devParticleState + loop, threadData );
block.sync( ); //<- No issues with replaced with __syncthreads() (Debug ONLY);
tid += gridSize;
loop += gridSize * kSysDim;
}
}
The error received when executed on Debug mode goes away when
block.sync()
is replaced with
__syncthreads()
.
It would nice to know if what I’m seeing is normal or am I implementing something wrong.