I am currently facing a strange issue in my kernel regarding curand() and shared memory.
In my kernel I have this loop (in pseudocode):
    Initialization of the shared memory
    __syncthreads()

    // get the current state of the generator
    curandStateMtgp32 * pThreadState = pDevState + blockIdx.x ;

    for all elt do
        Generate pseudo random with curand()
        Do something with the number
        Store something in the shared memory (array of ~200 bins)
    done
    __syncthreads()

    Do something with the shared memory
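The pseudocode above could be written as the following CUDA kernel. This is only a minimal sketch under stated assumptions, not the actual kernel: the kernel name `histoKernel`, the output buffer `pDevHisto`, the parameter `nbEltPerThread`, the fixed bin count `NB_BINS`, and the modulo binning rule are all hypothetical and chosen only to make the shape of the loop concrete.

```cuda
#include <curand_kernel.h>

#define NB_BINS 200  // assumed bin count (the post only says "~200 bins")

// Hypothetical kernel illustrating the loop structure described above.
// pDevState holds one MTGP32 state per block; pDevHisto (assumed) receives
// NB_BINS counters per block.
__global__ void histoKernel(curandStateMtgp32 *pDevState,
                            unsigned int *pDevHisto,
                            int nbEltPerThread)
{
    __shared__ unsigned int bins[NB_BINS];

    // Initialization of the shared memory
    for (int i = threadIdx.x; i < NB_BINS; i += blockDim.x)
        bins[i] = 0;
    __syncthreads();

    // Get the current state of the generator (one state per block)
    curandStateMtgp32 *pThreadState = pDevState + blockIdx.x;

    // In this sketch every thread runs the same number of iterations
    for (int elt = 0; elt < nbEltPerThread; ++elt) {
        // Generate a pseudo-random number
        unsigned int r = curand(pThreadState);
        // Store something in the shared memory (assumed rule: modulo binning)
        atomicAdd(&bins[r % NB_BINS], 1u);
    }
    __syncthreads();

    // Do something with the shared memory: copy the block's bins out
    for (int i = threadIdx.x; i < NB_BINS; i += blockDim.x)
        pDevHisto[blockIdx.x * NB_BINS + i] = bins[i];
}
```

With the configuration given in this post, such a kernel would be launched as `histoKernel<<<16, 256>>>(pDevState, pDevHisto, nbEltPerThread)`, where `pDevHisto` is assumed to hold `16 * NB_BINS` counters.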
I am using a Mersenne Twister (MTGP32) generator, initialized with:
    HANDLE_ERROR(cudaFree(pDevMCSimuData->pDevState)) ;
    HANDLE_ERROR(cudaFree(pDevMCSimuData->pDevMTParams)) ;

    HANDLE_ERROR(cudaMalloc((void**)&(pDevMCSimuData->pDevState), ISG_NBBLOCKS*sizeof(curandStateMtgp32_t)));
    HANDLE_ERROR(cudaMalloc((void**)&(pDevMCSimuData->pDevMTParams), sizeof(mtgp32_kernel_params)));

    /* Reformat from predefined parameter sets to kernel format, */
    /* and copy kernel parameters to device memory */
    HANDLE_CURAND_ERROR(curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, (mtgp32_kernel_params *)pDevMCSimuData->pDevMTParams));

    /* Initialize one state per thread block */
    HANDLE_CURAND_ERROR(curandMakeMTGP32KernelState((curandStateMtgp32_t *)pDevMCSimuData->pDevState, mtgp32dc_params_fast_11213, (mtgp32_kernel_params *)pDevMCSimuData->pDevMTParams, ISG_NBBLOCKS, 1234));
After the last __syncthreads(), if I print some elements of the shared memory, the values are not the same for all threads of the same block.
After investigation, I found that if I do not call curand(), everything works fine.
I have no idea why I see this behaviour. Does the problem come from the way I am using cuRAND, or is there a known bug in cuRAND?
I am working with 16 blocks and 256 threads per block.
CUDA version 6.5
OS: CentOS 6.X
Graphics card: NVIDIA Quadro K4200
Compute capability: 3.0
Thank you very much for your help