Shared memory does not have the same value across the threads of a block (issue with curand)

Hello

I am currently facing a strange issue in my kernel regarding curand() and shared memory.

In my kernel I have this loop (in pseudocode):

Initialization of the shared memory
__syncthreads()

// get the current state of the generator (one state per block)
curandStateMtgp32 *pThreadState = pDevState + blockIdx.x;

for all elt do
    Generate a pseudo-random number with curand()
    Do something with the number
    Store something in the shared memory (array of ~200 bins)
done

__syncthreads()

Do something with the shared memory
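
To make the structure concrete, here is a minimal sketch of what the kernel does (histKernel, computeBin, nElt and the blockIdx.x * MAX_BIN indexing are placeholders of mine, not the real code):

#include <curand_kernel.h>

#define MAX_BIN 200

// hypothetical stand-in for "Do something with the number"
__device__ int computeBin(float x)
{
    // map x in (0,1] onto a bin index in [0, MAX_BIN)
    return min((int)(x * (float)MAX_BIN), MAX_BIN - 1);
}

__global__ void histKernel(curandStateMtgp32 *pDevState,
                           unsigned int *aGlobalHist, int nElt)
{
    __shared__ unsigned int aLocalHist[MAX_BIN];

    // initialization of the shared memory
    for (int i = threadIdx.x; i < MAX_BIN; i += blockDim.x)
        aLocalHist[i] = 0;
    __syncthreads();

    // get the current state of the generator (one MTGP32 state per block;
    // a single state may be shared by at most 256 threads of one block)
    curandStateMtgp32 *pThreadState = pDevState + blockIdx.x;

    for (int elt = 0; elt < nElt; ++elt) {
        float x  = curand_uniform(pThreadState); // generate pseudo-random number
        int  bin = computeBin(x);                // do something with the number
        atomicAdd(&aLocalHist[bin], 1u);         // store in the shared histogram
    }
    __syncthreads();

    // do something with the shared memory: copy the per-block histogram out
    if (threadIdx.x < MAX_BIN)
        aGlobalHist[blockIdx.x * MAX_BIN + threadIdx.x] = aLocalHist[threadIdx.x];
}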

I am using a Mersenne Twister generator, initialized with:

HANDLE_ERROR(cudaFree(pDevMCSimuData->pDevState));
HANDLE_ERROR(cudaFree(pDevMCSimuData->pDevMTParams));

HANDLE_ERROR(cudaMalloc((void**)&(pDevMCSimuData->pDevState),
                        ISG_NBBLOCKS * sizeof(curandStateMtgp32_t)));
HANDLE_ERROR(cudaMalloc((void**)&(pDevMCSimuData->pDevMTParams),
                        sizeof(mtgp32_kernel_params)));

/* Reformat from predefined parameter sets to kernel format, */
/* and copy kernel parameters to device memory               */

HANDLE_CURAND_ERROR(curandMakeMTGP32Constants(mtgp32dc_params_fast_11213,
                                              (mtgp32_kernel_params *)pDevMCSimuData->pDevMTParams));

/* Initialize one state per thread block */

HANDLE_CURAND_ERROR(curandMakeMTGP32KernelState((curandStateMtgp32_t *)pDevMCSimuData->pDevState,
                                                mtgp32dc_params_fast_11213,
                                                (mtgp32_kernel_params *)pDevMCSimuData->pDevMTParams,
                                                ISG_NBBLOCKS,
                                                1234));
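
(HANDLE_ERROR and HANDLE_CURAND_ERROR are my error-checking macros; I don't show their definitions, but they follow the usual check-and-abort pattern, roughly:)

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <curand.h>

#define HANDLE_ERROR(call)                                        \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",          \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

#define HANDLE_CURAND_ERROR(call)                                 \
    do {                                                          \
        curandStatus_t st = (call);                               \
        if (st != CURAND_STATUS_SUCCESS) {                        \
            fprintf(stderr, "cuRAND error %d at %s:%d\n",         \
                    (int)st, __FILE__, __LINE__);                 \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)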

After the last __syncthreads(), if I print some elements of my shared memory, the value is not the same for all the threads of the same block.

After investigating, I found that if I do not call curand(), everything works fine.

I have no idea why I see this behaviour. Does the problem come from the way I am using cuRAND, or is there a known bug in cuRAND?

I am working with 16 blocks and 256 threads per block.

Technical information:
CUDA version: 6.5
OS: CentOS 6.x
Graphics card: NVIDIA Quadro K4200
Compute capability: 3.0

Thank you very much for your help

Regards

you show too little to make clear points/ suggestions

if the shared memory values are not the same, it may be a race, or poor synchronization
but it all depends on how the shared memory is used
and with what you showed, it is not possible to tell

Unfortunately, I cannot show more than that.
I have run the racecheck tool, but it reported no errors.
I am just surprised that shared memory can hold different values when I print it after a __syncthreads() call. Is there a known error or issue that can cause this kind of behaviour?
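
(For reference, I invoked it as below, where my_app is just a placeholder for my binary's name:)

cuda-memcheck --tool racecheck ./my_app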

Regards

it is still difficult to tell whether you are simply using shared memory as local memory, or actually using shared memory as shared memory

you need multiple threads reading/ writing the same shared memory address, for a shared memory variable not identical across threads to be ‘surprising’
so, are you truly storing a common value in a common variable, and do you subsequently read the common variable, commonly?

I use my shared memory to store a histogram. To sum up my operations:

Do something with the number

means: I compute the value and the corresponding bin of my histogram

Store something in the shared memory (array of ~200 bins)

means: updating the histogram at the computed bin with the computed value, using an atomic operation (atomicAdd())

Do something with the shared memory

means: after the __syncthreads() I copy my shared memory into global memory:

// aLocalHist        : stored in shared memory
// aGlobalHBistogram : stored in global memory
// GlobalId is an index computed from blockIdx.x and threadIdx.x
tid = threadIdx.x;
if (tid < MAX_BIN) {
    aGlobalHBistogram[GlobalId] = aLocalHist[tid];
}
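
(GlobalId itself is not shown above; a plausible definition, assuming each block writes its own MAX_BIN-wide slice of the global array, would be something like:)

int GlobalId = blockIdx.x * MAX_BIN + threadIdx.x;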

I have seen this method for creating histograms in multiple CUDA tutorials.

Unfortunately, I don’t get the expected values:

Thread #0 updates my histogram (with a non-zero value) at bin #140, but when I copy the histogram from shared memory to global memory, bin #140 does not contain that value.

If I print bin #140 from thread #0, I get the expected value, but if I print the same bin from thread #140, the value is different.

So, to answer your question, I believe I am using shared memory as shared memory and not as local memory.

“If I print bin #140 from thread #0, I get the expected value, but if I print the same bin from thread #140, the value is different”

i think i now follow

at the same time, given that you use atomics, when the histogram is completed, shouldn’t all threads discard their values and reread the shared memory array to have the correct values?
if the correct value is in shared memory, and thread 0 can access it, surely thread x can access it as well
if thread y happens to have the correct value, this is purely accident/ luck/ coincidence, given that you are using atomics

or not?
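
in code terms, something like this (a sketch only, reusing your names):

__syncthreads();                      // all atomicAdd()s to aLocalHist are now visible
unsigned int count = aLocalHist[tid]; // reread from shared memory; any copy a thread
                                      // computed before the atomics is stale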