Shared memory does not have the same value across the threads of a block (issue with curand)

Hello

I am currently facing a strange issue in my kernel regarding curand() and shared memory.

In my kernel I have this loop (in pseudocode):

Initialization of the shared memory
__syncthreads()

// get the current state of the generator (one state per block)
curandStateMtgp32 *pThreadState = pDevState + blockIdx.x;

for all elt do
    Generate a pseudo-random number with curand()
    Do something with the number
    Store something in the shared memory (array of ~200 bins)
done

__syncthreads()

Do something with the shared memory
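
To make the structure concrete, here is a minimal sketch of what the kernel does (histKernel, computeBin, nElt and the blockIdx.x * MAX_BIN indexing are placeholders of mine, not the real code):

#include <curand_kernel.h>

#define MAX_BIN 200

// hypothetical stand-in for "Do something with the number"
__device__ int computeBin(float x)
{
    // map x in (0,1] onto a bin index in [0, MAX_BIN)
    return min((int)(x * (float)MAX_BIN), MAX_BIN - 1);
}

__global__ void histKernel(curandStateMtgp32 *pDevState,
                           unsigned int *aGlobalHist, int nElt)
{
    __shared__ unsigned int aLocalHist[MAX_BIN];

    // initialization of the shared memory
    for (int i = threadIdx.x; i < MAX_BIN; i += blockDim.x)
        aLocalHist[i] = 0;
    __syncthreads();

    // get the current state of the generator (one MTGP32 state per block;
    // a single state may be shared by at most 256 threads of one block)
    curandStateMtgp32 *pThreadState = pDevState + blockIdx.x;

    for (int elt = 0; elt < nElt; ++elt) {
        float x  = curand_uniform(pThreadState); // generate pseudo-random number
        int  bin = computeBin(x);                // do something with the number
        atomicAdd(&aLocalHist[bin], 1u);         // store in the shared histogram
    }
    __syncthreads();

    // do something with the shared memory: copy the per-block histogram out
    if (threadIdx.x < MAX_BIN)
        aGlobalHist[blockIdx.x * MAX_BIN + threadIdx.x] = aLocalHist[threadIdx.x];
}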

I am using a Mersenne Twister generator, initialized with:

HANDLE_ERROR(cudaFree(pDevMCSimuData->pDevState));
HANDLE_ERROR(cudaFree(pDevMCSimuData->pDevMTParams));

HANDLE_ERROR(cudaMalloc((void**)&(pDevMCSimuData->pDevState),
                        ISG_NBBLOCKS * sizeof(curandStateMtgp32_t)));
HANDLE_ERROR(cudaMalloc((void**)&(pDevMCSimuData->pDevMTParams),
                        sizeof(mtgp32_kernel_params)));

/* Reformat from predefined parameter sets to kernel format, */
/* and copy kernel parameters to device memory               */

HANDLE_CURAND_ERROR(curandMakeMTGP32Constants(mtgp32dc_params_fast_11213,
                                              (mtgp32_kernel_params *)pDevMCSimuData->pDevMTParams));

/* Initialize one state per thread block */

HANDLE_CURAND_ERROR(curandMakeMTGP32KernelState((curandStateMtgp32_t *)pDevMCSimuData->pDevState,
                                                mtgp32dc_params_fast_11213,
                                                (mtgp32_kernel_params *)pDevMCSimuData->pDevMTParams,
                                                ISG_NBBLOCKS,
                                                1234));
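
(HANDLE_ERROR and HANDLE_CURAND_ERROR are my error-checking macros; I don't show their definitions, but they follow the usual check-and-abort pattern, roughly:)

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <curand.h>

#define HANDLE_ERROR(call)                                        \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",          \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

#define HANDLE_CURAND_ERROR(call)                                 \
    do {                                                          \
        curandStatus_t st = (call);                               \
        if (st != CURAND_STATUS_SUCCESS) {                        \
            fprintf(stderr, "cuRAND error %d at %s:%d\n",         \
                    (int)st, __FILE__, __LINE__);                 \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)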

After the last __syncthreads(), if I print some elements of my shared memory, the value is not the same for all the threads of the same block.

After investigating, I found that if I do not call curand(), everything works fine.

I have no idea why I see this behaviour. Does the problem come from the way I am using cuRAND, or is there a known bug in cuRAND?

I am working with 16 blocks and 256 threads per block.

Technical information:
CUDA version: 6.5
OS: CentOS 6.x
Graphics card: NVIDIA Quadro K4200
Compute capability: 3.0

Thank you very much for your help

Regards

you show too little to make clear points/ suggestions

if the shared memory values are not the same, it may be a race, or poor synchronization
but it all depends on how the shared memory is used
and with what you showed, it is not possible to tell

Unfortunately, I cannot show more than that.
I have run the racecheck tool, but it reported no errors.
I am just surprised that shared memory can hold different values when I print it after a __syncthreads() call. Is there a known error or issue that can cause this kind of behaviour?
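
(For reference, I invoked it as below, where my_app is just a placeholder for my binary's name:)

cuda-memcheck --tool racecheck ./my_app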

Regards

it is still difficult to tell whether you are simply using shared memory as local memory, or actually using shared memory as shared memory

you need multiple threads reading/ writing the same shared memory address, for a shared memory variable not identical across threads to be ‘surprising’
so, are you truly storing a common value in a common variable, and do you subsequently read the common variable, commonly?

I use my shared memory to store a histogram. To sum up my operations:

Do something with the number

means: I compute the value and the corresponding bin of my histogram

Store something in the shared memory (array of ~200 bins)

means: updating the histogram at the computed bin with the computed value, using an atomic operation (atomicAdd())

Do something with the shared memory

means: after the __syncthreads() I copy my shared memory into global memory:

// aLocalHist        : stored in shared memory
// aGlobalHBistogram : stored in global memory
// GlobalId is an index computed from blockIdx.x and threadIdx.x
tid = threadIdx.x;
if (tid < MAX_BIN) {
    aGlobalHBistogram[GlobalId] = aLocalHist[tid];
}
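
(GlobalId itself is not shown above; a plausible definition, assuming each block writes its own MAX_BIN-wide slice of the global array, would be something like:)

int GlobalId = blockIdx.x * MAX_BIN + threadIdx.x;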

I have seen this method for creating histograms in multiple CUDA tutorials.

Unfortunately, I don’t get the expected values:

Thread #0 updates my histogram (with a non-zero value) at bin #140, but when I copy the histogram from shared memory to global memory, bin #140 does not contain that value.

If I print bin #140 from thread #0, I get the expected value, but if I print the same bin from thread #140, the value is different.

So, to answer your question, I believe I am using shared memory as shared memory and not as local memory.

“If I print bin #140 from thread #0, I get the expected value, but if I print the same bin from thread #140, the value is different”

i think i now follow

at the same time, given that you use atomics, when the histogram is completed, shouldn’t all threads discard their values and reread the shared memory array to have the correct values?
if the correct value is in shared memory, and thread 0 can access it, surely thread x can access it as well
if thread y happens to have the correct value, this is purely accident/ luck/ coincidence, given that you are using atomics

or not?
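
in code terms, something like this (a sketch only, reusing your names):

__syncthreads();                      // all atomicAdd()s to aLocalHist are now visible
unsigned int count = aLocalHist[tid]; // reread from shared memory; any copy a thread
                                      // computed before the atomics is stale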