curandState: strange behaviour (CURAND)

Hello all,

I am writing a program to simulate the random walk of a number of particles (i.e. a stochastic differential equation) using the CURAND library.

In principle, I define an interval (say from -1 to 1), define a number of injection points in the interval, inject a number of particles at these points, let them perform a random walk, and register the time when a particle hits one of the boundaries.

However, I am facing a problem with my code that I'll try to explain in what follows:

This is my kernel to initialize the PRNG. The kernel is obviously borrowed from the CURAND doc:

__global__ void setup_kernel(curandState *state, long long seed) {
	int id = threadIdx.x + blockIdx.x * blockDim.x;
	curand_init(seed, id, 0, &state[id]);
}

The code from main() that is important to the problem is as follows:

int nParts = 2000;
int tSteps = 20000;
int mt = 100;
int tStepsRed = tSteps/mt;

int threadsPerBlock = 512;
int blocksPerGrid = (nParts + threadsPerBlock - 1) / threadsPerBlock;

curandState *devStates;
CHECK_CUDA_ERROR(cudaMalloc((void**)&devStates, threadsPerBlock*blocksPerGrid*sizeof(curandState)));

setup_kernel<<<blocksPerGrid, threadsPerBlock>>>(devStates, seed);

for (b=0; b<nBins; b++) {	// loop over all injection points

    for (mti=0; mti<tStepsRed; mti++) {	// I divide tSteps into mt chunks to write the output to file

        // copy xold_D etc. to the device...

        random_walk<<<blocksPerGrid, threadsPerBlock>>>(devStates, xold_D, nParts, mt,
                                                        sqts, partCount_D, bound1, bound2, diffu);
        // copy everything back to the host
    }
}

The kernel random_walk looks like this:

__global__ void random_walk(curandState *globalState, float *xold, int nParts, int tSteps, float sqts, int *partCount, float bound1, float bound2, float diffu) {
	int t;
	int pc;
	float x;
	float sdi = sqts*diffu;
	int id = blockDim.x * blockIdx.x + threadIdx.x;
	curandState localState = globalState[id];
	if (id < nParts) {
		pc = partCount[id];
		x = xold[id];
		for (t=0; t<tSteps; t++) {
			if (x > bound1 && x < bound2) {
				x = x + sdi*curand_normal(&localState);  // this is a simple random walk
				if (x <= bound1) pc = 1;
				// bound2 does not contribute, but particles that cross bound2 are lost
			}
		}
		partCount[id] = pc;
		xold[id] = x;
		globalState[id] = localState;
	}
}

When I run the code and compare it to the solution of a finite difference scheme, the results agree reasonably well.

However, I recognized the following:

Instead of using threadsPerBlock*blocksPerGrid (times sizeof(curandState)) to allocate memory, I can use an arbitrary number (e.g. 1, 200, or even allocate floats(!)). This has no impact on the results and produces neither compiler errors nor segmentation faults at runtime.
What is wrong with my code?

The second observation is that re-seeding the generator inside the inner loop (e.g. with a seed like seed+mti+b) gives a much, much smoother result, i.e. much better statistics:

setup_kernel<<<blocksPerGrid, threadsPerBlock>>>(devStates, seed+mti+b);

So my suspicion is that the generator eventually starts producing the same random numbers again. Could that be the case?

In any case, something seems to be wrong with the allocation of curandState in my code, and that may be causing the problems.

Any comments are highly appreciated.

Thank you,


I have also been experiencing weird behaviour with CURAND. My code is very similar to yours, and I sometimes get results that look as if all the samples are being drawn from the same state. I really have no idea what is causing it, but it seems to be fixed when I lower my threads per block.

If your code carefully checks all status returns from CUDA, CURAND and other APIs, and no errors are reported, I would suggest filing a bug report using the bug reporting form linked from the registered developer website.

The problem may be a usage issue rather than a bug in CURAND, but I cannot tell either way. If it is in fact a usage issue, we may want to improve the documentation; if it turns out to be a bug, we would want to fix it. Please attach a self-contained repro app to your bug report. Thanks!

Thank you very much for your replies. I will start writing a minimal program to illustrate my observations and will check all possible status returns.