Hello all,
I am writing a program to simulate the random walk of a number of particles (i.e., a stochastic differential equation) using the CURAND library.
In principle, I define an interval (say from -1 to 1), define a number of injection points in the interval, inject a number of particles at these points, let them perform a random walk, and record the time when a particle hits one of the boundaries.
However, I am facing a problem with my code that I'll try to explain in what follows.
This is my kernel to initialize the PRNG. It is borrowed more or less directly from the CURAND documentation:
__global__ void setup_kernel(curandState *state, long long seed) {
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    // same seed, but a different subsequence for every thread
    curand_init(seed, id, 0, &state[id]);
}
The code from main() that is important to the problem is as follows:
int nParts = 2000;
int tSteps = 20000;
int mt = 100;
int tStepsRed = tSteps / mt;
int threadsPerBlock = 512;
int blocksPerGrid = (nParts + threadsPerBlock - 1) / threadsPerBlock;

curandState *devStates;
CHECK_CUDA_ERROR(cudaMalloc((void**)&devStates, threadsPerBlock*blocksPerGrid*sizeof(curandState)));
setup_kernel<<<blocksPerGrid, threadsPerBlock>>>(devStates, seed);
for (b = 0; b < nBins; b++) {               // loop over all injection points
    for (mti = 0; mti < tStepsRed; mti++) { // tSteps is split into chunks of mt steps so the output can be written to file
        // copy xold_D etc. to the device...
        random_walk<<<blocksPerGrid, threadsPerBlock>>>(devStates, xold_D, nParts, mt,
                                                        sqts, partCount_D, bound1, bound2, diffu);
        // copy everything back to the host
    }
}
CHECK_CUDA_ERROR(cudaFree(devStates));
cudaDeviceReset();
The kernel random_walk looks like this:
__global__ void random_walk(curandState *globalState, float *xold, int nParts, int tSteps,
                            float sqts, int *partCount, float bound1, float bound2, float diffu) {
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < nParts) {
        curandState localState = globalState[id];
        float sdi = sqts * diffu;
        int pc = 0;
        partCount[id] = 0;
        float x = xold[id];
        for (int t = 0; t < tSteps; t++) {
            if (x > bound1 && x < bound2) {
                x = x + sdi*curand_normal(&localState); // this is a simple random walk
                if (x <= bound1) pc = 1;
                // bound2 does not contribute, but particles that cross bound2 are lost
            }
        }
        partCount[id] = pc;
        xold[id] = x;
        globalState[id] = localState;
    }
}
When I run the code and compare it to the solution of a finite-difference scheme, the results are in reasonable agreement.
However, I recognized the following:
Instead of using threadsPerBlock*blocksPerGrid (times sizeof(curandState)) to allocate memory, I can use an arbitrary number (e.g. 1, 200, or even sizeof(float)(!)). It has no impact on the results and produces neither compiler errors nor segmentation faults at runtime.
What is wrong with my code?
The second observation is that re-seeding the generator inside the inner loop (e.g. with a seed like seed+mti+b) gives a much smoother result, i.e. much better statistics:
setup_kernel<<<blocksPerGrid, threadsPerBlock>>>(devStates, seed+mti+b);
So my suspicion is that the generator eventually starts to produce the same random numbers again?
In any case, something seems to be wrong with the allocation of curandState in my code, and that may be causing these problems.
Any comments are highly appreciated.
Thank you,
Phillip