Problem generating random numbers with the Mersenne Twister (CURAND MTGP32)

I have a problem generating random numbers with the MTGP32 generator. I have already worked with XORWOW and MRG32k3a, so I proceeded the same way: on entering a kernel, I copy the state into a local variable and work on that copy. When I do the same thing here, the generator keeps returning the same random numbers, whereas everything works when I go through a pointer. Here’s the code with the copy:

__global__ void generate_kernel( curandStateMtgp32 *state,
                                 int n )
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    float x;

    curandStateMtgp32 localState = state[blockIdx.x];

    /* Generate pseudo-random normal variable */
    for(int i = 0; i < n; i++) {
        x = curand_normal( &localState );
        printf("tid: %d x: %f\n", id, x);
    }
}

And here is the output (threads 1 and 2 get the same value on every iteration):

tid: 0 x: 0.207837
tid: 1 x: -0.091346
tid: 2 x: 0.294019
tid: 0 x: 2.684819
tid: 1 x: -0.091346
tid: 2 x: 0.294019
tid: 0 x: 1.433268
tid: 1 x: -0.091346
tid: 2 x: 0.294019

When I work through a pointer instead, the results are correct. Here’s the code:

__global__ void generate_kernel( curandStateMtgp32 *state,
                                 int n )
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    float x;

    curandStateMtgp32 *localState = &state[blockIdx.x];

    /* Generate pseudo-random normal variable */
    for(int i = 0; i < n; i++) {
        x = curand_normal( localState );
        printf("tid: %d x: %f\n", id, x);
    }
}

And the results are:

tid: 0 x: 0.207837
tid: 1 x: -0.091346
tid: 2 x: 0.294019
tid: 0 x: 2.684819
tid: 1 x: -1.183960
tid: 2 x: -0.621348
tid: 0 x: 1.433268
tid: 1 x: 0.571323
tid: 2 x: -0.735758

Can someone explain what I’m doing wrong, or whether this is a compiler bug? I don’t understand why, when I work with a copy of the state, the first thread gets different numbers while the others don’t.

Thank you.

I can post the whole code if you want to test it yourself.

I’m working with Red Hat 6.x, a K20Xm GPU, and CUDA 5.5. Compilation line:

nvcc -arch=sm_35 -lcurand x.cu

You want a random number generator state per thread, but your code only has a state per block:

curandStateMtgp32 localState = state[blockIdx.x];

I assume you want something like:

int id = threadIdx.x + blockIdx.x * blockDim.x;
curandStateMtgp32 localState = state[id];

Right? (This assumes your state array has been initialized for blockDim.x * gridDim.x elements.)
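
One caveat worth checking against the cuRAND documentation: as far as I can tell, MTGP32 differs from XORWOW and MRG32k3a in that one state is meant to be shared by all threads of a block (at most 256 threads per state), so per-thread indexing may not apply to this generator. My understanding, which you should treat as an assumption, is that the generator advances a shared offset inside the state from a single thread of the block, so private copies held by the other threads would never advance. That would match the output above, where only tid 0 changes. Below is a minimal sketch of a complete test program that keeps one state per block and passes it by pointer, as in the working kernel. The launch configuration, the seed 1234, and the names blockState, devStates and devParams are placeholders of mine, and error checking is omitted.

#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>       /* curandMakeMTGP32Constants, curandMakeMTGP32KernelState */
#include <curand_mtgp32dc_p_11213.h>  /* mtgp32dc_params_fast_11213 parameter table */

/* Every thread of a block draws through a pointer to that block's state,
   so updates made inside curand_normal() reach the shared state object. */
__global__ void generate_kernel( curandStateMtgp32 *state,
                                 int n )
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    curandStateMtgp32 *blockState = &state[blockIdx.x];  /* pointer, not a copy */

    /* Generate pseudo-random normal variable */
    for(int i = 0; i < n; i++) {
        float x = curand_normal( blockState );
        printf("tid: %d x: %f\n", id, x);
    }
}

int main()
{
    const int numBlocks = 1;        /* placeholder launch configuration */
    const int threadsPerBlock = 3;  /* must not exceed 256 per MTGP32 state */
    const int n = 3;

    curandStateMtgp32 *devStates;
    mtgp32_kernel_params *devParams;

    /* One state per block, plus one shared set of kernel parameters */
    cudaMalloc((void **)&devStates, numBlocks * sizeof(curandStateMtgp32));
    cudaMalloc((void **)&devParams, sizeof(mtgp32_kernel_params));

    /* Host-side setup using the pre-generated 11213 parameter set */
    curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devParams);
    curandMakeMTGP32KernelState(devStates, mtgp32dc_params_fast_11213,
                                devParams, numBlocks, 1234);

    generate_kernel<<<numBlocks, threadsPerBlock>>>(devStates, n);
    cudaDeviceSynchronize();  /* also flushes device-side printf output */

    cudaFree(devStates);
    cudaFree(devParams);
    return 0;
}

This should build with the same compilation line as in the question. If a block needs more than 256 threads, the usual approach with this generator is to launch more blocks rather than to enlarge the block.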