How to get the same output from CURAND on the CPU and GPU

I want to get the same random number sequence from CURAND on the CPU and on the GPU (with a single thread in both cases). But the sequences differ when I generate them the ‘regular’ way (as in the CURAND guide). The code I used for the CPU is:

// compile: nvcc thisfile.c -lcurand

#include <stdio.h>

#include <curand.h>

int main()

{

  int i;

  double data;

  curandGenerator_t gen;

  curandCreateGeneratorHost(&gen, CURAND_RNG_PSEUDO_DEFAULT);

  curandSetPseudoRandomGeneratorSeed(gen,1234);

  for(i=0;i<10;i++){

    //curandSetGeneratorOffset(gen,i*8192); // 2*4096

    curandGenerateUniformDouble(gen, &data, 1);

    printf("%lf ",data);

  }

  printf("\n");

  curandDestroyGenerator(gen);

  return 0;

}

The GPU code is:

// compile: nvcc thisfile.cu

#include <stdio.h>

#include <curand_kernel.h>

__global__ void kernel()

{

  int i;

  curandState rngState;

  curand_init(1234,0,0,&rngState);

  for(i=0;i<10;i++){

    //curand_init(1234,i,0,&rngState); // i: sequence number

    printf("%lf ",curand_uniform_double(&rngState));

  }

  printf("\n");

}

int main()

{

  kernel<<<1,1>>>();

  cudaDeviceSynchronize(); // wait for the kernel; this also flushes in-kernel printf output

}

However, if I (1) remove the curand_init before the for loop in the GPU code and use the curand_init inside the loop instead (currently commented out), the results are the same. Or, if I (2) use curandSetGeneratorOffset in the CPU code (currently commented out), the CPU output becomes the same as the GPU output. Does that mean the GPU and CPU versions use the same sequence, but the CPU version picks one number from the sequence every 8192 numbers (the 2nd way), or alternatively (the 1st way) the CPU version picks one number from the sequence every 2^67 numbers (as stated here)? And why does it do this? It looks strange…

The code above is for double. For float, only the 1st way works as written. To make the 2nd way work too, the offset factor needs to be changed from 8192 to 4096.

Does anyone know the answer to the question in the title, or is this a reason so few people use CURAND?

The quick answer: the simplest way to get the same results on the CPU and GPU is to use the host API. It lets you generate random values into memory on either the CPU or the GPU; the only difference is whether you call curandCreateGeneratorHost() or curandCreateGenerator().
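To make the host-API route concrete, here is a minimal sketch that fills one buffer with a host generator and one with a device generator, using the same generator type and seed, and prints the values side by side. The seed 1234 and the count of 10 are taken from the question; the variable names are just for illustration, and error checking is omitted for brevity. I have not attached expected output, so treat it as something to run and check rather than a guarantee.

```cuda
// compile: nvcc thisfile.cu -lcurand
#include <stdio.h>
#include <cuda_runtime.h>
#include <curand.h>

#define N 10  /* number of values, as in the question */

int main(void)
{
  double host_vals[N];  /* filled directly by the host generator */
  double dev_copy[N];   /* device results copied back for comparison */
  double *d_vals;
  curandGenerator_t host_gen, dev_gen;
  int i;

  /* Host generator: writes into CPU memory. */
  curandCreateGeneratorHost(&host_gen, CURAND_RNG_PSEUDO_DEFAULT);
  curandSetPseudoRandomGeneratorSeed(host_gen, 1234ULL);
  curandGenerateUniformDouble(host_gen, host_vals, N);

  /* Device generator: same type and seed, writes into GPU memory. */
  cudaMalloc((void **)&d_vals, N * sizeof(double));
  curandCreateGenerator(&dev_gen, CURAND_RNG_PSEUDO_DEFAULT);
  curandSetPseudoRandomGeneratorSeed(dev_gen, 1234ULL);
  curandGenerateUniformDouble(dev_gen, d_vals, N);
  cudaMemcpy(dev_copy, d_vals, N * sizeof(double), cudaMemcpyDeviceToHost);

  /* Print both columns so they can be compared by eye. */
  for (i = 0; i < N; i++)
    printf("%d: host %f  device %f\n", i, host_vals[i], dev_copy[i]);

  curandDestroyGenerator(host_gen);
  curandDestroyGenerator(dev_gen);
  cudaFree(d_vals);
  return 0;
}
```

Note that both buffers are filled with one call for all 10 values, rather than one value per call as in the question.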

Getting the same results from the host API and the device API is a bit more work; you have to set things up carefully. The basic idea is that mathematically there is one long sequence of pseudorandom numbers. This long sequence is then cut up into chunks and shuffled together to get a final sequence that can be generated in parallel.

For CURAND, we use the seed to pick a random spot to start, then cut the long sequence into 4096 chunks, each spaced 2^67 positions apart. The host API lets you grab blocks of results from this shuffled sequence. If you request 8192 results, you will get the first result from each of the 4096 chunks, then the second result from each of the 4096 chunks.
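The interleaving described above can be written down as simple index arithmetic. This is only a sketch of that description: the helper name host_index_to_chunk and the struct are made up for illustration and are not part of CURAND, and the indices count 32-bit results.

```c
#include <stddef.h>

#define NUM_CHUNKS 4096  /* number of chunks stated above */

/* Hypothetical helper type, not a CURAND type. */
typedef struct {
  size_t chunk;     /* which of the 4096 chunks the result comes from */
  size_t position;  /* how far into that chunk the result sits */
} chunk_pos;

/* Map the n-th 32-bit result in host-API order to (chunk, position):
   results 0..4095 are the first result of chunks 0..4095,
   results 4096..8191 are the second result of chunks 0..4095, and so on. */
chunk_pos host_index_to_chunk(size_t n)
{
  chunk_pos p;
  p.chunk = n % NUM_CHUNKS;
  p.position = n / NUM_CHUNKS;
  return p;
}
```

For example, host_index_to_chunk(4096) gives chunk 0, position 1: the second result of the first chunk.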

For the device API using curand_init(), you explicitly give the subsequence number and manage the threads yourself. If you want to exactly match the results from the host API, you need to launch 4096 threads in total, then have each one call curand_init() with the same seed and its own subsequence number, from 0 to 4095. Then you need to store the results in a coalesced, strided manner: thread 0 goes first with one value, next in memory is thread 1 with one value, then thread 2, etc…
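A minimal sketch of that recipe, assuming the default generator and single-precision values: 4096 threads each call curand_init() with the same seed and their own subsequence number, write their values in the strided layout, and the result is compared against the host API. The kernel name device_api_fill, the block size of 256 and PER_THREAD = 2 are arbitrary choices for illustration, and error checking is omitted. The program just reports a mismatch count, so run it to confirm on your CUDA version.

```cuda
// compile: nvcc thisfile.cu -lcurand
#include <stdio.h>
#include <stdlib.h>
#include <curand.h>
#include <curand_kernel.h>

#define NUM_CHUNKS 4096  /* total threads, one per chunk */
#define PER_THREAD 2     /* values drawn by each thread (arbitrary) */

__global__ void device_api_fill(float *out, unsigned long long seed)
{
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  curandState state;
  int j;

  /* Same seed everywhere; the thread index is the subsequence number. */
  curand_init(seed, tid, 0, &state);

  /* Strided store: all first values, then all second values, ... */
  for (j = 0; j < PER_THREAD; j++)
    out[j * NUM_CHUNKS + tid] = curand_uniform(&state);
}

int main(void)
{
  const size_t n = (size_t)NUM_CHUNKS * PER_THREAD;
  float *from_device = (float *)malloc(n * sizeof(float));
  float *from_host = (float *)malloc(n * sizeof(float));
  float *d_out;
  curandGenerator_t gen;
  size_t i, mismatches = 0;

  /* Device API: 16 blocks of 256 threads = 4096 threads. */
  cudaMalloc((void **)&d_out, n * sizeof(float));
  device_api_fill<<<NUM_CHUNKS / 256, 256>>>(d_out, 1234ULL);
  cudaMemcpy(from_device, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

  /* Host API: same seed, one call for all n values. */
  curandCreateGeneratorHost(&gen, CURAND_RNG_PSEUDO_DEFAULT);
  curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
  curandGenerateUniform(gen, from_host, n);

  for (i = 0; i < n; i++)
    if (from_device[i] != from_host[i])
      mismatches++;
  printf("%lu mismatches out of %lu values\n",
         (unsigned long)mismatches, (unsigned long)n);

  curandDestroyGenerator(gen);
  cudaFree(d_out);
  free(from_device);
  free(from_host);
  return 0;
}
```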

The reason you are seeing the number 8192 is that you are generating double-precision values. Each double result uses two 32-bit results, so the stride is 2 × 4096 = 8192.
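The offset factors reported in the question (4096 for float, 8192 for double) follow from the same arithmetic. The helper below is hypothetical and not a CURAND call; it only computes the value one would pass to curandSetGeneratorOffset() to line up with the i-th value drawn from subsequence 0 on the device, based on the behaviour described in this thread.

```c
#include <stddef.h>

#define NUM_CHUNKS 4096  /* number of chunks stated above */

/* Hypothetical helper: host-API offset matching the i-th value that a
   single device thread draws from subsequence 0.
   words_per_value is 1 for float and 2 for double, because each double
   consumes two 32-bit results. */
size_t host_offset_for_draw(size_t i, size_t words_per_value)
{
  return i * words_per_value * NUM_CHUNKS;
}
```

With words_per_value = 2 this gives i * 8192, the offset used in the commented-out line of the CPU code; with words_per_value = 1 it gives i * 4096.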

Let me know if that doesn’t explain things.