The cuRAND documentation says:
If an experiment spans multiple kernel launches, it is recommended
that threads between kernel launches be given the same seed, and sequence
numbers be assigned in a monotonically increasing way.
This is how I’m doing it in a simulation kernel:
template <typename RNG>
__global__ void initCurand(RNG* state, int seq_offset) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= NS)
        return;
    // Same seed for every thread; each thread gets its own sequence number.
    curand_init(1984, idx + seq_offset, 0, &state[idx]);
}
In the main simulation step function:
int NS = 1000; // number of parallel streams, i.e. random generators I need
cudaMalloc(&d_state, NS * sizeof(RNG));
......
for (int iter = 0; ; ++iter) {
    // Initialize curand states
    initCurand<<<blocksPerGrid, threadsPerBlock>>>(d_state, NS * iter);
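To make the numbering concrete, here is a minimal host-side sketch (plain C++, no GPU required) of the sequence numbers that the scheme above hands to curand_init. The helper names `sequenceId` and `sequencesAreDistinct` are hypothetical and not part of my simulation code; they only mirror the `idx + NS * iter` arithmetic, assuming every launch initializes exactly NS states.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

// Hypothetical helper: the sequence number given to curand_init for
// stream `idx` on launch `iter`, i.e. idx + NS * iter as in the kernel call.
// 64-bit is used here because curand_init takes an unsigned long long sequence.
std::uint64_t sequenceId(std::uint64_t idx, std::uint64_t iter,
                         std::uint64_t numStreams) {
    assert(idx < numStreams);  // same bound as the `idx >= NS` guard in the kernel
    return idx + numStreams * iter;
}

// Hypothetical check: returns true if no sequence number is reused
// across `numLaunches` launches of `numStreams` threads each.
bool sequencesAreDistinct(std::uint64_t numStreams, std::uint64_t numLaunches) {
    std::unordered_set<std::uint64_t> seen;
    for (std::uint64_t iter = 0; iter < numLaunches; ++iter) {
        for (std::uint64_t idx = 0; idx < numStreams; ++idx) {
            // insert().second is false when the value was already present
            if (!seen.insert(sequenceId(idx, iter, numStreams)).second) {
                return false;
            }
        }
    }
    return true;
}
```

With NS = 1000, launch 0 uses sequences 0 to 999, launch 1 uses 1000 to 1999, and so on, so as far as I can tell the sequence numbers increase monotonically across launches and never repeat.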
When using XORWOW, I believe I’m getting streams that are correlated with each other: the output fails PractRand after only 64 MB of data. The Philox and MRG32 generators seem okay.
Here’s a link to full running code: Sep 12 7:42 PM - Codeshare
I tested with nvcc 12.2, gcc 11.4.
Edit: the tests it fails are as follows:
length= 64 megabytes (2^26 bytes), time= 8.3 seconds
  Test Name                    Raw        Processed       Evaluation
  [Low8/32]BRank(12):768(1)    R= +2628   p~= 2.9e-792    FAIL !!!!!!!
  [Low1/32]BRank(12):384(1)    R= +4695   p~= 2e-1414     FAIL !!!!!!!!
  ...and 140 test result(s) without anomalies