cuRAND: potential correlation among multiple parallel streams using the XORWOW generator

The cuRAND documentation says:

If an experiment spans multiple kernel launches, it is recommended
that threads between kernel launches be given the same seed, and sequence
numbers be assigned in a monotonically increasing way

This is how I’m doing it in a simulation kernel:

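template <typename RNG>
__global__ void initCurand(RNG* state, int seq_offset) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= NS)  // NS is the number of streams; surplus threads do nothing
        return;
    // Same seed for every thread; the sequence number selects the stream.
    curand_init(1984, idx + seq_offset, 0, &state[idx]);
}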

In the main simulation step function:

int NS = 1000; // number of parallel streams i.e. random generators I need
cudaMalloc(&d_state, NS * sizeof(RNG));
......
for (int iter = 0; ; ++iter) {
    // Initialize curand states
    initCurand<<<blocksPerGrid, threadsPerBlock>>>(d_state, NS * iter);
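
To double-check the numbering itself, here is a minimal host-only C++ sketch of the sequence numbers that the loop above hands out. `sequenceFor`, `allSequences` and `strictlyIncreasing` are hypothetical helpers written for illustration, not cuRAND functions. The sketch uses 64-bit arithmetic because the `sequence` argument of `curand_init` is an `unsigned long long`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of cuRAND): the sequence number that thread
// `idx` receives in launch `iter` when every launch uses `ns` streams.
std::uint64_t sequenceFor(std::uint64_t idx, std::uint64_t iter, std::uint64_t ns) {
    assert(idx < ns);
    return iter * ns + idx;
}

// Collect the sequence numbers in launch order, then in thread order.
std::vector<std::uint64_t> allSequences(std::uint64_t ns, std::uint64_t launches) {
    std::vector<std::uint64_t> seqs;
    seqs.reserve(ns * launches);
    for (std::uint64_t iter = 0; iter < launches; ++iter)
        for (std::uint64_t idx = 0; idx < ns; ++idx)
            seqs.push_back(sequenceFor(idx, iter, ns));
    return seqs;
}

// True if each element is larger than the one before it, which also
// implies that no sequence number is handed out twice.
bool strictlyIncreasing(const std::vector<std::uint64_t>& v) {
    for (std::size_t i = 1; i < v.size(); ++i)
        if (v[i] <= v[i - 1]) return false;
    return true;
}
```

With NS = 1000, launch 0 uses sequences 0 to 999, launch 1 uses 1000 to 1999, and so on, so no sequence number is reused. One caveat: the kernel above takes the offset as an `int`, so `NS * iter` would overflow after roughly 2.1 million iterations.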

When using XORWOW, I believe the streams are correlated with each other: the output fails PractRand after only 64 MB of data. The Philox and MRG32k3a generators seem okay.

Here’s a link to full running code: Sep 12 7:42 PM - Codeshare

I tested with nvcc 12.2, gcc 11.4.

Edit: the tests it fails are as follows:

length= 64 megabytes (2^26 bytes), time= 8.3 seconds
  Test Name                         Raw       Processed     Evaluation
  [Low8/32]BRank(12):768(1)         R= +2628  p~=  2.9e-792   FAIL !!!!!!!   
  [Low1/32]BRank(12):384(1)         R= +4695  p~=  2e-1414    FAIL !!!!!!!!  
  ...and 140 test result(s) without anomalies
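
For context, as far as I understand PractRand's naming, BRank is its binary matrix rank test, and the [Low1/32] and [Low8/32] prefixes mean that only the lowest 1 or 8 bits of each 32-bit word are tested. A failure there suggests that those low bits are linearly dependent over GF(2) more often than random bits would be. The host-only C++ sketch below illustrates the idea on small matrices. `gf2Rank` and `lowBitMatrix` are hypothetical helpers, not PractRand or cuRAND code, and the real test works on much larger matrices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Rank over GF(2) of a bit matrix. Each row is packed into one 64-bit word,
// so this sketch supports at most 64 columns.
std::size_t gf2Rank(std::vector<std::uint64_t> rows, unsigned cols) {
    assert(cols >= 1 && cols <= 64);
    std::size_t rank = 0;
    for (unsigned c = 0; c < cols && rank < rows.size(); ++c) {
        const std::uint64_t mask = std::uint64_t{1} << c;
        // Look for a pivot row that has a 1 in column c.
        std::size_t pivot = rank;
        while (pivot < rows.size() && !(rows[pivot] & mask)) ++pivot;
        if (pivot == rows.size()) continue;  // no pivot in this column
        std::swap(rows[rank], rows[pivot]);
        // Clear column c in every other row (XOR is addition in GF(2)).
        for (std::size_t r = 0; r < rows.size(); ++r)
            if (r != rank && (rows[r] & mask)) rows[r] ^= rows[rank];
        ++rank;
    }
    return rank;
}

// Build a matrix from the lowest bit of consecutive 32-bit outputs,
// `cols` bits per row, mimicking a [Low1/32] style extraction.
std::vector<std::uint64_t> lowBitMatrix(const std::vector<std::uint32_t>& words,
                                        unsigned cols) {
    assert(cols >= 1 && cols <= 64);
    std::vector<std::uint64_t> rows;
    for (std::size_t i = 0; i + cols <= words.size(); i += cols) {
        std::uint64_t row = 0;
        for (unsigned c = 0; c < cols; ++c)
            row |= static_cast<std::uint64_t>(words[i + c] & 1u) << c;
        rows.push_back(row);
    }
    return rows;
}
```

Square matrices built from random bits are full rank or close to it most of the time, so a rank distribution that sits well below that is what a test of this kind would flag.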