This is indeed a bit confusing, and at the very least warrants better documentation. It's also made more complicated by a bug in the way offsets are handled in the normal and log-normal distributions. To understand why you can't get back in sync unless you draw multiples of 4096, here's an example of what happens when you draw uniform doubles on an XORWOW generator.
The curand host API invokes the kernel generators with a block count of 64 and a thread count of 64. Each of the 4096 threads operates on a state in a different part of the main sequence, spaced apart by 2^67. The basic issue is that when you generate integers or uniform floats, each thread advances its state by one for each output sample. When you generate uniform doubles with XORWOW, each thread advances its state by two, so unless each thread produces the same number of output samples, some threads get out of sync (i.e. they are too far ahead of the other threads).
Suppose all threads start at position zero. If you draw 6144 integers, all 4096 states get advanced by one, and states 0…2047 are advanced a second time to state 2. If you draw 4096 more samples, curand starts with state 2048 and advances states 2048…4095 to state 2, and wraps to advance states 0…2047 to state 3.
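The integer case above can be checked with a small model. This is only a sketch of the bookkeeping described in this post, not curand code: the `draw` helper, the `positions` list and the round-robin cursor are assumptions made for illustration.

```python
NUM_STATES = 4096  # 64 blocks x 64 threads, as described above


def draw(positions, cursor, n, step=1):
    """Hand out n samples round-robin; each sample advances one state by `step`.

    Returns the cursor, i.e. the state that will serve the next sample.
    """
    for _ in range(n):
        positions[cursor] += step
        cursor = (cursor + 1) % len(positions)
    return cursor


positions = [0] * NUM_STATES

# First call: 6144 integers, one state advance per sample.
cursor = draw(positions, 0, 6144)
after_first = list(positions)

# Second call: 4096 more integers, picking up where the first call stopped.
cursor = draw(positions, cursor, 4096)

print("after 6144:", after_first[0], after_first[4095])
print("after 4096 more:", positions[0], positions[4095])
```

In this model the states never differ by more than one position, which is the "usual" situation for integer and uniform float draws.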
When you call curandGenerateUniformDouble with XORWOW, each thread draws two integers, which it combines to make a 53-bit mantissa. Suppose again you start with all threads at position zero. If you draw 2048 samples, 2048 of the threads advance their state to position 2, and the remaining threads (2048…4095) are still at position zero. Now if you draw 4096 integers, states 2048…4095 get advanced to state 1, and states 0…2047 get advanced to state 3. There's no way (short of a lot of complicated book-keeping) to get back to the "usual" situation where some of the states are only one ahead of the others. This is why, as you have observed, you can only get offsets to work if you draw multiples of 4096 when calling curandGenerateUniformDouble with an XORWOW generator.
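The same kind of toy model shows the double case going out of sync. Again, this is a sketch under the assumptions stated in this post (4096 states, round-robin assignment, two state advances per XORWOW double); `draw` and `spread` are hypothetical names, not part of curand.

```python
NUM_STATES = 4096  # 64 blocks x 64 threads


def draw(positions, cursor, n, step):
    """Hand out n samples round-robin; each sample advances one state by `step`."""
    for _ in range(n):
        positions[cursor] += step
        cursor = (cursor + 1) % len(positions)
    return cursor


def spread(positions):
    """Distance between the most and least advanced states."""
    return max(positions) - min(positions)


# Case 1: 2048 uniform doubles (step 2), then 4096 integers (step 1).
mixed = [0] * NUM_STATES
cursor = draw(mixed, 0, 2048, step=2)
cursor = draw(mixed, cursor, 4096, step=1)

# Case 2: a multiple of 4096 doubles keeps every state aligned.
aligned = [0] * NUM_STATES
draw(aligned, 0, 4096, step=2)

print("mixed spread:", spread(mixed))
print("aligned spread:", spread(aligned))
```

A spread of two is the state the generator cannot recover from with plain offsets, while drawing a full 4096 doubles leaves every state at the same position.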
When you generate normals or log-normals with XORWOW or MRG32K3A, curand uses the Box-Muller transform to get from the uniform distribution to the normal/log-normal distribution. In this case, each thread draws two uniforms, and returns two results. Since each thread returns two results, you have to draw 8192 samples to cycle through and advance all 4096 states.
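For reference, here is the textbook form of the Box-Muller transform, together with the arithmetic behind the 8192 figure. The exact variant curand uses internally may differ; `box_muller` and the constants below are illustrative names only.

```python
import math


def box_muller(u1, u2):
    """Map two uniforms (u1 in (0, 1], u2 in [0, 1)) to two standard normals."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)


NUM_STATES = 4096          # 64 blocks x 64 threads
RESULTS_PER_THREAD = 2     # each thread returns both Box-Muller outputs

# Number of output samples needed before every state has been used once.
samples_per_full_cycle = NUM_STATES * RESULTS_PER_THREAD

z0, z1 = box_muller(math.exp(-0.5), 0.25)
print(z0, z1, samples_per_full_cycle)
```

Because two uniforms go in and two normals come out, a request for fewer than 8192 normals leaves some of the 4096 states untouched.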
As I mentioned, this is all made more complicated by a bug I found while looking at the normal/log-normal code to see what's happening. The workaround for that bug, to get offsets to work, is to generate something other than normal or log-normal results the first time you invoke curand after creating the generator or setting the offset.
To summarize then,
For XORWOW uniform double you need to draw multiples of 4096.
MRG32K3A draws a single result to generate a uniform double, so it does not have the above restriction.
For XORWOW and MRG32K3A, normal and log-normal, single and double, you need to draw multiples of 8192.
For XORWOW all double precision results draw two samples for each result.
For MRG32K3A all results draw one input sample for each result.
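If it helps, the summary above can be turned into a small helper that rounds a request size up to a safe count. This is a sketch only: the generator and distribution labels are made-up strings for illustration, not curand enum names, and it covers just the two generators discussed here.

```python
NORMAL_FAMILY = {"normal", "normal_double", "lognormal", "lognormal_double"}


def required_multiple(generator, distribution):
    """Sample-count multiple that keeps all 4096 states in step (CUDA 4.2 rules above)."""
    if generator not in ("XORWOW", "MRG32K3A"):
        raise ValueError("only XORWOW and MRG32K3A are covered by this summary")
    if distribution in NORMAL_FAMILY:
        return 8192  # two results per thread, 4096 threads
    if generator == "XORWOW" and distribution == "uniform_double":
        return 4096  # two state advances per result
    return 1  # integers, uniform floats, MRG32K3A uniform doubles


def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return -(-n // multiple) * multiple


n = round_up(5000, required_multiple("XORWOW", "uniform_double"))
print(n)
```

The caller would then request the rounded-up count and discard the extra samples.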
That's the situation as of CUDA 4.2. I've fixed the offset bug in normal and log-normal, and expect that fix to get into CUDA 5.0. That would eliminate the need to generate ints or uniforms before generating any normals or log-normals.
For the future, I could envision adding an ordering type to curandSetGeneratorOrdering, that would cause curand to use only one sample per result, and to use the icdf method for generating normal/log-normal.
This would effectively eliminate all of the above restrictions. Generating n samples of any distribution would result in the same state as if you created a new generator and set the offset to n. Would this resolve your issues, or is there something more we should be providing? We're always interested in feedback on our library functions.