I’m computing a 1024x1024 output map of random numbers (using curand_uniform and curand_normal).

Generating the curandStates for the map (currently one per element) takes some time. If I use the same state for several elements, i.e. generate fewer states, will that still yield high-quality pseudorandom numbers?

1. Initialize the 1024*1024 curandState structs once, and then reuse them across the generation of several output maps. Make sure to use the same seed for all threads, but different sequence numbers, and also be sure to save the curandState back to global memory at the end of the kernel, so the next kernel will pick up where the previous one left off.

2. Use fewer threads to generate your output map in order to amortize the initialization over more random numbers per thread. Instead of using 1024*1024 threads, use 1024 threads and have each thread produce 1024 random numbers. You’ll want to use a small block size (32 or 64) to ensure you have at least as many blocks as multiprocessors. (You can play with the ratio, of course. Create 2048 threads and have each produce 512 numbers, etc…)
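
The two suggestions above can be combined. Here is a rough sketch, not a tuned implementation: the kernel and helper names (setupStates, fillMap, generateMaps) are made up for illustration, error checking is left out, and it only uses curand_uniform. States are initialized once with the same seed and a different sequence number per thread, each thread produces perThread values, and the state is written back so the next launch continues the stream.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Run once: same seed for every thread, a different sequence number per thread.
__global__ void setupStates(curandState *states, unsigned long long seed, int numStates)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < numStates)
        curand_init(seed, id, 0, &states[id]);
}

// Run once per output map: each thread fills perThread elements.
__global__ void fillMap(curandState *states, float *map, int numStates, int perThread)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= numStates) return;

    curandState local = states[id];   // work on a local copy of the state
    for (int i = 0; i < perThread; ++i) {
        // Strided indexing so neighbouring threads write neighbouring elements.
        map[i * numStates + id] = curand_uniform(&local);
    }
    states[id] = local;               // save, so the next launch picks up here
}

// Hypothetical host helper: generates numMaps maps back to back with the
// same set of states and returns the last map. No error checking.
std::vector<float> generateMaps(int numStates, int perThread, int numMaps,
                                unsigned long long seed)
{
    const int n = numStates * perThread;
    const int block = 64;             // small blocks, as suggested above
    const int grid = (numStates + block - 1) / block;

    curandState *dStates = nullptr;
    float *dMap = nullptr;
    cudaMalloc(&dStates, numStates * sizeof(curandState));
    cudaMalloc(&dMap, n * sizeof(float));

    setupStates<<<grid, block>>>(dStates, seed, numStates);   // paid once
    for (int m = 0; m < numMaps; ++m)
        fillMap<<<grid, block>>>(dStates, dMap, numStates, perThread);

    std::vector<float> host(n);
    cudaMemcpy(host.data(), dMap, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dMap);
    cudaFree(dStates);
    return host;
}
```

Calling generateMaps(1024, 1024, numMaps, seed) corresponds to the 1024 threads x 1024 numbers split; generateMaps(2048, 512, ...) is the other ratio mentioned.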

Ah, thank you Seibert! You are helpful as always :-)

I’m currently doing as you suggested in option one: generating the curand states in “offline mode” in a separate kernel and then reusing them many, many times in my “real-time mode”.

BUT from my interpretation of option #2, I could actually produce just a fraction of the number of curand states and still obtain good results?

This could be a nice option, since I could afford to spend 1-2 ms generating the states on-chip and, as you said, pay it all off by reusing the already initialized states.

The general idea is to have one PRNG state per thread, so that the streams of random numbers generated for each thread are independent of one another. One common method of achieving this is to assign separate, non-overlapping portions of the very large state space of a single PRNG to each thread.

To do so, the state transition of the PRNG is typically represented as a matrix, and the starting state for each thread is moved to the desired position inside the PRNG state space by matrix exponentiation: raising the transition matrix to the Nth power advances the PRNG state by N steps, and this can be done in O(log(N)) time.
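
To make the O(log(N)) jump concrete, here is a toy sketch. It does not use cuRAND's generator; it assumes a simple 64-bit LCG (the multiplier and increment are the MMIX constants from Knuth), whose step x -> a*x + c is the 2x2 matrix [[a, c], [0, 1]] acting on (x, 1). All names (Affine, powerMap, seedThreads, ...) are made up for illustration.

```cuda
#include <cstdint>

// One affine map x -> a*x + c (mod 2^64), i.e. the matrix [[a, c], [0, 1]].
struct Affine { uint64_t a, c; };

// Single step of the toy LCG (assumption: Knuth's MMIX constants).
__host__ __device__ inline Affine lcgStep()
{
    return { 6364136223846793005ULL, 1442695040888963407ULL };
}

// Composition g(f(x)) = g.a*(f.a*x + f.c) + g.c; this is the matrix product.
__host__ __device__ inline Affine compose(Affine g, Affine f)
{
    return { g.a * f.a, g.a * f.c + g.c };
}

// Map for n steps via repeated squaring: O(log n) compositions.
__host__ __device__ inline Affine powerMap(Affine step, uint64_t n)
{
    Affine result = { 1, 0 };                     // identity map
    while (n) {
        if (n & 1) result = compose(step, result);
        step = compose(step, step);               // step^(2^k)
        n >>= 1;
    }
    return result;
}

__host__ __device__ inline uint64_t applyMap(Affine f, uint64_t x)
{
    return f.a * x + f.c;
}

// Hypothetical seeding kernel: thread id starts stride*id steps into the
// single stream, so the threads own non-overlapping blocks of length stride.
__global__ void seedThreads(uint64_t *states, uint64_t seed, uint64_t stride, int numThreads)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < numThreads)
        states[id] = applyMap(powerMap(lcgStep(), stride * (uint64_t)id), seed);
}
```

Even at O(log(N)), the jump costs dozens of multiplies per thread versus one multiply-add for a normal step, which is why the setup is worth amortizing.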

Compared to advancing the PRNG state by 1 step, this is typically quite expensive, so one would want to amortize the cost of determining the initial PRNG state for each thread by generating a fair number of random numbers per thread. You might want to experiment to see which number of threads gives the best performance, but I would shoot for hundreds of random numbers generated per thread.
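
One way to run that experiment is sketched below, assuming the states are initialized inside the generating kernel (so the curand_init cost is paid on every launch) and a fixed 1024x1024 output. The names initAndFill, timeConfig and sweep are made up, error checking is omitted, and the timings will depend entirely on your GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Each thread initializes its own state, then produces perThread values.
__global__ void initAndFill(float *map, unsigned long long seed, int numThreads, int perThread)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= numThreads) return;

    curandState local;
    curand_init(seed, id, 0, &local);             // the cost being amortized
    for (int i = 0; i < perThread; ++i)
        map[i * numThreads + id] = curand_uniform(&local);
}

// Returns the elapsed time in milliseconds for one launch.
float timeConfig(float *dMap, int numThreads, int perThread)
{
    const int block = 64;
    const int grid = (numThreads + block - 1) / block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    initAndFill<<<grid, block>>>(dMap, 1234ULL, numThreads, perThread);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Tries 1, 4, 16, ..., 4096 numbers per thread for the same total output.
void sweep()
{
    const int total = 1024 * 1024;
    float *dMap = nullptr;
    cudaMalloc(&dMap, total * sizeof(float));

    timeConfig(dMap, total, 1);                   // warm-up launch, result ignored
    for (int perThread = 1; perThread <= 4096; perThread *= 4) {
        int numThreads = total / perThread;
        float ms = timeConfig(dMap, numThreads, perThread);
        printf("%7d threads x %4d numbers: %.3f ms\n", numThreads, perThread, ms);
    }
    cudaFree(dMap);
}
```

With very few threads the GPU will be underoccupied, so expect the best time somewhere in the middle of the sweep rather than at either end.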