Hi Brian, well we both got it wrong! I am surprised no-one else posted a correction… (there are race conditions in both our posts - shows just how easy it is to make a mistake on this architecture!) I have come back to this just recently and have implemented a full set of library routines for a variable number of threads that are quite generally useful. I am waiting on a critique from Makoto Matsumoto before posting.
Cheers, Eric.
You’re definitely right. I saw your message, and I’m highly surprised that I haven’t seen problems during experimentation.
I look forward to seeing your code. I’ll try to benchmark it for you, unless you’ve got your hands on a G80 recently.
Brian
Hi Brian, the reason you have not seen problems is the __syncthreads() at the start (I was wondering why it was there - perhaps you put it in and it started working all of a sudden?). There are no instructions between the __syncthreads() and the problem area that would stall a warp (just shared memory accesses), so the hardware must be issuing warps after a __syncthreads() in numerically increasing order, which is the obvious thing to do. One cannot rely on this when the architecture spec says warps are issued in a non-deterministic order; this just happens to be the way the G80 works and might get broken on a G90.
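To illustrate the kind of race we were both guilty of (a minimal sketch only - the array size and update rule here are placeholders, not the actual MT code):

__shared__ unsigned int state[624];

__device__ unsigned int next(int tid)
{
    // Read a word that a *different* thread is responsible for overwriting.
    unsigned int y = state[(tid + 397) % 624];

    // Without this barrier a thread may overwrite its own word below before
    // another thread has read the old value - whether that happens depends
    // entirely on the order in which the scheduler issues the warps.
    __syncthreads();

    state[tid] = y ^ (y >> 1);
    return state[tid];
}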
If you would reply to my email I will send the code (not ready to post just yet).
Cheers, Eric
ed: it might also have something to do with the 312 threads which means a multiprocessor can only run one block at a time.
Here it is (!) - a multithreaded C version of the original Mersenne Twister RNG.
I have implemented 2 sets of routines here: a totally shared-memory set for those using lots of randoms (uses 2.5KB of shared memory), and a global-memory based set for more general use that only uses 2 words of static shared memory plus (number of threads + 1), up to 228 words, of auto-allocated shared memory. The globals version does 2 global reads and one global write per result. Initialisation is all single threaded as it is only called once. Since most people will be using fewer than 228 threads I have provided a version of each, mt19937s for shared and mt19937g for global, that is optimised for 1-227 threads. An updated version of the routine posted above is also included for reference; it is good for 227-312 threads.
It is a fundamental limitation of the algorithm that no more than 227 results may be calculated in parallel - 227 is 624 - 397, the distance to the point of feedback in the shift register. Still, this beats 1 for most algorithms!
The routines have been tested to produce bit-identical output to the original single-threaded C reference source for thread counts up to 1000, which covers all of the thread-count ranges above. Most of the work was in fact the verification, and that is where most of the value is. As has been demonstrated above, running on a G80 is no test of correctness in this new world of concurrent hardware-based threading with shared memory, and running under emulation is even less of one. I have used another technique that is more rigorous, and hence I am prepared to post without having yet run on any hardware. I know this sounds strange. I am waiting for the next release of CUDA; Brian had some problems and could not test.
Of most interest to me here is correctness. Please feel free to suggest any performance or implementation tweaks that occur to you while reading or testing, and report any problems! I believe the syncs used are necessary and sufficient for correct operation. Anyone who has time to run some benchmarks, please post the results. I expect the best total random production rate will be the mt19937s() routine running 192 or 224 threads and 32 blocks on an 8800GTX.
You will have to agree that this multithreaded C expression is quite readable in comparison to the C implementation of the new SFMT. I will be providing a covering writeup of these implementations for the original Mersenne Twister web site, for the final release.
The copyright and licence (anyone may use it for anything, for free) are, I believe, consistent with the terms and conditions of this forum. It cannot be reproduced without the original copyright notice for the single-threaded C reference, as this is, I believe, a derived work. While it carries the original authors' copyright, they have not tested it.
Eric
Enjoy :)
The web interface dropped the file with a “.cu” extension!
[attachment=3331:attachment]
ed: leaving this original post here for reference - see later for latest - only problem here was verification was done on one block & globals versions need indexing by blockIdx.x.
Thanks Eric, nice job !!!
If you could add code for benchmarking, it would be even better.
The version that returns 2 numbers at a time could easily be combined with a Box-Muller transformation to generate Gaussian random numbers.
Very handy for Monte Carlo option pricing.
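Roughly like this, for example (just a sketch - the function name and the conversion constants are mine, not from the posted code):

// Box-Muller on a pair of 32-bit uniform randoms.
__device__ void box_muller(unsigned int a, unsigned int b, float *z0, float *z1)
{
    // Map to (0,1]; the +1 keeps u1 away from zero so logf() is safe.
    float u1 = (a + 1.0f) * (1.0f / 4294967296.0f);
    float u2 = b * (1.0f / 4294967296.0f);
    float r  = sqrtf(-2.0f * logf(u1));
    float t  = 6.28318530718f * u2;          // 2*pi*u2
    *z0 = r * cosf(t);
    *z1 = r * sinf(t);
}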
Massimiliano
Thanks, Massimiliano. The only code you need is a for loop!
In the template sample: include my file before template_kernel.cu in template.cu, then modify testKernel() to invoke whichever initialiser you want, followed by a big for loop that calls the test routine, and work out performance from the reported timings. One could delete the rest of the code from testKernel() and back in runTest() in template.cu, but it probably does not consume much. This does not belong in a reference implementation!
Code to actually do a simple verification is more involved and should be packaged separately as an SDK sample? The problem is that it is so fast that most of the time would be spent copying data back to the host for verification.
Eric
ed: OK you need to fool the compiler not to optimise out the tempering. This works:
__shared__ uint j;

__global__ void
testMT()
{
    int i;

    mt19937si(TEST_KEY);
    for (i = 0; i < 100; ++i)
        j = mt19937s();    // storing to shared j stops the tempering being optimised out
}
ed: I do have an SDK sample project test harness for Linux to benchmark - email me if you want it - bit of a hack so I don’t want to post it yet
ed: no feedback yet - there has been a minor improvement in global versions to get all device mem refs in strict sequential order, cost 1 mod, no-one will say how much…
No comments yet, however I thought it worthwhile reposting for all the updates mentioned above:
- replace mod with a switchable macro for the G80, wherever possible, as it is so slow (see the sketch after this list)
- fix the global versions to do the best possible for coalescing - strict sequential device access in 1 instruction - unfortunately alignment can never be ensured as it changes all the time while running. Still not able to estimate the penalty here - awaiting responses in another thread (and others have brought up the same issue in separate threads).
- re-organise the long versions to save 5 registers in 0.8 beta, should be equivalent but makes a difference for some reason.
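The mod macro is along these lines (my own sketch of the idea rather than the exact text in the file - note it is only valid when the left operand is already less than twice the modulus; NVG80 is the switch name mentioned below):

#ifdef NVG80
// Integer % is very slow on the G80; a compare-and-subtract does the same
// job whenever x is known to be < 2*n.
#define MOD(x, n) ((x) < (n) ? (x) : (x) - (n))
#else
#define MOD(x, n) ((x) % (n))
#endif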
I have no feedback from takers of the SDK project so cannot post yet - though it has also been updated with a separate kernel for each test.
I have been told that this code does not compile under one of the pre-release toolsets due to an incompatibility between the compiler and assembler. Perhaps someone from Nvidia would care to test before the 1.0 release.
Thanks, Eric
ed: have done what I think is a final pass to reduce divergent code to a minimum and re-verified (except that I can't verify past 624 threads with the NVG80 mod patch switch on, and don't need > 512). This cost an additional smem location in the shared versions. I used constant mem to avoid the compiler problem above - it has a 2-way conflict that I expected would be about the same cost as a compare with zero (except the current compiler does not just do a compare & predicated xor as one would expect from the guide). Register use is also reduced further, <= 9 for all versions, so 100% occupancy is possible in 0.8:
ed: attachment removed - see next + 1 post
Pretty much done now: I have some feedback on the benchmark (running the code last posted):
Function Threads Blocks GPU_clocks Msamples/sec
Shared 1 1 742808 2
Shared 2 1 780344 3
Shared 32 1 1207104 32
Shared 224 1 2134804 127
Shared 192 48 6244886 1793
Shared 224 48 5075176 2574
Shared 192 64 5906718 2528
SharedL 1 1 1089384 1
SharedL 2 1 1127116 2
SharedL 32 1 1568296 25
SharedL 224 1 2494802 109
SharedL 512 16 6925714 1437
SharedL 384 32 8389778 1780
SharedL 256 48 8379398 1782
Global 1 1 1629040 1
Global 2 1 1657436 1
Global 32 1 2056432 19
Global 192 48 17189732 651
Global 224 48 17397454 751
Global 192 64 20116224 742
GlobalL 1 1 2008976 1
GlobalL 2 1 2079852 1
GlobalL 32 1 2708148 14
GlobalL 224 1 4340944 63
GlobalL 512 16 16001204 622
GlobalL 384 32 20548226 727
GlobalL 256 48 21099388 708
Wide 312 1 4475348 169
Wide 312 32 8184578 2964
Processing time: 227.964996 (ms)
This was run on pre-release tools, so these are new GPU clocks, and it looks like the shared mem alloc bug has been fixed as one can now make 100% occupancy @ 192 threads. So the fastest is the single-shot 312-thread version @ 81% occupancy. Perhaps you are better off running fewer, bigger blocks, even if this means you don't make 100% occupancy. Odd that the shared mem version at 192x48 (75% occupancy) has a longer overall run time than at 192x64 (100% occupancy) - must be a problem in the hardware scheduler or onboard executive? This is the code that was run on an 8800GTX:
ed: had a query and there was a typo in the test harness…
ed: I have deleted the attachment - see next post
Thought I should post an update for V1.0; I have also implemented simple verification testing since I can test now. There are lots of interesting performance issues - like some configurations being faster returning data in global memory than in shared memory - strange. An 8800 Ultra should get up to 5x10^9 MT randoms/sec here… (slightly better performance in 32-bit mode).
I am surprised no one commented on the fact that previous posts did not separate the global state memory for each block (in the global versions) - either no one has used them, or has not tested, or could not be bothered.
Perhaps someone will do SFMT for everyone. I do not need it.
This implementation generates a single sequence up to 512 threads wide that I believe is more correct to use than the MT in the SDK.
Eric
[attachment=3791:attachment]
ed: I replaced this attachment after finding elsewhere that long-term timing using the clock register was unreliable. Timing is now done with the host clock; otherwise no change to the reference library code.
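Host-clock timing of a launch amounts to something like this (a sketch only - the helper name is mine and it assumes the Linux SDK harness and the 1.0 runtime API):

#include <sys/time.h>
#include <cuda_runtime.h>

// Wall-clock seconds as seen from the host.
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// Usage around a kernel launch:
//   double t0 = wall_seconds();
//   testMT<<<blocks, threads>>>();
//   cudaThreadSynchronize();   // let the GPU finish before reading the clock again
//   double elapsed = wall_seconds() - t0;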
Hey all,
Just started playing with CUDA to eventually try to get my MC ray tracer ported. Now I wanted to try out the random number generators, so I downloaded your implementation and tested it. I’m using mt19937gi and mt19937gl, and I’m getting the random numbers repeated for each block. Looking at mt19937gi, it seems that each block will be initialized with the same seeds, so no wonder. What’s the proper way of initializing the generators when running several blocks?
Thanks,
/Patrik J.
It is methodologically incorrect to run the same Mersenne Twister on each parallel unit with just different seeds, since the outputs of the units are not guaranteed to be independent. Each parallel unit should have individual twister parameter values. The DCMT library, developed by the inventors of the Mersenne Twister themselves, is designed exactly for this purpose.
Ok… I don’t understand how that helps me… Are you saying that initializing different blocks with different seeds can give them the same state?
If you want to be picky, I should have said that the different blocks have the same STATE then. But my question was about HOW TO generate a correct initialization over several blocks. Since mt19937gl handles multiple blocks when generating, and there is a separate state vector for each block, I assume there’s also a way to init them correctly?
Well, as a special case, one of two different MT states can possibly be just an evolution of the other, and a priori we can hardly tell anything by just looking. However, in the general case "correlated" is not a synonym for "identical". To put it simply, the whole result (combined from the block results) can be "not very random" as a single sequence, with possible "similarities" between the subsequences formed by the block results - thus not very suitable for applications like Monte Carlo simulation. MT19937 and all the other reference sources from Matsumoto & Nishimura are single-threaded, so they don't suffer from such problems. In order to avoid this methodological difficulty, the DCMT library was used to spawn independent MT parameters for each thread in the MersenneTwister CUDA SDK sample. For more information please refer to the sample (which is ready to use) and the whitepaper.
I see what you are saying, but I still don’t understand how to do it. I looked at the MersenneTwister example, and it seems to just load the states from a precomputed file? In any case, in this code the state is kept on the stack so there’s no way to explicitly upload a state from the host. Are you saying that the code posted here is incorrect then?
(I should say that for now I’m not too concerned about the quality of the random numbers. I’m just exploring currently.)
You have to arrange to call mt19937gi with different seeds for each block, or call mt19937ga for more than 2^32 starting points in the series.
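For example, something along these lines (a rough sketch only - the scaling constant is illustrative and the calling conventions of mt19937gi()/mt19937g() are whatever is in the posted file; as in the earlier test kernel, every thread calls the initialiser):

__global__ void perBlockSeedKernel(unsigned int baseSeed, float *out, int n)
{
    // Fold blockIdx.x into the seed so no two blocks initialise the same state.
    mt19937gi(baseSeed + blockIdx.x);

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = mt19937g() * (1.0f / 4294967296.0f);   // uniform in [0,1)
}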
Yes, MT does have good independence between different parts of the series, but one should not use different parts of the same series in the same MC calculation. In my particular case this was all I needed, as each block was running an independent calculation.
Nvidia cheated in the SDK by precalculating the parameters for a set of small (short-period) MT series, then saving those parameters to a file and distributing those. Almost all the CPU time in the example is required for calculating the MT parameters. One can get better performance from larger Mersenne primes, but the calculation time using the DCMT library rapidly escalates (they estimate O(p^3)). My quick test showed that MT4253 was the smallest series that allows calculation of 64 results in parallel, which is the smallest useful block size on the G80 (this took 2 mins of CPU per set of parameters). State is only 2.3 words/thread, which is similar to MT19937. The only parameters that change between different MT series for a given prime are MATRIXA and the 2 tempering masks. Perhaps someone should publish a set of these for a reasonably sized prime.
Eric
ed: An interesting config is MT2203, which allows 32 results in parallel and a syncless implementation on the G80, but it will take 1 1/2 hrs of normal CPU to calculate the parameters for 100% occupancy (384 warps).
Well,… cheating is a bit too strong a word, IMO. The DCMT library is intended for such offline precalculation, simply because (just as you say) it's extremely time-consuming; saving MT configurations for future reuse is as "fraudulent" as precalculating coefficients for a polynomial approximation. Once the MT parameters are precalculated, the twisters can be initialized with arbitrary seeds (states) and run in parallel without correlation.
P.S. If you are interested, I can post the exact C program that generated the MersenneTwister.dat file.
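(Broadly, the offline step looks like the following - this is only a sketch from memory of the dcmt 0.4 API, not the actual program that produced the .dat file, so verify the function and struct field names against the dcmt headers.)

/* Offline search for independent MT parameter sets using dcmt. */
#include <stdio.h>
#include "dc.h"

int main(void)
{
    int id;
    /* older dcmt versions require an init_dc(seed) call before the search */
    for (id = 0; id < 32; id++) {
        /* 32-bit words, period 2^2203 - 1, one independent parameter set per id */
        mt_struct *mts = get_mt_parameter_id(32, 2203, id);
        if (mts == NULL) {
            fprintf(stderr, "parameter search failed for id %d\n", id);
            continue;
        }
        /* these per-twister values are what get dumped to the .dat file */
        printf("id %d: aaa=%08x maskB=%08x maskC=%08x\n",
               id, mts->aaa, mts->maskB, mts->maskC);
        free_mt_struct(mts);
    }
    return 0;
}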
In the SDK sample the states are initialized from the seed, but you can easily modify the sample to simply download the states from the host. The states are different from the parameters: though the latter are precalculated, the former are supplied by the user, so it's an RNG implementation in the true sense of the word.
If osiris still uses the same twister parameters for all threads/blocks, then it’s incorrect.
Yeah… but we want to see all the code that was used - that was the point! (You added that offer after I started writing.) I don't need to see it as I presume it is based on dcmt0.4. Perhaps it should have been in the SDK. Still, I like being able to see all the code I am using on one page - the SDK is a bit messy. I guess that in most applications the problem with using different parts of the same series is that you don't know how far apart the starting points are for each seed, and the real danger is repeating yourself if they are too close.
Just for interest I ran 1 set of parameters for each Mersenne prime, and I estimate the computation time is exponential in p, not O(p^3). I now have a set of params for MT19937 that does allow 256 results in parallel, which is of interest, except I am not using a G80 at present. MT19937 took 1 hour of CPU for 1 set.
Eric :)
Ok, I'm beginning to grasp what's going on, thanks for patiently trying to explain! I never looked into the details of MT19937, but am I correct in thinking that there are many different possible MT19937 RNGs? So the idea is to give each thread (or each block, I'm still hazy on this point…) an actually different RNG from the others, so that even if they have the same state, they give different numbers?
In osiris’s example, you seed each block, and there is only one state per block. Does that mean that all threads in a block are running the SAME RNG, cleverly implemented to be thread-safe? I’m still confused about this part.
And yes, I would be interested in the MT19937 parameters you calculated.