Hi Brian, well we both got it wrong! I am surprised no-one else posted a correction… (there are race conditions in both our posts - shows just how easy it is to make a mistake on this architecture!) I have come back to this just recently and have implemented a full set of library routines for a variable number of threads that are quite generally useful. I am waiting on a critique from Makoto Matsumoto before posting.
Cheers, Eric.
You’re definitely right. I saw your message, and I’m highly surprised that I haven’t seen problems during experimentation.
I look forward to seeing your code. I’ll try to benchmark it for you, unless you’ve got your hands on a G80 recently.
Brian
Hi Brian, the reason you have not seen problems is the __syncthreads() at the start (I was wondering why it was there - perhaps you put it in and it started working all of a sudden?). There are no instructions between the __syncthreads() and the problem area that would stall a warp (just shared memory accesses), so the hardware must be issuing warps after a __syncthreads() in numerically increasing order, which is the obvious thing to do. One cannot rely on this when the architecture spec says warps are issued in a non-deterministic order; this just happens to be the way the G80 works and might get broken on a G90.
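To illustrate the kind of race we were both guilty of (a minimal sketch only - the array size and update rule here are placeholders, not the actual MT code):

__shared__ unsigned int state[624];

__device__ unsigned int next(int tid)
{
    // Read a word that a *different* thread is responsible for overwriting.
    unsigned int y = state[(tid + 397) % 624];

    // Without this barrier a thread may overwrite its own word below before
    // another thread has read the old value - whether that happens depends
    // entirely on the order in which the scheduler issues the warps.
    __syncthreads();

    state[tid] = y ^ (y >> 1);
    return state[tid];
}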
If you would reply to my email I will send the code (not ready to post just yet).
Cheers, Eric
ed: it might also have something to do with the 312 threads which means a multiprocessor can only run one block at a time.
Here it is (!) - a multithreaded C version of the original Mersenne Twister RNG.
I have implemented 2 sets of routines here: a totally shared-memory set for those using lots of randoms (uses 2.5KB of shared memory), and a global-memory based set for more general use that only uses 2 words of static shared memory plus (number of threads + 1), up to 228 words, of auto-allocated shared memory. The globals version does 2 global reads and one global write per result. Initialisation is all single threaded as it is only called once. Since most people will be using fewer than 228 threads I have provided a version of each, mt19937s for shared and mt19937g for global, that is optimised for 1-227 threads. An updated version of the routine posted above is also included for reference; it is good for 227-312 threads.
It is a fundamental limitation of the algorithm that no more than 227 results may be calculated in parallel - 227 is 624 - 397, the distance to the point of feedback in the shift register. Still, this beats 1 for most algorithms!
The routines have been tested to produce bit-identical output to the original single-threaded C reference source for thread counts up to 1000, which covers all of the thread-count ranges above. Most of the work was in fact the verification, and that is where most of the value is. As has been demonstrated above, running on a G80 is no test of correctness in this new world of concurrent hardware-based threading with shared memory, and running under emulation is even less of one. I have used another technique that is more rigorous, and hence I am prepared to post without having yet run on any hardware. I know this sounds strange. I am waiting for the next release of CUDA; Brian had some problems and could not test.
Of most interest to me here is correctness. Please feel free to suggest any performance or implementation tweaks that occur to you while reading or testing, and report any problems! I believe the syncs used are necessary and sufficient for correct operation. Anyone who has time to run some benchmarks, please post the results. I expect the best total random production rate will be the mt19937s() routine running 192 or 224 threads and 32 blocks on an 8800GTX.
You will have to agree that this multithreaded C expression is quite readable in comparison to the C implementation of the new SFMT. I will be providing a covering writeup of these implementations for the original Mersenne Twister web site, for the final release.
The copyright and licence (anyone may use it for anything, for free) are, I believe, consistent with the terms and conditions of this forum. It cannot be reproduced without the original copyright notice for the single-threaded C reference, as this is, I believe, a derived work. While it carries the original authors' copyright, they have not tested it.
Eric
Enjoy :)
The web interface dropped the file with a “.cu” extension!
[attachment=3331:attachment]
ed: leaving this original post here for reference - see later for latest - only problem here was verification was done on one block & globals versions need indexing by blockIdx.x.
Thanks Eric, nice job !!!
If you could add code for benchmarking, it would be even better.
The version that returns 2 numbers at a time could easily be combined with a Box-Muller transformation to generate Gaussian random numbers.
Very handy for Monte Carlo option pricing.
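Roughly like this, for example (just a sketch - the function name and the conversion constants are mine, not from the posted code):

// Box-Muller on a pair of 32-bit uniform randoms.
__device__ void box_muller(unsigned int a, unsigned int b, float *z0, float *z1)
{
    // Map to (0,1]; the +1 keeps u1 away from zero so logf() is safe.
    float u1 = (a + 1.0f) * (1.0f / 4294967296.0f);
    float u2 = b * (1.0f / 4294967296.0f);
    float r  = sqrtf(-2.0f * logf(u1));
    float t  = 6.28318530718f * u2;          // 2*pi*u2
    *z0 = r * cosf(t);
    *z1 = r * sinf(t);
}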
Massimiliano
Thanks, Massimiliano. The only code you need is a for loop!
In the template sample: include my file before template_kernel.cu in template.cu, then modify testKernel() to invoke whichever initialiser you want, followed by a big for loop that calls the test routine, and work out performance from the reported timings. One could delete the rest of the code from testKernel() and back in runTest() in template.cu, but it probably does not consume much. This does not belong in a reference implementation!
Code to actually do a simple verification is more involved and should be packaged separately as an SDK sample? The problem is that it is so fast that most of the time would be spent copying data back to the host for verification.
Eric
ed: OK you need to fool the compiler not to optimise out the tempering. This works:
__shared__ uint j;

__global__ void
testMT()
{
    int i;

    mt19937si(TEST_KEY);
    for (i = 0; i < 100; ++i)
        j = mt19937s();    // storing to shared j stops the tempering being optimised out
}
ed: I do have an SDK sample project test harness for Linux to benchmark - email me if you want it - bit of a hack so I don’t want to post it yet
ed: no feedback yet - there has been a minor improvement in global versions to get all device mem refs in strict sequential order, cost 1 mod, no-one will say how much…
No comments yet, however I thought it worthwhile reposting for all the updates mentioned above:
- replace mod with a switchable macro for the G80, wherever possible, as it is so slow (see the sketch after this list)
- fix the global versions to do the best possible for coalescing - strict sequential device access in 1 instruction - unfortunately alignment can never be ensured as it changes all the time while running. Still not able to estimate the penalty here - awaiting responses in another thread (and others have brought up the same issue in separate threads).
- re-organise the long versions to save 5 registers in 0.8 beta, should be equivalent but makes a difference for some reason.
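The mod macro is along these lines (my own sketch of the idea rather than the exact text in the file - note it is only valid when the left operand is already less than twice the modulus; NVG80 is the switch name mentioned below):

#ifdef NVG80
// Integer % is very slow on the G80; a compare-and-subtract does the same
// job whenever x is known to be < 2*n.
#define MOD(x, n) ((x) < (n) ? (x) : (x) - (n))
#else
#define MOD(x, n) ((x) % (n))
#endif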
I have no feedback from takers of the SDK project so cannot post yet - though it has also been updated with a separate kernel for each test.
I have been told that this code does not compile under one of the pre-release toolsets due to an incompatibility between the compiler and assembler. Perhaps someone from Nvidia would care to test before the 1.0 release.
Thanks, Eric
ed: have done what I think is a final pass to reduce divergent code to a minimum and re-verified (except that I can't verify past 624 threads with the NVG80 mod patch switch on, and don't need > 512). This cost an additional smem location in the shared versions. I used constant mem to avoid the compiler problem above - it has a 2-way conflict that I expected would be about the same cost as a compare with zero (except the current compiler does not just do a compare & predicated xor as one would expect from the guide). Register use is also reduced further, <= 9 for all versions, so 100% occupancy is possible in 0.8:
ed: attachment removed - see next + 1 post
Pretty much done now: I have some feedback on the benchmark (running the code last posted):
Function Threads Blocks GPU_clocks Msamples/sec
Shared 1 1 742808 2
Shared 2 1 780344 3
Shared 32 1 1207104 32
Shared 224 1 2134804 127
Shared 192 48 6244886 1793
Shared 224 48 5075176 2574
Shared 192 64 5906718 2528
SharedL 1 1 1089384 1
SharedL 2 1 1127116 2
SharedL 32 1 1568296 25
SharedL 224 1 2494802 109
SharedL 512 16 6925714 1437
SharedL 384 32 8389778 1780
SharedL 256 48 8379398 1782
Global 1 1 1629040 1
Global 2 1 1657436 1
Global 32 1 2056432 19
Global 192 48 17189732 651
Global 224 48 17397454 751
Global 192 64 20116224 742
GlobalL 1 1 2008976 1
GlobalL 2 1 2079852 1
GlobalL 32 1 2708148 14
GlobalL 224 1 4340944 63
GlobalL 512 16 16001204 622
GlobalL 384 32 20548226 727
GlobalL 256 48 21099388 708
Wide 312 1 4475348 169
Wide 312 32 8184578 2964
Processing time: 227.964996 (ms)
This was run on pre-release tools, so these are new GPU clocks, and it looks like the shared mem alloc bug has been fixed as one can now make 100% occupancy @ 192 threads. So the fastest is the single-shot 312-thread version @ 81% occupancy. Perhaps you are better off running fewer, bigger blocks, even if this means you don't make 100% occupancy. Odd that the shared mem version at 192x48 (75% occupancy) has a longer overall run time than at 192x64 (100% occupancy) - must be a problem in the hardware scheduler or onboard executive? This is the code that was run on an 8800GTX:
ed: had a query and there was a typo in the test harness…
ed: I have deleted the attachment - see next post
Thought I should post an update for V1.0; I have also implemented simple verification testing since I can test now. There are lots of interesting performance issues - like some configurations being faster returning data in global memory than in shared memory - strange. An 8800 Ultra should get up to 5x10^9 MT randoms/sec here… (slightly better performance in 32-bit mode).
I am surprised no one commented on the fact that previous posts did not separate the global state memory for each block (in the global versions) - either no one has used them, or has not tested, or could not be bothered.
Perhaps someone will do SFMT for everyone. I do not need it.
This implementation generates a single sequence up to 512 threads wide that I believe is more correct to use than the MT in the SDK.
Eric
[attachment=3791:attachment]
ed: I replaced this attachment after finding elsewhere that long-term timing using the clock register was unreliable. Timing is now done with the host clock; otherwise no change to the reference library code.
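Host-clock timing of a launch amounts to something like this (a sketch only - the helper name is mine and it assumes the Linux SDK harness and the 1.0 runtime API):

#include <sys/time.h>
#include <cuda_runtime.h>

// Wall-clock seconds as seen from the host.
static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// Usage around a kernel launch:
//   double t0 = wall_seconds();
//   testMT<<<blocks, threads>>>();
//   cudaThreadSynchronize();   // let the GPU finish before reading the clock again
//   double elapsed = wall_seconds() - t0;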
Hey all,
Just started playing with CUDA to eventually try to get my MC ray tracer ported. Now I wanted to try out the random number generators, so I downloaded your implementation and tested it. I’m using mt19937gi and mt19937gl, and I’m getting the random numbers repeated for each block. Looking at mt19937gi, it seems that each block will be initialized with the same seeds, so no wonder. What’s the proper way of initializing the generators when running several blocks?
Thanks,
/Patrik J.
It is methodologically incorrect to run the same Mersenne Twister on each parallel unit with just different seeds, since the outputs of the units are not guaranteed to be independent. Each parallel unit should have individual twister parameter values. The DCMT library, developed by the inventors of the Mersenne Twister themselves, is designed exactly for this purpose.
Ok… I don’t understand how that helps me… Are you saying that initializing different blocks with different seeds can give them the same state?
If you want to be picky, I should have said that the different blocks have the same STATE then. But my question was about HOW TO generate a correct initialization over several blocks. Since mt19937gl handles multiple blocks when generating, and there is a separate state vector for each block, I assume there’s also a way to init them correctly?
Well, as a special case, one of two different MT states can possibly be just an evolution of the other, and a priori we can hardly tell anything by just looking. However, in the general case "correlated" is not a synonym for "identical". To put it simply, the whole result (combined from the block results) can be "not very random" as a single sequence, with possible "similarities" between the subsequences formed by the block results - thus not very suitable for applications like Monte Carlo simulation. MT19937 and all the other reference sources from Matsumoto & Nishimura are single-threaded, so they don't suffer from such problems. In order to avoid this methodological difficulty, the DCMT library was used to spawn independent MT parameters for each thread in the MersenneTwister CUDA SDK sample. For more information please refer to the sample (which is ready to use) and the whitepaper.
I see what you are saying, but I still don’t understand how to do it. I looked at the MersenneTwister example, and it seems to just load the states from a precomputed file? In any case, in this code the state is kept on the stack so there’s no way to explicitly upload a state from the host. Are you saying that the code posted here is incorrect then?
(I should say that for now I’m not too concerned about the quality of the random numbers. I’m just exploring currently.)
You have to arrange to call mt19937gi with different seeds for each block, or call mt19937ga for more than 2^32 starting points in the series.
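For example, something along these lines (a rough sketch only - the scaling constant is illustrative and the calling conventions of mt19937gi()/mt19937g() are whatever is in the posted file; as in the earlier test kernel, every thread calls the initialiser):

__global__ void perBlockSeedKernel(unsigned int baseSeed, float *out, int n)
{
    // Fold blockIdx.x into the seed so no two blocks initialise the same state.
    mt19937gi(baseSeed + blockIdx.x);

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = mt19937g() * (1.0f / 4294967296.0f);   // uniform in [0,1)
}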
Yes, MT does have good independence between different parts of the series, but one should not use different parts of the same series in the same MC calculation. In my particular case this was all I needed, as each block was running an independent calculation.
Nvidia cheated in the SDK by precalculating the parameters for a set of small (short-period) MT series, then saving those parameters to a file and distributing those. Almost all the CPU time in the example is required for calculating the MT parameters. One can get better performance from larger Mersenne primes, but the calculation time using the DCMT library rapidly escalates (they estimate O(p^3)). My quick test showed that MT4253 was the smallest series that allows calculation of 64 results in parallel, which is the smallest useful block size on the G80 (this took 2 mins of CPU per set of parameters). State is only 2.3 words/thread, which is similar to MT19937. The only parameters that change between different MT series for a given prime are MATRIXA and the 2 tempering masks. Perhaps someone should publish a set of these for a reasonably sized prime.
Eric
ed: An interesting config is MT2203, which allows 32 results in parallel and a syncless implementation on the G80, but it will take 1 1/2 hrs of normal CPU to calculate the parameters for 100% occupancy (384 warps).
Well,… cheating is a bit too strong a word, IMO. The DCMT library is intended for such offline precalculation, simply because (just as you say) it's extremely time-consuming; saving MT configurations for future reuse is as "fraudulent" as precalculating coefficients for a polynomial approximation. Once the MT parameters are precalculated, the twisters can be initialized with arbitrary seeds (states) and run in parallel without correlation.
P.S. If you are interested, I can post the exact C program that generated the MersenneTwister.dat file.
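(Broadly, the offline step looks like the following - this is only a sketch from memory of the dcmt 0.4 API, not the actual program that produced the .dat file, so verify the function and struct field names against the dcmt headers.)

/* Offline search for independent MT parameter sets using dcmt. */
#include <stdio.h>
#include "dc.h"

int main(void)
{
    int id;
    /* older dcmt versions require an init_dc(seed) call before the search */
    for (id = 0; id < 32; id++) {
        /* 32-bit words, period 2^2203 - 1, one independent parameter set per id */
        mt_struct *mts = get_mt_parameter_id(32, 2203, id);
        if (mts == NULL) {
            fprintf(stderr, "parameter search failed for id %d\n", id);
            continue;
        }
        /* these per-twister values are what get dumped to the .dat file */
        printf("id %d: aaa=%08x maskB=%08x maskC=%08x\n",
               id, mts->aaa, mts->maskB, mts->maskC);
        free_mt_struct(mts);
    }
    return 0;
}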
In the SDK sample the states are initialized from the seed, but you can easily modify the sample to simply download the states from the host. The states are different from the parameters: though the latter are precalculated, the former are supplied by the user, so it's an RNG implementation in the true sense of the word.
If osiris still uses the same twister parameters for all threads/blocks, then it’s incorrect.
Yeah… but we want to see all the code that was used - that was the point! (You added that offer after I started writing.) I don't need to see it as I presume it is based on dcmt0.4. Perhaps it should have been in the SDK. Still, I like being able to see all the code I am using on one page - the SDK is a bit messy. I guess that in most applications the problem with using different parts of the same series is that you don't know how far apart the starting points are for each seed, and the real danger is repeating yourself if they are too close.
Just for interest I ran 1 set of parameters for each Mersenne prime, and I estimate the computation time is exponential in p, not O(p^3). I now have a set of params for MT19937 that does allow 256 results in parallel, which is of interest, except I am not using a G80 at present. MT19937 took 1 hour of CPU for 1 set.
Eric :)
Ok, I'm beginning to grasp what's going on, thanks for patiently trying to explain! I never looked into the details of MT19937, but am I correct in thinking that there are many different possible MT19937 RNGs? So the idea is to give each thread (or each block, I'm still hazy on this point…) an actually different RNG from the others, so that even if they have the same state, they give different numbers?
In osiris’s example, you seed each block, and there is only one state per block. Does that mean that all threads in a block are running the SAME RNG, cleverly implemented to be thread-safe? I’m still confused about this part.
And yes, I would be interested in the MT19937 parameters you calculated.