Mersenne Twister Multithread

I’m interested in the mersenne twister RNG. I’d like to compare the sdk implementation with a multithreading cpu implementation (i want to use all the power of my quad core).
Do you know a ready-to-use mersenne twister rng optimized for multicore processors? Otherwise i’ll code by myself…
Thank you.

Do you think this library contains the implementation I’m looking for?

The code by Matsumoto and Saito (the authors!) is actually pretty efficient.
The newer SFMT is not quite the same as MT but is likely superior especially for the CPU.…SFMT/index.html

MT’s huge state size makes it rather inelegant to use on the GPU. Numerical quality is impeccable though!

Is SFMT a true multi-thread implementation (very good for my quad-core) or just MMX/SSEn optimized?

MT doesn’t lend itself to multithreaded evaluation very well for many reasons.
The best compromise is to run N MT’s in parallel.
Matsumoto-san made algorithms for making sure each of the MTs are independent.

It’s hard to parallelize a single MT stream since its state is iterative.

I have my own GPU PRNG design for parallel evaluation (on CPU as well) but I need to polish it up before I submit it to a journal. I use the code daily but the paper itself needs work and some pre-peer review.

So… in the SDK the results of two implementations are compared: the first runs on the GPU, the second on the CPU, right?

The CPU implementation is about 100 times slower than the GPU one, but… which version of the algorithm is used on the CPU? N MT’s in parallel or one purely sequential MT?

Thank you for your help.

I believe the SDK example just uses a single CPU thread. You wouldn’t be too wrong in assuming that 4 cores could compute 4 times as quickly if they each have their own generators.

Also be careful saying “faster” here. It’s hard to time PRNGs. Usually they are very very very fast to evaluate, and in fact the memory speed may matter most just to write the answers into memory.
For example, the similar tasks of “generate a list of 1M PR values” and “find the sum of 1M PR values” may have very different timing behaviors.

Most apps are not affected much by evaluation speed of the PRNG. What’s usually more important is quality, state size, seeding quality, register use, code size and portability, and for parallel computing, leapfrog and jump ability.

I added the measurement of the duration of the CPU implementation in the SDK code. Look at this code:

clock_t initial_CPU, final_CPU;

initial_CPU = clock();

RandomRef(h_RandCPU, N_PER_RNG, SEED);

final_CPU = clock();

float cycles = final_CPU - initial_CPU,

			   cpuTime = (float)cycles / (float)CLOCKS_PER_SEC * 1000;

printf("Generated samples (CPU) : %i \n", RAND_N);

printf("RandomRef() (CPU) time  : %f \n", cpuTime);

printf("Samples per second (CPU) : %E \n", RAND_N / (cpuTime * 0.001));

printf("RNG: Time ratio CPU_Time / GPU_Time : %f\n", cpuTime/gpuTime);

And then, here is the output whene I run the two implementation:

Initializing data for 24000000 samples...

Loading CPU and GPU twisters configurations...

Generating random numbers on GPU...

Generated samples : 24002560

RandomGPU() time  : 8.242716

Samples per second: 2.911972E+009

Applying Box-Muller transformation on GPU...

Transformed samples : 24002560

BoxMullerGPU() time : 4.496370

Samples per second  : 5.338209E+009

Reading back the results...

Checking GPU results...

...generating random numbers on CPU using reference generator

Generated samples (CPU) : 24002560

RandomRef() (CPU) time  : 453.000000

Samples per second (CPU) : 5.298578E+007

RNG: Time ratio CPU_Time / GPU_Time : 100.747940

...applying Box-Muller transformation on CPU

Transformed samples (CPU) : 24002560

BoxMullerRef() (CPU) time : 1844.000000

Samples per second (CPU) : 1.301657E+007

BoxMuller: Time ratio CPU_Time / GPU_Time : 410.108613

...comparing the results

Max absolute error: 7.186085E-006

L1 norm: 1.785960E-007


Shutting down...

What do you think about this comparison?

Consider I’m using a quad-core intel Q9300 and a dedicated Nvidia GTX 280 (with 240 stream processors).

I was actually reading up on this the other day, but hadn’t really had a chance to look into it much more…does anyone know if the Mersenne Twister SDK project uses the naive algorithm? The article I read mentioned that it is a ‘weak’ algorithm because once you have a certain number of output values, it is deterministic, and you can predict the rest of the outputs from there on out.

I understand this is why SFMT is preferable (besides the fact that it is even a faster algorithm). If the SDK hasn’t implemented SFMT yet, I might try to at some point. This way, the GPU would be more useful for cryptography-type apps, or even some scientific/numerical code that relies on having a strong RNG.

For my specific application, I don’t need a cryptography-safe RNG, but one for montecarlo simulations. Anyway Mersenne Twister is known not to be suited for criptograpgy-type apps.

hi guys!

I am also very interested in mersenne twister on cuda.

I have got some statistic problems with the one proposed by victor.

Actually, it can not pass the jarcque barra test at all, using box muller transformation.

Looking a little bit more in the results of the U(0.1) random numbers, you can see that there are not really following a U(0.1) law.

Does some of you have any ideas how to improve the results?



Have you had any luckk with this? I also am seeing problems. The numbers from the new Mersenne Twister (GPU) do not match the old MersenneTwister (CPU). And somewhere down the line, the numbers just flatline to zero.

Jonathan Scott