Bit Generation with the MTGP32 generator

Hello,
I would like to generate pseudorandom numbers using the Mersenne Twister algorithm on my GeForce 820M (CUDA release 5.5). I tried the cuRAND functions, but they don’t work, so I implemented the algorithm from scratch myself. Now I have two questions for you.

First: my algorithm works very well as long as I launch only one block. The reason is that blocks are not synchronized. In fact, I tried running the program with all the generators initialized with the same seed, producing one number per thread. The result is that every thread within a given block produces the same number; in another block the threads again all agree with each other, but their number differs from the one produced by the threads of the first block.
Can I synchronize blocks?
This is probably a hopeless question; in fact, I’ve read online that blocks must work independently.
Can you please explain why?
My GeForce 820M is probably a bit old, so please let me know if newer releases provide methods for this.

Second: is there any example of bit generation with the MTGP32 generator?
I am having trouble initializing mtgp32_params_fast.

Please don’t link me to the Hiroshima University implementation.
I’m not a professional; I’m looking for one of the two solutions above.
I hope my questions are not a waste of time.

Thanks a lot,
Andrea

Don’t work how? Is there a particular reason you are using such an ancient version of CUDA? The GeForce 820M is a device with compute capability 2.1, which is supported up to and including CUDA 8; CUDA 8 was only recently superseded by CUDA 9.

As you noted, in the CUDA execution model each thread block is indeed designed to execute independently of all other thread blocks. CURAND therefore provides PRNGs that generate multiple independent streams of random numbers. If your application requires inter-block synchronization, it is either not suitable for the GPU or (more likely, IMHO) your specific mapping of the task to the GPU is unsuitable. The fact that thread blocks execute independently provides for (1) an efficient hardware implementation and (2) effortless scalability from the smallest GPUs (e.g., those integrated into Tegra) to top-of-the-line professional solutions like Tesla.
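
To make the independent-streams point concrete, here is a minimal sketch (error checking omitted; it uses cuRAND’s default XORWOW generator rather than Mersenne Twister, purely for illustration). Every thread is initialized with the same seed but a distinct sequence number, so each thread draws from its own statistically independent subsequence and no inter-block synchronization is ever needed. The block/grid sizes and the seed are arbitrary choices for this sketch.

```
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void setup_kernel(curandState *state)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* Same seed for every thread, but a distinct sequence number:
       each thread gets its own independent subsequence, so threads
       never need to coordinate across blocks. */
    curand_init(1234ULL, id, 0, &state[id]);
}

__global__ void generate_kernel(curandState *state, unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    curandState localState = state[id]; /* work on a register copy    */
    result[id] = curand(&localState);   /* 32 random bits             */
    state[id] = localState;             /* save state for later calls */
}

int main(void)
{
    const int nBlocks = 64, nThreads = 64, n = nBlocks * nThreads;
    curandState *devStates;
    unsigned int *devResults;

    cudaMalloc((void **)&devStates, n * sizeof(curandState));
    cudaMalloc((void **)&devResults, n * sizeof(unsigned int));

    setup_kernel<<<nBlocks, nThreads>>>(devStates);
    generate_kernel<<<nBlocks, nThreads>>>(devStates, devResults);
    cudaDeviceSynchronize();

    cudaFree(devResults);
    cudaFree(devStates);
    return 0;
}
```

Keeping each thread’s state in global memory and copying it to a local variable inside the kernel is the pattern the cuRAND documentation recommends for efficiency.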
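
Regarding your second question: the sketch below is patterned after the MTGP32 device-API example in the cuRAND documentation. The key point for your mtgp32_params_fast trouble is that you don’t fill in those structures yourself; you include the pre-generated parameter table mtgp32dc_params_fast_11213 from curand_mtgp32dc_p_11213.h and pass it to the two setup calls. Again, the launch configuration and seed here are arbitrary illustrative choices, and error checking is omitted.

```
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>      /* host-side MTGP32 setup helpers */
#include <curand_mtgp32dc_p_11213.h> /* pre-generated parameter table  */

__global__ void generate_kernel(curandStateMtgp32 *state, unsigned int *result)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    /* One MTGP32 state is shared by the whole block; up to 256 threads
       per block may call curand() on it together, and each thread
       receives a different 32-bit value. */
    result[id] = curand(&state[blockIdx.x]);
}

int main(void)
{
    const int nBlocks  = 64;  /* at most 200 states per parameter set     */
    const int nThreads = 256; /* at most 256 threads may share one state  */
    curandStateMtgp32 *devStates;
    mtgp32_kernel_params *devKernelParams;
    unsigned int *devResults;

    cudaMalloc((void **)&devStates, nBlocks * sizeof(curandStateMtgp32));
    cudaMalloc((void **)&devKernelParams, sizeof(mtgp32_kernel_params));
    cudaMalloc((void **)&devResults, nBlocks * nThreads * sizeof(unsigned int));

    /* Copy the pre-generated parameter sets to the device; this takes
       the place of initializing mtgp32_params_fast by hand. */
    curandMakeMTGP32Constants(mtgp32dc_params_fast_11213, devKernelParams);

    /* Initialize one generator state per block from those parameters. */
    curandMakeMTGP32KernelState(devStates, mtgp32dc_params_fast_11213,
                                devKernelParams, nBlocks, 1234ULL);

    generate_kernel<<<nBlocks, nThreads>>>(devStates, devResults);
    cudaDeviceSynchronize();

    cudaFree(devResults);
    cudaFree(devKernelParams);
    cudaFree(devStates);
    return 0;
}
```

Note the documented limits: curandMakeMTGP32KernelState() can initialize at most 200 states per call with this parameter set, and at most 256 threads per block may share a single state.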