Hi everyone,
I am pretty new to CUDA programming, but I need to build expertise in CUDA and GPU architecture.
These days I am working on random number generation with the Mersenne Twister. I have some questions about this implementation.

How many random numbers does it generate by default? Initially I thought it was 4096, so I commented out the Box-Muller step and printed the array returned from the GPU. There were over 100000 pseudorandom numbers (PRNs) in it. Please comment on this; I would really appreciate it.

Why is RandomGPU<<<32, 128>>>(d_Rand, N_PER_RNG); called inside a for loop? It gets executed 10 times.

How can I generate a desired number of random numbers with this? I need to measure the running time for various configurations.
For example,
Number of PRNs    Time
4096              t1
24096             t2
34096             t3
Thank you.