Mersenne Twister on Multiple GPUs

I have ported the Mersenne Twister SDK sample to multiple GPUs, and I notice a huge performance difference compared with CPUs. I find it difficult to test the quality of the numbers produced, so I use a visual analysis of the output to make the testing easier. Following is my graph of Mersenne Twister on 1 and 2 GPUs vs. 24 CPUs. To parallelize the CPU implementation, I initialize the Mersenne Twister in each thread by calling:

init_genrand(time(NULL) ^ omp_get_thread_num());

…so that independent streams can be generated. The plot shows the huge difference I am getting, and I am trying to find out whether some accidental "cheating" in my code is causing it.
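One thing worth checking on the CPU side: if the code is the reference mt19937ar.c (an assumption, based on the name init_genrand), the generator state lives in file-scope variables, so every thread that calls init_genrand() reseeds the same shared state instead of getting its own stream. Below is a minimal sketch of a per-stream variant. The names mt_state, mt_seed, mt_next and fill_stream are hypothetical and not part of the reference code or the SDK. Note that distinct seeds give distinct streams but do not by themselves guarantee statistical independence.

```c
#include <stdint.h>
#include <stddef.h>

#define MT_N 624
#define MT_M 397

/* Hypothetical per-stream state: each thread owns one of these
   instead of sharing file-scope state. */
typedef struct {
    uint32_t mt[MT_N];
    int idx;
} mt_state;

/* Same seeding recurrence as the reference init_genrand(). */
static void mt_seed(mt_state *s, uint32_t seed)
{
    s->mt[0] = seed;
    for (int i = 1; i < MT_N; i++)
        s->mt[i] = 1812433253u * (s->mt[i - 1] ^ (s->mt[i - 1] >> 30))
                   + (uint32_t)i;
    s->idx = MT_N;  /* force a state refresh on the first draw */
}

/* Return the next 32-bit output of MT19937 for this stream. */
static uint32_t mt_next(mt_state *s)
{
    if (s->idx >= MT_N) {
        /* Regenerate all 624 words in place. */
        for (int k = 0; k < MT_N; k++) {
            uint32_t y = (s->mt[k] & 0x80000000u)
                       | (s->mt[(k + 1) % MT_N] & 0x7fffffffu);
            s->mt[k] = s->mt[(k + MT_M) % MT_N] ^ (y >> 1)
                     ^ ((y & 1u) ? 0x9908b0dfu : 0u);
        }
        s->idx = 0;
    }
    uint32_t y = s->mt[s->idx++];
    /* Tempering */
    y ^= y >> 11;
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}

/* Fill out[0..n-1] from a private generator seeded with
   base_seed ^ stream_id, mirroring the seeding used in the post. */
void fill_stream(uint32_t base_seed, uint32_t stream_id,
                 uint32_t *out, size_t n)
{
    mt_state s;
    mt_seed(&s, base_seed ^ stream_id);
    for (size_t i = 0; i < n; i++)
        out[i] = mt_next(&s);
}
```

Inside an OpenMP parallel region, each thread could call fill_stream((uint32_t)time(NULL), (uint32_t)omp_get_thread_num(), buf, n) on its own buffer, with the time value read once before the region so all threads share the same base seed. Because the state is local to each call, no two threads touch the same generator.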