Unresolved Problem with built-in Mersenne Twister (MTGP32) RNG on CUDA 5.0

I am looking for some help on a most curious and frustrating problem. First, some facts.

  1. Tesla C2050
  2. Microsoft Visual C++ 2010 Express on a Windows 7 OS
  3. CUDA 5.0

The problem concerns the MTGP32 random number generator provided by the CURAND library in CUDA 5.0. It does not occur with the XORWOW RNG.

Problem: The RNG fails to produce a 3rd random number in a loop whose conditional statement evaluates the random number.

The example code below demonstrates this problem. In the kernel, each thread continues to generate a random number between 0 & 1 (inclusively) until the number is over 0.5. Some threads produce a valid number on their 1st try; others succeed on their 2nd try. The problem occurs for threads which fail twice; on their 3rd and all subsequent iterations through the loop, they do not produce a new random number (ie update the state of the RNG). Based on the terribly confusing CURAND manual (http://docs.nvidia.com/cuda/curand/index.html#topic_1_3_1), my suspicion lies with thread divergence. This RNG may only work when all threads in the block call the curand device function.

Below, I have included code, in its entirety, which produces this problem. The seed for the RNG and one “rouge” thread have been pre-set. Please feel free to download & manipulate the code. The output from the code follows.

#include <stdio.h>
#include <iostream>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>
#include <curand_mtgp32dc_p_11213.h>

using namespace std;

#define NBLOCKS 100
#define NTHREADS 128

__global__ void ProblemKernel(curandStateMtgp32 *States) {
	int tid = threadIdx.x + blockIdx.x*blockDim.x;
	float ranNum = 0.0f;

	for(int i=0; (i<10) && (ranNum < 0.5f); i++) {
		ranNum = curand(&States[blockIdx.x])*(1.0/4294967295.0);

		if(tid==14) {	printf("i = %d\t%g\n", i, ranNum);	}
	}
}

int main (void) {
	unsigned long long seedA = 3080548152;
	curandStateMtgp32 *devStates;
	mtgp32_kernel_params *devKernelParams;

	cudaMalloc( (void **) &devStates, NBLOCKS*sizeof(curandStateMtgp32) );
	cudaMalloc( (void **) &devKernelParams, sizeof(mtgp32_kernel_params) );

	curandMakeMTGP32Constants( mtgp32dc_params_fast_11213, devKernelParams );
	curandMakeMTGP32KernelState( devStates, mtgp32dc_params_fast_11213, devKernelParams, NBLOCKS, seedA );

	ProblemKernel<<<NBLOCKS, NTHREADS>>>(devStates);
	cudaDeviceSynchronize();

	cudaFree(devStates);
	cudaFree(devKernelParams);

	cout << endl;
	system("pause");
	return 0;
}

The Output

i = 0   0.23747
i = 1   0.167179
i = 2   0.167179
i = 3   0.167179
i = 4   0.167179
i = 5   0.167179
i = 6   0.167179
i = 7   0.167179
i = 8   0.167179
i = 9   0.167179

Press any key to continue . . .