I am looking for some help on a most curious and frustrating problem. First, some facts.
- Tesla C2050
- Microsoft Visual C++ 2010 Express on a Windows 7 OS
- CUDA 5.0
The problem concerns the MTGP32 random number generator provided by the CURAND library in CUDA 5.0. It does not occur with the XORWOW RNG.
Problem: The RNG fails to produce a 3rd random number in a loop whose conditional statement evaluates the random number.
The example code below demonstrates this problem. In the kernel, each thread continues to generate a random number between 0 & 1 (inclusively) until the number is over 0.5. Some threads produce a valid number on their 1st try; others succeed on their 2nd try. The problem occurs for threads which fail twice; on their 3rd and all subsequent iterations through the loop, they do not produce a new random number (ie update the state of the RNG). Based on the terribly confusing CURAND manual (http://docs.nvidia.com/cuda/curand/index.html#topic_1_3_1), my suspicion lies with thread divergence. This RNG may only work when all threads in the block call the curand device function.
Below, I have included code, in its entirety, which produces this problem. The seed for the RNG and one “rouge” thread have been pre-set. Please feel free to download & manipulate the code. The output from the code follows.
#include <stdio.h>
#include <iostream>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand_mtgp32_host.h>
#include <curand_mtgp32dc_p_11213.h>
using namespace std;
#define NBLOCKS 100
#define NTHREADS 128
__global__ void ProblemKernel(curandStateMtgp32 *States) {
int tid = threadIdx.x + blockIdx.x*blockDim.x;
float ranNum = 0.0f;
for(int i=0; (i<10) && (ranNum < 0.5f); i++) {
ranNum = curand(&States[blockIdx.x])*(1.0/4294967295.0);
if(tid==14) { printf("i = %d\t%g\n", i, ranNum); }
}
}
int main (void) {
unsigned long long seedA = 3080548152;
curandStateMtgp32 *devStates;
mtgp32_kernel_params *devKernelParams;
cudaMalloc( (void **) &devStates, NBLOCKS*sizeof(curandStateMtgp32) );
cudaMalloc( (void **) &devKernelParams, sizeof(mtgp32_kernel_params) );
curandMakeMTGP32Constants( mtgp32dc_params_fast_11213, devKernelParams );
curandMakeMTGP32KernelState( devStates, mtgp32dc_params_fast_11213, devKernelParams, NBLOCKS, seedA );
ProblemKernel<<<NBLOCKS, NTHREADS>>>(devStates);
cudaDeviceSynchronize();
cudaFree(devStates);
cudaFree(devKernelParams);
cout << endl;
system("pause");
return 0;
}
The Output
i = 0 0.23747
i = 1 0.167179
i = 2 0.167179
i = 3 0.167179
i = 4 0.167179
i = 5 0.167179
i = 6 0.167179
i = 7 0.167179
i = 8 0.167179
i = 9 0.167179
Press any key to continue . . .