How to use the cuRAND library on multiple GPUs?

Hi,

To generate normally distributed random numbers on multiple GPUs, I was thinking of creating one generator per device and using cudaSetDevice to switch between the individual GPUs, i.e.:

int nDevices;
CUDA_CALL(cudaGetDeviceCount(&nDevices));

int n = 10; // Number of random variables per device
float **devData = (float**)malloc(sizeof(float*)*nDevices);
curandGenerator_t *gen = (curandGenerator_t*) malloc( sizeof(curandGenerator_t)*nDevices );
for( int device = 0; device<nDevices; device++ )
{
  CUDA_CALL( cudaSetDevice( device ) );
  CURAND_CALL( curandCreateGenerator( &gen[device], CURAND_RNG_PSEUDO_DEFAULT ) );
  CURAND_CALL( curandSetPseudoRandomGeneratorSeed( gen[device], 1234ULL+device ) );
  CUDA_CALL( cudaMalloc( (void**)&devData[device] , n*sizeof( float ) ) );
  CURAND_CALL( curandGenerateNormal( gen[device], devData[device], n, 0.0f, 1.0f ) );
}

The problem is that the call to curandGenerateNormal consistently fails with error 201 (CURAND_STATUS_LAUNCH_FAILURE, i.e. a kernel launch failure). Does anyone know what’s wrong here?
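
In case it helps with reproducing this, a self-contained version of the snippet could look roughly like the sketch below. The CUDA_CALL and CURAND_CALL definitions are assumptions, since the macros are not shown above; they simply print the numeric error code and abort. The cudaDeviceSynchronize call and the cleanup loop are also additions for illustration, so that an asynchronous error is reported on the device where it occurs.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <curand.h>

// Assumed error-checking macros (not shown in the original snippet).
#define CUDA_CALL(x) do { cudaError_t e_ = (x); if (e_ != cudaSuccess) { \
    fprintf(stderr, "CUDA error %d (%s) at %s:%d\n", (int)e_, \
            cudaGetErrorString(e_), __FILE__, __LINE__); \
    exit(EXIT_FAILURE); } } while (0)

#define CURAND_CALL(x) do { curandStatus_t s_ = (x); if (s_ != CURAND_STATUS_SUCCESS) { \
    fprintf(stderr, "cuRAND error %d at %s:%d\n", (int)s_, __FILE__, __LINE__); \
    exit(EXIT_FAILURE); } } while (0)

int main(void)
{
    int nDevices = 0;
    CUDA_CALL(cudaGetDeviceCount(&nDevices));

    const size_t n = 10; // Number of random variables per device (kept even)
    float **devData = (float**)malloc(sizeof(float*) * nDevices);
    curandGenerator_t *gen =
        (curandGenerator_t*)malloc(sizeof(curandGenerator_t) * nDevices);
    if (devData == NULL || gen == NULL) {
        fprintf(stderr, "host allocation failed\n");
        return EXIT_FAILURE;
    }

    for (int device = 0; device < nDevices; device++) {
        // Make this device current before creating its generator and buffer.
        CUDA_CALL(cudaSetDevice(device));
        CURAND_CALL(curandCreateGenerator(&gen[device], CURAND_RNG_PSEUDO_DEFAULT));
        CURAND_CALL(curandSetPseudoRandomGeneratorSeed(gen[device], 1234ULL + device));
        CUDA_CALL(cudaMalloc((void**)&devData[device], n * sizeof(float)));
        CURAND_CALL(curandGenerateNormal(gen[device], devData[device], n, 0.0f, 1.0f));
        // Wait for the generation kernel so a failure is attributed to this device.
        CUDA_CALL(cudaDeviceSynchronize());
        printf("device %d: generated %zu values\n", device, n);
    }

    // Release resources with the owning device made current again.
    for (int device = 0; device < nDevices; device++) {
        CUDA_CALL(cudaSetDevice(device));
        CURAND_CALL(curandDestroyGenerator(gen[device]));
        CUDA_CALL(cudaFree(devData[device]));
    }
    free(gen);
    free(devData);
    return EXIT_SUCCESS;
}
```

This would be built with something like nvcc and linked against cuRAND (-lcurand). The message printed on failure shows which call and line the error comes from.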

Many thanks in advance!