curand_init - Grid Launch Error

Hi,

I am getting a grid launch error with the following code under CUDA 8.0, compiled with compute_50,sm_50. The computer is a Microsoft Surface Pro with a GeForce GPU (compute capability 5.0), running Windows 10.

Error message is “CUDA grid launch failed: CUcontext: 55461112 CUmodule: 55268960 Function: _Z12setup_kernelP19curandStateMRG32k3am”. If I comment out the call to curand_init then it launches fine. Any ideas? The driver is up to date.

#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <curand.h>
#include "device_launch_parameters.h"
#include <stdio.h>

__global__ void setup_kernel(curandStateMRG32k3a * state, unsigned long seed)
{
	const unsigned tid = threadIdx.x;
	const unsigned bid = blockIdx.x;
	const unsigned bsz = blockDim.x;
	int index = tid + bid * bsz;
	curand_init(seed, index, 0, &state[index]);
}

int main()
{
	int minNumSims = 500;
	const unsigned BLOCK_SIZE = 256;
	const unsigned GRID_SIZE = (const unsigned)ceil(float(minNumSims) / float(BLOCK_SIZE));
	int numSims = BLOCK_SIZE * GRID_SIZE;
	dim3 tpb(numSims, 1, 1);
	curandStateMRG32k3a *devStates;

	cudaError_t err = cudaMalloc((void **)&devStates, BLOCK_SIZE * GRID_SIZE * sizeof(curandStateMRG32k3a));
	setup_kernel<<<GRID_SIZE, BLOCK_SIZE>>>(devStates, 1);

	return 0;
}

An update: the code above works fine under CUDA 7.5 and fails only under CUDA 8.0. My GPU is Maxwell generation.

Any ideas? I think I will file a bug report.

Run it under cuda-memcheck to find the exact reason the launch failed. I can't readily reproduce the error, but I don't have your exact setup.

I suspect this may be a registers-per-thread issue, which is not a bug per se. If that is the cause, cuda-memcheck will report it.

You can also test this by launching a much smaller number of threads, say 1 block of 32 threads, rather than 2 blocks of 256 threads.
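
For reference, the two launch configurations mentioned here (2 blocks of 256 threads for 500 simulations, 1 block of 32 threads for 32) both come from rounding the block count up. Below is a minimal host-side sketch of that arithmetic in C++, using integer division instead of the float ceil in the posted code. The names blocks_for and threads_for are hypothetical helpers for illustration, not part of CUDA or cuRAND.

```cpp
// Hypothetical helper (not from the thread or the CUDA API):
// smallest number of blocks such that blocks * block_size >= n.
// Assumes block_size > 0 and n + block_size - 1 does not overflow.
constexpr unsigned blocks_for(unsigned n, unsigned block_size)
{
    return (n + block_size - 1) / block_size;
}

// Hypothetical helper: total threads launched, which is also the number
// of curand states to allocate so every thread indexes in bounds.
constexpr unsigned threads_for(unsigned n, unsigned block_size)
{
    return blocks_for(n, block_size) * block_size;
}

// Compile-time checks against the configurations discussed in the thread.
static_assert(blocks_for(500, 256) == 2, "500 sims at 256 threads/block -> 2 blocks");
static_assert(threads_for(500, 256) == 512, "2 blocks of 256 -> 512 states");
static_assert(blocks_for(32, 32) == 1, "32 sims at 32 threads/block -> 1 block");
```

With these, GRID_SIZE in the posted code could be written as blocks_for(minNumSims, BLOCK_SIZE), and the cudaMalloc could be sized for threads_for(minNumSims, BLOCK_SIZE) states.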

Tried running through cuda-memcheck and it finds no errors.

Also tried with BLOCK_SIZE = 32 and minNumSims = 32, which means 1 block of 32 threads. Same error: again, it is fine in CUDA 7.5 but fails in 8.0.

So the error that was occurring does not occur when you run it with cuda-memcheck?

Correct. cuda-memcheck says everything is fine, but Nsight throws the error I reported above.

It might be an issue with the Nsight environment rather than with the code itself.