memory error with using PRNG MRG32k3a on fermin 2050

My code crashed with " unspecified launch failure". I did memcheck and something unexpected happened in intializing the random number generator.


========= Invalid __global__ read of size 8

=========     at 0x00000a28 in kernel_curand_init

=========     by thread (0,1,0) in block (6,0,0)

=========     Address 0xdddddddddddddded is misaligned


========= Invalid __global__ read of size 8

=========     at 0x00000a28 in kernel_curand_init

=========     by thread (1,1,0) in block (6,0,0)

=========     Address 0xdddddddddddddded is misaligned




========= ERROR SUMMARY: 32 errors

the code works fine if I use XORWOW random generator. Does anybody have experience in MRG32k3a type of generators?

Following is a simplified code which reproduce memory problem.

#include <stdlib.h>

#include <stdio.h>

#include <cuda.h>

#include <curand.h>

#include <curand_kernel.h>

#include <curand_mrg32k3a.h>

#define CHECK_CUDA_ERR(str) checkCudaError(str)

#define NODE 512

void checkCudaError(const char *msg)


  cudaError_t err = cudaGetLastError();

  if( cudaSuccess != err) {

    fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );




__global__ void kernel_curand_init(curandStateMRG32k3a *rstate)


    int id=threadIdx.x + blockIdx.x*blockDim.x;

    curand_init(1234, id, 0, &rstate[id]);



int main(int argc, char *argv[]) {

    float2 *G;

    cudaMalloc((void **) &G, sizeof(float2)*NODE);

curandStateMRG32k3a *rstate;

    cudaMalloc((void **) &rstate, sizeof(curandStateMRG32k3a)*NODE);


int nt = 256;

    int nb = (NODE+nt-1)/nt;

    kernel_curand_init <<<nb, nt>>> (rstate);




    return 0;


I haven’t used CURAND but this looks incorrect to me:

cudaMalloc((void **) &rstate, sizeof(curandStateMRG32k3a)*NODE);

sizeof(curandStateMRG32k3a) gives the size of a pointer to the PRNG state, not the size of the PRNG state which I assume is what we want. Try:

cudaMalloc((void **) &rstate, sizeof(curandStateMRG32k3a[0])*NODE);

BTW, in this case it would seem to be harmless (because “NODE” happens to be evenly divided by “nt”), but in general you would want to check whether the index “id” is less than “NODE” to prevent array accesses out of bounds:

if (id < NODE) {

       curand_init(1234, id, 0, &rstate[id]);


Thanks for your respond, but

cudaMalloc((void **) &rstate, sizeof(curandStateMRG32k3a)*NODE);

seems alright to me, I am declaring memory for an array of type curandStateMRG32k3a,

which is a structure defined in curand.h and is not a pointer. BTW, sizeof(curandStateMRG32k3a)=80

and thanks for the suggestion of checking bound in kernel. I forgot to include that in this snippet.

I am very sorry for the confusion, I was going too fast and mixed up rstate and curandStateMRG32k3a. What I meant to suggest was making the code self-consistent:

cudaMalloc((void **) &rstate, sizeof(rstate[0])*NODE);

You are correct that this should behave the same as

cudaMalloc((void **) &rstate, sizeof(curandStateMRG32k3a)*NODE);

As I stated, I haven’t used CURAND so far and am not familiar with its setup requirements. I assume you have double checked the call to curand_init() to make sure all arguments are passed exactly as required? What version of CUDA are you using?

Hi lattice,

I can not reproduce a crush with your test case. Can you provide detail info about your test environment? Include OS, driver version, toolkit verion and card type. Thanks!

Best regards!

I was able to reproduce the crash (and cuda-memcheck complaints) with the posted code on a system with C2050, Linux64, CUDA 4.2. After reading CURAND 4.2 documentation, it seems the problem is with the use of curandStateMRG32k3a. Under “Device API Overview” on page 13 of the documentation, in sub-section “Bit Generation with XORWOW and MRG32k3a generators” the manual clearly shows that cuRandState_t should be used to store the PRNG state. If I change the code accordingly, the crash and cuda-memcheck complaints disappear. I am attaching the code I used in a file for reference. I compiled it as follows:

nvcc -arch=sm_20 -o rngtest

Let me know if switching from curandStateMRG32k3a to cuRandState_t does not fix the problem. (2.61 KB)

Hi njuffa.

I was using curandState before and my program run without problem. But I want to try MRG32k3a and in the section that you referred, the manual does not describe how to choose between XORWOW and MRG32k3a. Under Device API --> Typedefs on page 55 of the documentation, it shows that curandState_t defined as structure curandStateXORWOW, and curandStateMRG32k3a_t is defined as structure curandStateMRG32k3a. I suppose they are different type of generators, and if I want to use MRG32k3a, I should use curandStateMRG32k3a in my code.

I am doing large scale Monte Carlo simulations. The quality of the generators are important to me and I would like to test with different types of generators.

I agree the documentation seems to be worthy of improvement in this regard, I cannot figure it out either from looking at the document. It seems the CURAND initialization functions for the device API are overloaded for each type of RNG, because towards the end of the document I find this:

device void curand_init (unsigned long long seed, unsigned long long subsequence, unsigned long long offset, curandStateMRG32k3a_t state)

Using that would get us back to your starting point and the crash. I will do some more digging. Sorry for the inconvenience. BTW, I assume you already checked whether there is a worked example of device-side use of MRG32k3a somewhere in the SDK?

I did not find such examples in SDK.

After digging some more, I still cannot figure out how to make MRG32k3a work on the device. Please file a bug, that will be the swiftest way to get the issue resolved. Again, sorry for the inconvenience, and thank you for your help.