Bug in CURAND 4.1 curandStateMRG32k3a_t badly initialized by curand_init

Hi!

I tried to use the new MRG32k3a PRNG. As there is no example of how to use it, and as the CURAND documentation does not draw any distinction between XORWOW and MRG32k3a, I assumed that one only needs to change the type of the curandState from curandStateXORWOW_t to curandStateMRG32k3a_t.

However, with this change, curand_init causes an unspecified launch failure.
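For reference, here is a minimal sketch of what I did (the kernel name and launch configuration are my own, not from CURAND):

```cuda
#include <curand_kernel.h>

/* Naive setup kernel: the direct analogue of the usual XORWOW setup,
   with only the state type swapped.  Launching this with
   curandStateMRG32k3a_t states produces the unspecified launch
   failure described above. */
__global__ void setup_kernel(curandStateMRG32k3a_t *states,
                             unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &states[id]);
}
```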

Studying the code, I found that the scratch pointer of curandStateMRG32k3a_t is used before being set:

void curand_init(unsigned long long seed, 
	unsigned long long subsequence, 
	unsigned long long offset, 
	curandStateMRG32k3a_t *state)
{
	int i;
	for ( i=0; i<3; i++ ) {
		state->s1[i] = 12345.;
		state->s2[i] = 12345.;
	}
	if (seed != 0ull) {
		unsigned int x1 = ((unsigned int)seed) ^ 0x55555555UL;
		unsigned int x2 = (unsigned int)((seed >> 32) ^ 0xAAAAAAAAUL);
		state->s1[0] = curand_MRGmodMul(x1, state->s1[0], MRG32K3A_MOD1);
		state->s1[1] = curand_MRGmodMul(x2, state->s1[1], MRG32K3A_MOD1);
		state->s1[2] = curand_MRGmodMul(x1, state->s1[2], MRG32K3A_MOD1);
		state->s2[0] = curand_MRGmodMul(x2, state->s2[0], MRG32K3A_MOD2);
		state->s2[1] = curand_MRGmodMul(x1, state->s2[1], MRG32K3A_MOD2);
		state->s2[2] = curand_MRGmodMul(x2, state->s2[2], MRG32K3A_MOD2);
	}
	skipahead_subsequence( subsequence, state );
	skipahead( offset, state );
}

void skipahead_subsequence(unsigned long long n, curandStateMRG32k3a_t *state)
{
	double t[3][3];
	struct sMRG32k3aSkipSubSeq * pSkip;
	/* state->scratch is dereferenced here, but curand_init never set it */
	pSkip = ((curandMRG32k3aPtrs_t *)state->scratch)->subSeqM1;
	curand_MRGmatPow3x3( pSkip->m, t, MRG32K3A_MOD1, n);
	curand_MRGmatVecMul3x3( t, state->s1, MRG32K3A_MOD1);
	pSkip = ((curandMRG32k3aPtrs_t *)state->scratch)->subSeqM2;
	curand_MRGmatPow3x3( pSkip->m, t, MRG32K3A_MOD2, n);
	curand_MRGmatVecMul3x3( t, state->s2, MRG32K3A_MOD2);
}

Changing curand_init to

void curand_init(unsigned long long seed, 
	unsigned long long subsequence, 
	unsigned long long offset, 
	curandStateMRG32k3a_t *state)
{
	int i;
	for ( i=0; i<3; i++ ) {
		state->s1[i] = 12345.;
		state->s2[i] = 12345.;
	}
	if (seed != 0ull) {
		unsigned int x1 = ((unsigned int)seed) ^ 0x55555555UL;
		unsigned int x2 = (unsigned int)((seed >> 32) ^ 0xAAAAAAAAUL);
		state->s1[0] = curand_MRGmodMul(x1, state->s1[0], MRG32K3A_MOD1);
		state->s1[1] = curand_MRGmodMul(x2, state->s1[1], MRG32K3A_MOD1);
		state->s1[2] = curand_MRGmodMul(x1, state->s1[2], MRG32K3A_MOD1);
		state->s2[0] = curand_MRGmodMul(x2, state->s2[0], MRG32K3A_MOD2);
		state->s2[1] = curand_MRGmodMul(x1, state->s2[1], MRG32K3A_MOD2);
		state->s2[2] = curand_MRGmodMul(x2, state->s2[2], MRG32K3A_MOD2);
	}
	curandMRG32k3aPtrs_t scratch;
	sMRG32k3aSkipUnits_t unitsM1;
	sMRG32k3aSkipUnits_t unitsM2;
	sMRG32k3aSkipSubSeq_t subSeqM1;
	sMRG32k3aSkipSubSeq_t subSeqM2;
	sMRG32k3aSkipSeq_t seqM1;
	sMRG32k3aSkipSeq_t seqM2;
	scratch.seqM1 = &seqM1;
	scratch.seqM2 = &seqM2;
	scratch.subSeqM1 = &subSeqM1;
	scratch.subSeqM2 = &subSeqM2;
	scratch.unitsM2 = &unitsM2;
	scratch.unitsM1 = &unitsM1;
	state->scratch = &scratch;
	skipahead_subsequence( subsequence, state );
	skipahead( offset, state );
}

solves this problem, but only partially: the structures pointed to by the scratch pointer are now allocated, but they are never filled with meaningful data.

I suppose that scratch should point to a memory area common to all the threads, referenced by the curandState of each thread. However, I do not understand how it is supposed to be initialized.

Can anyone provide a usage example for this PRNG?

Thank you.

Regards,

Wilfried K.

Since I have not used CURAND myself, I asked the library team for comment, and they provided the following response (with minor edits by me):

You are right that our current documentation failed to describe how these structures should be initialized. Sorry for the inconvenience, we will definitely add some text and an example in the next release. Meanwhile here is an explanation.

These are arrays of pre-computed matrices used to skip ahead in the MRG32k3a sequence. They are declared in curand_mrg32k3a.h, so you will need to include that header file in your code (it is already included in the curand library itself, but it looks like you may not be linking with that). Prior to calling curand_init, you need to copy the arrays to device memory, fill in a structure of type curandMRG32k3aPtrs_t with the device pointers to the corresponding arrays, and copy that structure to device memory as well. The device pointer to that structure is what goes into state->scratch. From the code you included, it seems you figured most of this out; the missing pieces were copying the arrays from the host and setting state->scratch before calling curand_init.

Of course there are many ways to code the initialization. Below is an example. It allocates space for all of the structures at once, then copies the arrays one at a time to device memory (this is necessary because not all systems keep the arrays in memory in the order they are declared). It sets gpu_scratch_area to the device pointer to the curandMRG32k3aPtrs_t structure, and it keeps a pointer to the entire allocation in MRGTables so that the memory can be released later with cudaFree.

curandMRG32k3aPtrs_t MRG;
curandMRG32k3aPtrs_t *gpu_scratch_area;
int *MRGTables;

/* One allocation holds the six skip-ahead tables plus the pointer structure. */
if(cudaMalloc((void **)&(MRG.unitsM1),
              (2 * sizeof(struct sMRG32k3aSkipUnits)) +
              (2 * sizeof(struct sMRG32k3aSkipSubSeq)) +
              (2 * sizeof(struct sMRG32k3aSkipSeq)) +
              sizeof(curandMRG32k3aPtrs_t)) != cudaSuccess)
{
    return CURAND_STATUS_ALLOCATION_FAILED;
}
if(cudaMemcpy(MRG.unitsM1,
              &mrg32k3aM1[0][0][0],
              sizeof( struct sMRG32k3aSkipUnits ),
              cudaMemcpyHostToDevice) != cudaSuccess)
{
    cudaFree(MRG.unitsM1);
    return CURAND_STATUS_INITIALIZATION_FAILED;
}
MRG.unitsM2 = (sMRG32k3aSkipUnits_t *)((char *)MRG.unitsM1 +
                sizeof( struct sMRG32k3aSkipUnits ));
if(cudaMemcpy(MRG.unitsM2,
              &mrg32k3aM2[0][0][0],
              sizeof( struct sMRG32k3aSkipUnits ),
              cudaMemcpyHostToDevice) != cudaSuccess)
{
    cudaFree(MRG.unitsM1);
    return CURAND_STATUS_INITIALIZATION_FAILED;
}
MRG.subSeqM1 = (sMRG32k3aSkipSubSeq_t *)((char *)MRG.unitsM2 +
                sizeof( struct sMRG32k3aSkipUnits ));
if(cudaMemcpy(MRG.subSeqM1,
              &mrg32k3aM1SubSeq[0][0][0],
              sizeof( struct sMRG32k3aSkipSubSeq ),
              cudaMemcpyHostToDevice) != cudaSuccess)
{
    cudaFree(MRG.unitsM1);
    return CURAND_STATUS_INITIALIZATION_FAILED;
}
MRG.subSeqM2 = (sMRG32k3aSkipSubSeq_t *)((char *)MRG.subSeqM1 +
                sizeof( struct sMRG32k3aSkipSubSeq ));
if(cudaMemcpy(MRG.subSeqM2,
              &mrg32k3aM2SubSeq[0][0][0],
              sizeof( struct sMRG32k3aSkipSubSeq ),
              cudaMemcpyHostToDevice) != cudaSuccess)
{
    cudaFree(MRG.unitsM1);
    return CURAND_STATUS_INITIALIZATION_FAILED;
}
MRG.seqM1 = (sMRG32k3aSkipSeq_t *)((char *)MRG.subSeqM2 +
             sizeof( struct sMRG32k3aSkipSubSeq ));
if(cudaMemcpy(MRG.seqM1,
              &mrg32k3aM1Seq[0][0][0],
              sizeof( struct sMRG32k3aSkipSeq ),
              cudaMemcpyHostToDevice) != cudaSuccess)
{
    cudaFree(MRG.unitsM1);
    return CURAND_STATUS_INITIALIZATION_FAILED;
}
MRG.seqM2 = (sMRG32k3aSkipSeq_t *)((char *)MRG.seqM1 +
             sizeof( struct sMRG32k3aSkipSeq ));
if(cudaMemcpy(MRG.seqM2,
              &mrg32k3aM2Seq[0][0][0],
              sizeof( struct sMRG32k3aSkipSeq ),
              cudaMemcpyHostToDevice) != cudaSuccess)
{
    cudaFree(MRG.unitsM1);
    return CURAND_STATUS_INITIALIZATION_FAILED;
}
/* The pointer structure itself lives at the end of the allocation. */
gpu_scratch_area = (curandMRG32k3aPtrs_t *)((char *)MRG.seqM2 +
                    sizeof( struct sMRG32k3aSkipSeq ));
if(cudaMemcpy(gpu_scratch_area,
              &MRG,
              sizeof( curandMRG32k3aPtrs_t ),
              cudaMemcpyHostToDevice) != cudaSuccess)
{
    cudaFree(MRG.unitsM1);
    return CURAND_STATUS_INITIALIZATION_FAILED;
}
MRGTables = (int *)MRG.unitsM1;
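To complete the picture, the remaining device-side step might look like this. This is only a sketch of my reading of the answer above; the kernel name, launch configuration, and the exact cast needed for state->scratch are my own assumptions, not part of CURAND:

```cuda
#include <curand_kernel.h>
#include <curand_mrg32k3a.h>

/* Each thread points its state at the shared device-side
   curandMRG32k3aPtrs_t structure BEFORE calling curand_init,
   so the skip-ahead code finds valid pre-computed matrices
   when it dereferences state->scratch. */
__global__ void setup_kernel(curandStateMRG32k3a_t *states,
                             curandMRG32k3aPtrs_t *gpu_scratch_area,
                             unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    /* a cast may be required here depending on how the scratch
       member is declared in your CURAND version */
    states[id].scratch = gpu_scratch_area;
    curand_init(seed, id, 0, &states[id]);
}
```

With this, the host code above runs once, and the kernel is then launched with gpu_scratch_area passed in so that every thread's state shares the same table area.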

Thank you for your answer!

This header (curand_mrg32k3a.h) is not included by the other CURAND headers. Judging from the design of the other PRNGs, it should be included from curand_kernel.h.

Is there a reason to use a different approach than for the XORWOW PRNG?

The XORWOW PRNG's precomputed arrays are declared as constant tables in curand_precalc.h.

Declaring the MRG32k3a tables the same way would ease the use of this PRNG, provide a unified initialization with XORWOW, and reduce the size of curandState by avoiding a pointer.

I think that providing a single usage pattern for all the generators is a good objective, especially for PRNGs that may change frequently.

Good to hear the answer was helpful. I have relayed your comments back to the library team.