[cuRand] curandStatePhilox4_32_10_t error during init

Hello,
I am having issues setting up the curandStatePhilox4_32_10_t state on the device. I am using the following simple example (more or less from the documentation):

#include <iostream>
#include <curand_kernel.h>
#include <cuda_runtime.h>

// minimal definition of the error-check macro used below
#define cudaSafeCall(x) do { cudaError_t e = (x); if (e != cudaSuccess) std::cerr << cudaGetErrorString(e) << std::endl; } while (0)

__global__ void setupRngKernel(curandStatePhilox4_32_10_t* state, int maxVal, unsigned long long seed, int offset)
{
	int id = threadIdx.x + blockDim.x * blockIdx.x + offset;
	if (id >= maxVal)
		return;
	curand_init(seed, id, 0, &state[id]);
}

int main(){
	const int size = 256;
	const unsigned long long seed = 1234ull;
	std::cout << "Creating " << size << " GPU RNG states with " << sizeof(curandStatePhilox4_32_10_t) 
			  << " bytes each..." << std::endl;

	curandStatePhilox4_32_10_t* dev_randState;
	cudaSafeCall(cudaMalloc((void**)&dev_randState, size * sizeof(curandStatePhilox4_32_10_t)));
	const int useBlocks = 2;
	setupRngKernel<<<useBlocks, size>>>(dev_randState, size, seed, 0);
	cudaSafeCall(cudaDeviceSynchronize()); // surface any launch error before freeing
	cudaSafeCall(cudaFree(dev_randState));

	return 0;
}

With CUDA 6.0, this fails when I run the memory checker from the latest Parallel Nsight, which reports:

Internal debugger error occurred while attempting to launch _Z14setupRngKernelP24curandStatePhilox4_32_10iyi in CUcontext 0x001a6690, CUmodule 0x042d47a0:
code patching failed due to lack of code patching memory.
Please increase Nsight|Options|CUDA|Code Patching Memory and try again.
All breakpoints for function _Z14setupRngKernelP24curandStatePhilox4_32_10iyi have been removed.
See Output View for additional messages of this type.

I tried setting the code patching memory factor to something like 1000, but it did not do any good.
Any help would be great.
Tested on: Win7 x64 using GTX780 and GTX620
Note that the above code works perfectly fine (as is to be expected) when using the default curandState.

Thanks
–Tobi

This is actually not an answer to your question but a suggestion from my own experience with Philox:

I am not sure the way Philox is implemented in curand fits the design principle of the RNG. If you read the original paper “Parallel Random Numbers: As Easy as 1, 2, 3” you will find that this RNG is counter-based, not state-based like “standard” RNGs. If you define a state that has to be loaded from memory, you lose the benefit that this RNG has, i.e. no need to load a state…

I compared the curand performance of Philox (in the first release where it was included) against the original implementation from the authors’ homepage. With the latter, performance was much better.

If you are interested I can give you an example code.

Side remark: In counter-based PRNGs, the counter is the state.

Yes, but there is no need to save the state (the counter). One can, for example, use the threadID as the counter.

As far as I understood the curand approach to Philox when I tried it several months ago, the state needs to be saved in global memory. I did not analyze this in detail, since I had already included the authors’ version in my code…

Unless I am missing something fundamental, for threads to produce independent, non-overlapping sequences of random numbers in parallel, some per-thread state must be kept: for example, a per-thread counter that is initialized with the thread ID and incremented by the total thread count.

The following is the example from the authors’ documentation:

#include <Random123/philox.h>

typedef r123::Philox4x32 RNG;
RNG rng;
RNG::ctr_type c={{}};
RNG::ukey_type uk={{}};
uk[0] = ???; // some user_supplied_seed
RNG::key_type k=uk;

for(...){
   c[0] = ???; // some loop-dependent application variable 
   c[1] = ???; // another loop-dependent application variable 
   RNG::ctr_type r = rng(c, k);
   // use the random values in r for some operation related to
   // this iteration on objectid
}

They use the key for setting a seed, and then you have four 32-bit counter words, where you can use one for the threadID, a second as a per-thread counter, and perhaps a third as a global counter.

That is how I understand it.

The per-thread counter will (obviously) be necessary if you intend to produce more than one generated value per thread, and it constitutes “state”: it must be maintained by the thread (stored, incremented, etc.) and passed to the RNG on each generation. The threadID by itself is insufficient if you want more than one value per thread.
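In curand, that maintained state is exactly the Philox state object. The usual pattern is to copy it from global memory into a register, generate as many values as needed, and write it back so a later launch continues the stream. A minimal sketch (the kernel and buffer names are made up for illustration):

```
#include <curand_kernel.h>

__global__ void useRngKernel(curandStatePhilox4_32_10_t* state, float* out, int n)
{
	int id = threadIdx.x + blockDim.x * blockIdx.x;
	if (id >= n)
		return;
	// work on a register-local copy to avoid repeated global-memory traffic
	curandStatePhilox4_32_10_t localState = state[id];
	float4 r = curand_uniform4(&localState); // Philox yields four uniforms per call
	out[id] = r.x + r.y + r.z + r.w;
	state[id] = localState; // persist so the next launch continues the stream
}
```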

By the way, the initial report in this thread looks more like a bug report against Nsight VSE than an actual issue with Philox in curand.

Thank you for the hints.
I indeed need more than one random number per thread (ray tracing diffuse surfaces). For now I am using the default XORWOW engine, which works, but I was nevertheless curious what the issue with the Philox state might be.

Follow-up question: what is the most memory-efficient way to generate random numbers on the GPU (they do not need to be deterministic, but they must not be correlated) when using multiple rands per thread?

The size of curandState is 48 bytes (assuming one state per thread) on an x86 system, which can become a problem if you need a large number of states (10^7 or more, which I do) on older cards.

My RNG init code is pretty simple and works, although I am not satisfied with it (anything below 2 secs would be fine). I have already discarded the “same seed, different sequence” approach from the documentation, since it takes ages for reasonably sized state arrays.

__global__ void setupRngKernel(curandState* state, int maxVal, unsigned long long seed, int offset)
{
	int id = threadIdx.x + (blockIdx.x + blockIdx.y * gridDim.x) * blockDim.x + offset;
	if (id >= maxVal)
		return;
	// per-thread seed, sequence 0; small offset (mod 257, an arbitrary number), because huge offsets take too long
	curand_init((unsigned long long)id + seed, 0, id % 257, &state[id]);
}