CURAND initialization time

I just started using the new CURAND library and did some benchmarking.

I tested the startup time with a simple setup kernel:

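__global__ void setup_kernel(curandState *state)
{
    // Flatten the 2D thread coordinates into one linear index.
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int id = x + y * blockDim.x * gridDim.x;

    // Same seed for every thread, a different subsequence per thread.
    curand_init(SEED, id, 0, &state[id]);
}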

Then I tested the generation time with a simple noise pixel generator:

__global__ void noise2d(unsigned char *ptr, int ticks, curandState *state)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int id = x + y * blockDim.x * gridDim.x;

    // Work on a local copy of the generator state.
    curandState localState = state[id];

    // One RGBA pixel per thread: random color, opaque alpha.
    ptr[id*4 + 0] = 255*curand_uniform(&localState);
    ptr[id*4 + 1] = 255*curand_uniform(&localState);
    ptr[id*4 + 2] = 255*curand_uniform(&localState);
    ptr[id*4 + 3] = 255;

    // Store the advanced state so the next launch continues the sequence.
    state[id] = localState;
}

The results for curand_uniform (and the other curand_x functions) are quite good: about 2 ms.

Creating the random generators is another story. When I launched 1024*1024 threads and initialized one generator per thread, it took about 18000 ms (18 seconds)!
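
For anyone who wants to reproduce the measurement, here is a minimal sketch of timing the setup kernel with CUDA events. The SEED value, the 16x16 block size and the helper name time_setup_ms are assumptions for illustration and are not taken from the original benchmark.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define SEED 1234ULL  // assumed value; the post does not say what SEED is

__global__ void setup_kernel(curandState *state)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int id = x + y * blockDim.x * gridDim.x;
    curand_init(SEED, id, 0, &state[id]);
}

// Hypothetical helper: elapsed time of one setup launch, in milliseconds.
float time_setup_ms(int width, int height)
{
    curandState *state = nullptr;
    cudaMalloc(&state, sizeof(curandState) * width * height);

    dim3 block(16, 16);                            // assumed block size
    dim3 grid(width / block.x, height / block.y);  // assumes exact multiples

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    setup_kernel<<<grid, block>>>(state);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(state);
    return ms;
}

int main()
{
    printf("setup: %.1f ms\n", time_setup_ms(1024, 1024));
    return 0;
}
```

The absolute number will depend on the GPU and the CURAND version, so treat it only as a relative measure when comparing setup variants.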

So my questions are:

    What setup times do you get when creating generators?

    Is there a faster way to do it?

    Should I create generators only per block and somehow reuse/replicate them?

One thing you can do is use different seeds for each thread and a fixed subsequence of 0 and offset of 0. In the code this could be:

curand_init( (SEED << 20) + id, 0, 0, &state[id]);

That should be much faster to initialize. The downside is that you lose some of the nice mathematical properties between threads. It is possible that there is a bad interaction between the hash function that initializes the generator state from the seed and the periodicity of the generators. If that happens, you might get two threads with highly correlated outputs for some seeds. I don’t know of any problems like this, and even if they do exist they will most likely be rare.
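
Put together with the setup kernel from the question, this might look like the sketch below. The SEED value, the kernel name setup_kernel_fast, the helper run_fast_setup and the 64-bit cast are assumptions added for illustration; the cast only keeps the shifted seed from overflowing a 32-bit int. As far as I understand, the speedup comes from skipping the jump ahead to subsequence id that the original call has to perform.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define SEED 1234ULL  // assumed value; the thread does not say what SEED is

__global__ void setup_kernel_fast(curandState *state)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int id = x + y * blockDim.x * gridDim.x;

    // Different seed per thread, fixed subsequence 0 and offset 0.
    // The 20-bit shift leaves room for 2^20 = 1024*1024 thread ids per SEED.
    unsigned long long seed = ((unsigned long long)SEED << 20) + id;
    curand_init(seed, 0, 0, &state[id]);
}

// Hypothetical helper: runs the setup once and reports whether it succeeded.
bool run_fast_setup(int width, int height)
{
    curandState *state = nullptr;
    if (cudaMalloc(&state, sizeof(curandState) * width * height) != cudaSuccess)
        return false;

    dim3 block(16, 16);                            // assumed block size
    dim3 grid(width / block.x, height / block.y);  // assumes exact multiples
    setup_kernel_fast<<<grid, block>>>(state);

    cudaError_t err = cudaDeviceSynchronize();
    cudaFree(state);
    return err == cudaSuccess;
}

int main()
{
    printf("fast setup %s\n", run_fast_setup(1024, 1024) ? "ok" : "failed");
    return 0;
}
```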

Hope this helps.

Thank you for your reply!

This did help.

Now I'm getting ~20 ms setup time :)

Also, the generated noise looks pretty much the same as before.

Is there any way I can generate two random numbers in 1 thread?

You can call curand_uniform twice (or any number of times) in one thread.

Study the examples in the documentation:

https://docs.nvidia.com/cuda/curand/device-api-overview.html#device-api-example
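
A minimal sketch of the idea, with each thread drawing two values from its own generator state. The kernel name two_per_thread, the helper generate_pairs, the seed 1234 and the float2 output layout are assumptions for illustration, not something from the linked example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// One generator per thread; the seed 1234 is an arbitrary, assumed value.
__global__ void setup_kernel(curandState *state, int n)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n)
        curand_init(1234ULL, id, 0, &state[id]);
}

__global__ void two_per_thread(curandState *state, float2 *out, int n)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id >= n) return;

    curandState local = state[id];
    float a = curand_uniform(&local);  // first draw
    float b = curand_uniform(&local);  // second draw, state has advanced
    out[id] = make_float2(a, b);
    state[id] = local;                 // keep the advanced state
}

// Hypothetical helper: fills host_out with n pairs, returns true on success.
bool generate_pairs(float2 *host_out, int n)
{
    curandState *state = nullptr;
    float2 *out = nullptr;
    cudaMalloc(&state, n * sizeof(curandState));
    cudaMalloc(&out, n * sizeof(float2));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    setup_kernel<<<blocks, threads>>>(state, n);
    two_per_thread<<<blocks, threads>>>(state, out, n);

    // cudaMemcpy waits for both kernels before copying the results back.
    cudaError_t err = cudaMemcpy(host_out, out, n * sizeof(float2),
                                 cudaMemcpyDeviceToHost);
    cudaFree(out);
    cudaFree(state);
    return err == cudaSuccess;
}

int main()
{
    float2 pairs[4];
    if (generate_pairs(pairs, 4))
        for (int i = 0; i < 4; ++i)
            printf("%f %f\n", pairs[i].x, pairs[i].y);
    return 0;
}
```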

Yes, I figured that out soon after posting. Thanks a lot, @Robert_Crovella