I tested the startup time with a simple setup kernel:
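#include <curand_kernel.h>

// One generator state per thread: same seed, a distinct subsequence per thread.
// SEED is a constant defined elsewhere.
__global__ void setup_kernel(curandState *state)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int id = x + y * blockDim.x * gridDim.x;   // linear index in the 2D grid
    curand_init(SEED, id, 0, &state[id]);      // seed, subsequence, offset
}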
Then I tested the generation time with a simple noise-pixel generator:
__global__ void noise2d(unsigned char *ptr, int ticks, curandState *state)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int id = x + y * blockDim.x * gridDim.x;

    // Work on a local copy of the state, then write it back.
    curandState localState = state[id];
    ptr[id*4 + 0] = 255*curand_uniform(&localState);
    ptr[id*4 + 1] = 255*curand_uniform(&localState);
    ptr[id*4 + 2] = 255*curand_uniform(&localState);
    ptr[id*4 + 3] = 255;   // fourth channel is fixed at 255
    state[id] = localState;
}
Generating numbers with curand_uniform (and the other curand_x functions)
is quite fast, about 2 ms.
Creating the random generators is another story.
When I launched 1024*1024 threads to create the generators, it took
about 18000 ms (18 seconds)!
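For anyone who wants to reproduce a measurement like this, here is a minimal sketch of timing the setup kernel with CUDA events. It is not the code that produced the numbers above: the SEED value, the 16x16 block size, and the time_setup_ms helper are assumptions for illustration.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

#define SEED 1234ULL  // hypothetical seed value, not from the original post

// Same scheme as in the post: one seed, one subsequence per thread.
__global__ void setup_kernel(curandState *state)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int id = x + y * blockDim.x * gridDim.x;
    curand_init(SEED, id, 0, &state[id]);
}

// Hypothetical helper: runs setup_kernel over width x height threads and
// returns the elapsed GPU time in milliseconds, or -1.0f on error.
// Assumes width and height are multiples of 16.
float time_setup_ms(int width, int height)
{
    dim3 block(16, 16);
    dim3 grid(width / block.x, height / block.y);

    curandState *state = nullptr;
    size_t bytes = sizeof(curandState) * (size_t)width * (size_t)height;
    if (cudaMalloc(&state, bytes) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    setup_kernel<<<grid, block>>>(state);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(state);
    return ms;
}
```

Calling time_setup_ms(1024, 1024) would correspond to the 1024*1024-thread case; the result depends heavily on the GPU and the CUDA version.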
So my questions are:
[*]How long does creating the generators take for you?
[*]Is there a faster way to do it?
[*]Should I create generators only per block and somehow reuse/replicate them?
That should be much faster to initialize. The downside is that you lose some of the nice mathematical properties between threads. It is possible that there is a bad interaction between the hash function that initializes the generator state from the seed and the periodicity of the generators. If that happens, you might get two threads with highly correlated outputs for some seeds. I don’t know of any problems like this, and even if they do exist they will most likely be rare.
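For concreteness, here is a minimal sketch of one cheap-initialization scheme that this trade-off applies to: give every thread its own seed and leave the subsequence number at 0, so curand_init has no skip-ahead to perform. The SEED value, the kernel names, and the count_out_of_range helper are assumptions for illustration, not code from this thread.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <vector>

#define SEED 1234ULL  // hypothetical base seed

// Variant setup: a unique seed per thread, subsequence 0, offset 0.
__global__ void setup_kernel_per_thread_seed(curandState *state)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int id = x + y * blockDim.x * gridDim.x;
    curand_init(SEED + id, 0, 0, &state[id]);
}

// Draws one uniform value per thread so the states can be sanity-checked.
__global__ void first_draw(curandState *state, float *out)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int id = x + y * blockDim.x * gridDim.x;
    curandState localState = state[id];
    out[id] = curand_uniform(&localState);
    state[id] = localState;
}

// Hypothetical helper: returns how many first draws fall outside (0, 1],
// or -1 on a CUDA error. Assumes width and height are multiples of 16.
int count_out_of_range(int width, int height)
{
    dim3 block(16, 16);
    dim3 grid(width / block.x, height / block.y);
    size_t n = (size_t)width * (size_t)height;

    curandState *state = nullptr;
    float *d_out = nullptr;
    if (cudaMalloc(&state, n * sizeof(curandState)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(state);
        return -1;
    }

    setup_kernel_per_thread_seed<<<grid, block>>>(state);
    first_draw<<<grid, block>>>(state, d_out);

    std::vector<float> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(state);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (float v : h_out)
        if (!(v > 0.0f && v <= 1.0f)) ++bad;   // curand_uniform returns (0, 1]
    return bad;
}
```

In this variant each thread starts at the beginning of its own seeded sequence instead of at a separate subsequence of one shared sequence, which is where the cross-thread guarantees are lost. The range check only confirms that the states are usable; it says nothing about correlation between threads.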