Confused about CURAND Sobol generator.

I am trying to use Sobol sequence with CUDA.

And I think the documentation on Sobol is kinda lacking.
I have a ray tracer, and for each pixel I need certain amount of samples.

And I want to generate them, for each pixel, in parallel.
So I need scrambled sequence.

I can call curandGetScrambleConstants32() to get a number of constants. But I dont know exactly how many I will get. And I don’t know whether these constant can be used for the same curandState or not?

Well first I agree we should at least add an example of how to use the quasi-random generators in our documentation. We’ll work on that. Meanwhile I’ll try to elaborate a bit on what is there.

Down to the specific questions. curandGetDirectionVectors32 returns a host pointer to an array of 20000 sets of direction vectors, each vector being an array of 32 unsigned 32-bit integers, one for each of the 20000 Sobol dimensions CURAND supports. Each set generates a unique sequence of random numbers. curandGetScrambleConstants32 returns a host pointer to an array of 20000 32-bit integers, which are the corresponding scramble constants for the 20000 sets of direction vectors. (The n-th constant should be used with the direction vectors of the n-th dimension) It’s up to the application to decide how many and which dimensions it needs, and to copy the direction vectors (and constants for scrambled Sobol) to the device.

When using quasi-random generators you have to be careful to use a different dimension of the generator for each independent variable you want to generate in a given thread. You can share dimensions among threads, and have each thread operate on a different subsequence of the overall sequence, by using different offsets when you initialize the thread. You set this up when you specify direction vectors, constants, and offsets in the call to curand_init.

Depending on your specific application, you might be able to give each thread its own dimension for each independent random variable it needs. For example if you were trying to populate 1024 unit cubes with uniformly distributed random points having coordinates x,y and z, using one thread per cube, you could use 3072 dimensions and have thread 0 use dimensions 0,1,2 thread 1 use dimension 3,4,5, etc. An alternative would be to just use 3 dimensions and within each dimension have thread 0 start at offset 0, thread 1 start at offset 222, thread n start at offset n * (222)…. The approach you take is a matter of how many dimensions you need and how many independent draws you need to take from the sequence.

Could you please give a small example of how to initialize a Sobol64 generator to generate uniform double precision random numbers for use on a device (to be able to generate and utilize random numbers in each thread). Thank you.

I’m working on a sample for our documentation, which I will include here in a couple of pieces(web page limitation). Let me know if it is helpful… Part 1:


  • This program uses the device CURAND API to calculate what
  • proportion of quasi-random 3D points fall within a sphere
  • of radius 1, and to derive the volume of the sphere.
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda.h>
    #include <curand_kernel.h>
    include direction vector arrays */
    #include “sobol_direction_vectors.h”

#define CUDA_CALL(x) do { if((x) != cudaSuccess) {
printf(“Error at %s:%d\n”,FILE,LINE);
return EXIT_FAILURE;}} while(0)

#define CURAND_CALL(x) do { if((x) != CURAND_STATUS_SUCCESS) {
printf(“Error at %s:%d\n”,FILE,LINE);
return EXIT_FAILURE;}} while(0)

/* This kernel initializes state per thread for each of x, y, and z */

global void setup_kernel(unsigned long long * sobolDirectionVectors,
unsigned long long sobolScrambleConstants,
curandStateScrambledSobol64 state)
int id = threadIdx.x + blockIdx.x * 64;
int dim = 3
Each thread uses 3 different dimensions /
curand_init(sobolDirectionVectors + 64
dim, sobolScrambleConstants[dim], 1234, &state[dim]);
curand_init(sobolDirectionVectors + 64*(dim + 1), sobolScrambleConstants[dim + 1], 1234, &state[dim + 1]);
curand_init(sobolDirectionVectors + 64*(dim + 2), sobolScrambleConstants[dim + 2], 1234, &state[dim + 2]);

/* This kernel generates random 3D points and increments a counter if

  • a point is within a unit sphere
    global void generate_kernel(curandStateScrambledSobol64 *state,
    int n,
    long long *result)
    int id = threadIdx.x + blockIdx.x * 64;
    int baseDim = 3 * id;
    long long count = 0;
    double x, y, z;

    /* Generate quasi-random double precision coordinates /
    for(int i = 0; i < n; i++) {
    x = curand_uniform_double(&state[baseDim]);
    y = curand_uniform_double(&state[baseDim + 1]);
    z = curand_uniform_double(&state[baseDim + 2]);
    Check if within sphere of radius 1 /
    if( (x
    x + yy + zz) < 1.0) {
    /* Store results */
    result[id] += count;

Part 2 of 2:

int main(int argc, char *argv)
int i;
long long total;
curandStateScrambledSobol64 *devSobol64States;
unsigned long long * devDirectionVectors64;
unsigned long long * devScrambleConstants64;
long long *devResults, *hostResults;
int sampleCount = 10000;
int iterations = 100;
double fraction;
double pi = 3.1415926535897932;
unsigned int pedantic;

/* Allow over-ride of sample count */    
if (argc == 2) {
/* Allocate space for results on host */
hostResults = (long long *)calloc(64 * 64, sizeof(long long));

/* Allocate space for results on device */
CUDA_CALL(cudaMalloc((void **)&devResults, 64 * 64 * 
                     sizeof(long long)));

/* Set results to 0 */
CUDA_CALL(cudaMemset(devResults, 0, 64 * 64 * 
                     sizeof(long long)));
/* Allocate memory for 3 states per thread, each state to get a unique dimension */
CUDA_CALL(cudaMalloc((void **)&devSobol64States, 64 * 64 * 3 *
/* Allocate memory and copy 3 sets of vectors per thread to the device */
CUDA_CALL(cudaMalloc((void **)&(devDirectionVectors64), 
                     3 * 4096 * 64 * sizeof(long long)));
CUDA_CALL(cudaMemcpy(devDirectionVectors64, &(scrambled_sobol_v_host[0][0]),
                     3 * 4096 * 64 * sizeof(long long), 
/* Allocate memory and copy 3 scramble constants per thread to the device */
CUDA_CALL(cudaMalloc((void **)&(devScrambleConstants64), 
                     3 * 4096 * sizeof(long long)));
CUDA_CALL(cudaMemcpy(devScrambleConstants64, &(scrambled_sobol_c_host[0]),
                     3 * 4096 * sizeof(long long), 

/* Initialize the states */

setup_kernel<<<64, 64>>>(devDirectionVectors64, devScrambleConstants64, devSobol64States);

/* Generate and count quasi-random points  */

for(i = 0; i < iterations; i++) {
    generate_kernel<<<64, 64>>>(devSobol64States, sampleCount, devResults);

/* Copy device memory to host */

CUDA_CALL(cudaMemcpy(hostResults, devResults, 64 * 64 * 
    sizeof(long long), cudaMemcpyDeviceToHost));

/* Tally and show result */

total = 0.0;
for(i = 0; i < 64 * 64; i++) {
    total += hostResults[i];

fraction = (double)total / (64.0 * 64.0 * (double)sampleCount * iterations);
printf("Fraction inside sphere was %g\n", fraction);
printf("(4/3) pi = %g, sampled volume = %g\n",(4.0*pi/3.0),8.0 * fraction);

/* Cleanup */

printf("^^^^ kernel_sobol_example PASSED\n");

/* touch some variables to satisfy pedantic compilers */
pedantic = ( sobol_v32[0][0] && scrambled_sobol_v32[0][0] && scrambled_sobol_c32[0] && sobol_v_host[0][0] );
if (pedantic) pedantic++;



Hello, I have also a question on the order of using the Host API of Sobol sequence.

My question is now I need to use different Gaussian random sequence. I want to use QMC, so I need different Sobol sequences (let’s say 10). I follow toolkit documentation and it says:

32 and 64 bit SOBOL and Scrambled SOBOL quasirandom generators
When generating n results in d dimensions, the output will consist of n / d results from dimension 1, followed by n / d results from dimension 2, and so on up to dimension d . Only exact multiples of the dimension size may be generated. The dimension parameter d is set with curandSetQuasiRandomGeneratorDimensions() and defaults to 1.

Here, if I use

hostData = (float *)calloc(n,sizeof(float));

Finally in hostData, I am confused which sequence I get

(1) (hostData[1],…,hostData[100]), (hostData[101],…,hostData[200]),…,(hostData[991],…,hostData[1000])


(hostData[1],…,hostData[100]) is Gaussian sequence transferred by Sobol sequence in dim 1

(hostData[101],…,hostData[200]) is Gaussian sequence transferred by Sobol sequence in dim 2
and so on

or I get

(2) (hostData[1],…,hostData[1000]) is Gaussian sequence, and transferred by
({Sobol_1[1], Sobol_1[2],…,Sobol_1[100]} , {Sobol_2[1],Sobol_2[2],…,Sobol_2[100]},…,{Sobol_100[1],Sobol_100[2],…,Sobol_100[100]})

where Sobol_i[j] is the j-th elements of the i-dimensional Sobol sequence.

The output, either in host or device memory, should be:

(hostData[1],…,hostData[100]) is Gaussian sequence transferred by Sobol sequence in dim 1

(hostData[101],…,hostData[200]) is Gaussian sequence transferred by Sobol sequence in dim 2
and so on

Thank you for the explanation and example. However, I tried to run the example you have provided and I cannot seem to find the header file “sobol_direction_vectors.h”. Can you tell me how can I find it?