Curand, my implementation works, but I am not sure it's the right way to do it

Hi all,
Very new to CUDA, so help is much appreciated. Even though my question is about curand, if you spot any fundamental flaws in how I am doing things, please let me know.

I have the following task:

“Given a list of N numbers, and a rate 0 < R < 100 randomly zero out R% of them.”

My included solution works, and I would like to get some feedback on how I am manipulating my curandState(s).

It is quite simple; the kernel I am using for this goes:

__global__ void dropr(float *A, curandState *globalstate, uint64_t N, float R)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N) {
        curandState localstate = globalstate[i];
        A[i] *= curand_uniform(&localstate) < R ? 0: 1;
    }
}

This kernel should be able to launch any number of times during the run of my application. (It’s a dropout ANN layer)

My problem is that my main application is written purely in C, and I struggled quite a bit to launch this exact kernel from my C code, because (to the extent that I understand how all this works) the curand API can only be compiled as C++. So again, let me know if I got something very wrong.

I came up with the following solution:

  • 3 files:
    1. main C code “testDRP.c”
    2. Intermediate C++ code used to initialize curandState “launcher.cc”
    3. CUDA kernels code “cudaDRP.cu”

testDRP.c calls functions in launcher.cc, which in turn launches the kernels in cudaDRP.cu.

The way I understand curand (and correct me if I got it completely wrong) is that each instance of a kernel executing a curand function requires its own curandState, due to the non-thread-safe nature of the curand generators. Is this interpretation correct?

So “launcher”, in its “init()” function, allocates enough memory for n curandState(s), corresponding to working on n numbers. I use a global variable to store all these states, so every time I compute a random number, a unique curandState is used.

“launcher.cc”

curandState *globalstate;
extern "C" void init(int n, int *res)
{
    *res = 1;
    cudaError_t stat;
    stat = cudaMalloc((void **)&globalstate,  n*sizeof(curandState));
    if (stat != cudaSuccess) {
        fprintf(stderr, "cuda malloc failed\n");
        *res = 0;
        return;
    }
    // curand_init() here is a host wrapper defined in cudaDRP.cu; it launches
    // a setup kernel that seeds one state per thread (see sketch below)
    curand_init(globalstate, n);
    if (cudaSuccess != cudaDeviceSynchronize()) { //Is this needed?
        fprintf(stderr, "curand init failed\n");
        *res = 0;
    }
}
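
That wrapper launches a setup kernel roughly along these lines (a simplified sketch, not the exact attached code; the kernel name setup_states and the seed 1234 are illustrative):

__global__ void setup_states(curandState *states, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        // same seed, per-thread subsequence: an independent stream per state
        curand_init(1234ULL, i, 0, &states[i]);
}

// host-side overload, called from launcher.cc as curand_init(globalstate, n)
__host__ void curand_init(curandState *globalstate, int n)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    setup_states<<<blocksPerGrid, threadsPerBlock>>>(globalstate, n);
}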

Finally, the actual computation is started in launcher.cc’s launch() function, which passes the array, the global list of states, the length of the array, and the rate.

launcher.cc

extern "C" void launch(unsigned int *a, uint64_t n, float r)
{
    curand_drand(a, globalstate, n, r);
}

cudaDRP.cu

__host__ void curand_drand(float *A,
                           curandState *globalstate,
                           uint64_t L,
                           float R)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (L + threadsPerBlock - 1) / threadsPerBlock;
    dropr<<<blocksPerGrid, threadsPerBlock>>>(A, globalstate, L, R); // grid size first, then block size
    if (cudaDeviceSynchronize() != cudaSuccess) {
        fprintf(stderr, "Cuda call failed\n");
    }
}

To compile the attached files (testDRP.c, launcher.cc, cudaDRP.cu):

nvcc -g -c -o cudaDRP.o cudaDRP.cu
gcc -Wall -pedantic -c -g -o launcher.o launcher.cc 
gcc -Wall -pedantic -o testDRP testDRP.c launcher.o cudaDRP.o -lcudart
./testDRP N R/100 | grep "0" | wc -l

You get an answer close to N*(R/100)

Thanks in advance for any help or comments.

Yes, it is correct that you should have a separate curandState for each CUDA thread (what you call “each instance of a kernel”). I guess you could perhaps say that is “due to the non-threadsafe…” but that isn’t the way I would describe it. You want the possibility for each thread to have an independent sequence, including separate generator positions and separate random seeds. That basically needs an independent state for each thread.

This is a very basic usage of curand. I haven’t studied your code in great detail, but it generally seems to me you have the basic idea correct. Note that the curand docs include various example codes, and there are also CUDA sample codes that demonstrate proper usage.

A few editorial comments:

  1. If this is all you intend to do on the GPU, you might be wasting your time. The cost of copying the weights (the A[] matrix) to and from the GPU is going to offset any performance benefit gained from using the GPU for this very simple op. The exception to this might be the cost of random number generation, but I would still be somewhat surprised if that alone swung this into the benefit column.

  2. Understanding the concern raised in item 1 (we are potentially “polishing junk” here), you can probably do a bit better with your zeroing kernel realization. This line of code requires every data item to be read once and then written once:

     A[i] *= curand_uniform(&localstate) < R ? 0: 1;
    

    For bandwidth-bound codes, the number of reads and writes is a principal figure of merit; reducing those often increases performance. There is no particular reason you need to read the data here. Something like this should suffice:

    if (curand_uniform(&localstate) < R) A[i] = 0;
    

    In so doing, you have cut out the read operation.

  3. For repeated usage, it’s customary to write the state used back to the global state. If you only call your dropr kernel once, this is irrelevant. But if you call it multiple times, you will get the exact same pattern in your random data generation on repeated calls, unless you update the global state after using it, e.g.:

     curandState localstate = globalstate[i];
     if (curand_uniform(&localstate) < R) A[i] = 0;
     globalstate[i] =  localstate;  // add this
    

    The call to the generator also updates the state. This is so the generator will generate something different, the next time you call it. It’s best to think of things per-thread here for understanding. Every thread is basically independent from every other thread.

  4. Given that the call to curand_init is generally “costly”, unless you have specific statistical needs for random number generation, for large A data sets I would advise employing a grid-stride loop in the generate function. Only create (init) enough state for a complement of threads that will maximally saturate your GPU (and no more), then use a grid-stride loop methodology in your generator kernel, potentially generating multiple results per CUDA thread. Generating new random numbers once the generator is set up is generally cheaper than creating more state. So fill the GPU, then reuse state beyond that. A sketch of this pattern follows below.
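
    As a rough sketch of that pattern (the kernel name dropr_gs and the sizes are illustrative, not prescriptive):

     __global__ void dropr_gs(float *A, curandState *globalstate, uint64_t N, float R)
     {
         int tid = blockDim.x * blockIdx.x + threadIdx.x;
         curandState localstate = globalstate[tid];  // load state once per thread
         // stride through the array: each thread handles every grid-size-th element
         for (uint64_t i = tid; i < N; i += (uint64_t)gridDim.x * blockDim.x)
             if (curand_uniform(&localstate) < R) A[i] = 0;
         globalstate[tid] = localstate;              // save state once per thread
     }

    You would launch this with a fixed grid sized to fill the device, e.g. some multiple of the SM count (queryable via cudaDeviceGetAttribute with cudaDevAttrMultiProcessorCount) and 256 threads per block, and allocate only that many states, regardless of N.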

Hi Rob,
First of all, thank you very much for taking the time to check my post. Really appreciate it.

The curand docs have been very useful; I just have one little comment. For new people, the example codes come on like a steamroller, which makes the learning curve quite steep.

I guess I should elaborate a bit more on what I am trying to achieve. I am learning CUDA programming by implementing a neural network library. It is a very basic thing; so far only fully connected layers are implemented.

I am going about this using a mixture of existing CUDA libraries in addition to implementing the easier tasks myself. For example, I am using cuBLAS for things like matrix multiplication, and implementing by myself things like applying non-linearities such as the sigmoid function or, as described in this post, dropout.

Without going into too much detail, the CPU side of my application loads data into the GPU, such as the weights and biases of my models, as well as the training and/or testing data. The GPU then does most of the computation, mainly the forward and backward passes. So my dropout computation will be used hundreds of thousands of times.

Would passing the address of the element in *globalstate be equivalent?

So instead of:

 curandState localstate = globalstate[i];
 if (curand_uniform(&localstate) < R) A[i] = 0;
 globalstate[i] =  localstate;  // add this

Have:

 //curandState localstate = globalstate[i];
 if (curand_uniform(&globalstate[i]) < R) A[i] = 0;
 //globalstate[i] = localstate;  // no longer needed

I will check into this.
In general, the cost of curand_init() is not much of a problem, as I do the initialization only once, with one state for each neuron in the layer. But I will study grid-stride looping in a bit more detail.

I do not need very strong statistical rigor for my random numbers. As long as curand_uniform() is independent between neurons (which I assume I have achieved, since each neuron has its own state) and, of course, uniform in (0,1), the layer should behave as expected.

Once again, thank you very much for the input.

Yes, however for repeated calls to curand_uniform() in the same thread, it may be more efficient to load the state once “into the thread”, then use it, then save the state when the thread retires, or when you are finished using it.
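
In other words (a sketch, assuming the thread draws more than once per launch):

 curandState localstate = globalstate[i];  // one read from global memory
 float r1 = curand_uniform(&localstate);   // repeated draws update only the
 float r2 = curand_uniform(&localstate);   // local copy
 globalstate[i] = localstate;              // one write back when finished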