Simple way of generating normally distributed random numbers with cuRAND in parallel loop

I’m trying to generate normally distributed random numbers in an OpenACC parallel loop, but since the C++ STL <random> library doesn’t work with parallel code, I had a look at the cuRAND library. However, after looking around at various websites and guides, there seem to be many ways to use cuRAND, all of which look quite complicated (thread IDs, block IDs, mallocs, deviceptr directives, etc.). As such, I thought I’d give a simple example of using the <random> library, and ask if there’s an equivalently simple way of doing this with cuRAND. The following program generates a vector of random numbers using the <random> library:

#include <random>
#include <vector>
#include <iostream>
#include <chrono>

class RandomVectorGenerator {
public:
    RandomVectorGenerator(const size_t num_elements, const float initial_value);
    float add_random_noise_to_num(const float x);
    void fill_vector_with_random_nums();

private:
    size_t num_elements_;
    std::vector<float> random_nums_;

    std::random_device seed_;           // non-deterministic seed source
    std::ranlux24_base random_engine_;  // engine shared by every call
};

RandomVectorGenerator::RandomVectorGenerator(const size_t num_elements, const float initial_value) {
    num_elements_ = num_elements;
    random_nums_ = std::vector<float>(num_elements_, initial_value);
    random_engine_ = std::ranlux24_base(seed_());  // seed the engine once

    #pragma acc enter data copyin(this)
}

// #pragma acc routine seq
float RandomVectorGenerator::add_random_noise_to_num(const float x) {
    float mean = 0.0f;
    float std_dev = 10.0f;
    std::normal_distribution<float> random_num_sampler(mean, std_dev);

    // Sampling advances the shared engine state, so this call is not thread safe.
    return x + random_num_sampler(random_engine_);
}

void RandomVectorGenerator::fill_vector_with_random_nums() {
    float *random_nums_ptr = random_nums_.data();

    // #pragma acc parallel loop copy(random_nums_ptr)
    for(size_t i = 0; i < num_elements_; ++i) {
        random_nums_ptr[i] = add_random_noise_to_num(random_nums_ptr[i]);
    }
}

int main() {
    size_t num_elements = 10000000;
    float initial_value = 10.0f;

    RandomVectorGenerator random_vector_generator(num_elements, initial_value);
    auto start = std::chrono::system_clock::now();

    // Timed region: add normally distributed noise to every element.
    random_vector_generator.fill_vector_with_random_nums();

    auto end = std::chrono::system_clock::now();
    std::chrono::duration<float> diff = end - start;
    std::cout << "Random vector generation time: " << diff.count() << " seconds\n";

    return 0;
}

The seed source and random number engine are declared as class members, and the engine is seeded in the RandomVectorGenerator constructor:

std::random_device seed_;
std::ranlux24_base random_engine_;

Furthermore, the add_random_noise_to_num() function is used to produce random numbers using a normal distribution as follows:

std::normal_distribution<float> random_num_sampler(mean, std_dev);
return x + random_num_sampler(random_engine_);

Finally, the random numbers are generated and timed in the main() function, where compiling and running the program produces:

$ nvc++ -O3 -acc -Minfo=accel random_gen_test.cpp 
$ ./a.out 
Random vector generation time: 1.75203 seconds

As such, I was wondering if there’s an equally simple way of generating normally distributed random numbers in a parallel loop using the cuRAND library instead of the STL <random> library. Any help would be appreciated.

Hi jeffr1992,

In general, running an RNG in parallel is problematic. RNGs contain state, which is often shared, making the RNG unsafe to parallelize. Instead, each iteration of the parallel loop would need to maintain its own state in order to avoid race conditions. While we can do this with cuRAND, the cost of maintaining state for each iteration is high. Plus, you need to pass in a set of randomly generated seeds so each instance of the RNG is unique. Since you're only using one random number per iteration, you're better off calling cuRAND from the host to generate an array of random values and then using this array in the parallel loop.
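
A minimal sketch of that precompute-then-loop pattern follows. The function names and the USE_CURAND macro are hypothetical, chosen only for illustration. When the macro is defined (and the program is linked against cuRAND), the noise array is filled through the cuRAND host API with a host-side generator; otherwise the sketch falls back to <random> so it still builds with any C++ compiler. The OpenACC pragma is simply ignored by compilers without OpenACC support.

```cpp
#include <cstddef>
#include <random>
#include <vector>
#ifdef USE_CURAND  // hypothetical macro: define it when cuRAND is available
#include <curand.h>
#endif

// Fill `noise` with N(mean, std_dev) samples on the host, before the parallel loop runs.
void fill_normal_on_host(std::vector<float> &noise, float mean, float std_dev,
                         unsigned long long seed) {
#ifdef USE_CURAND
    // Host-side cuRAND generator: results are written to host memory.
    // Normal generation with pseudo-random generators needs an even element count.
    curandGenerator_t gen;
    curandCreateGeneratorHost(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, seed);
    curandGenerateNormal(gen, noise.data(), noise.size(), mean, std_dev);
    curandDestroyGenerator(gen);
#else
    // Fallback so the sketch runs without CUDA: same idea, using <random>.
    std::mt19937 engine(static_cast<std::mt19937::result_type>(seed));
    std::normal_distribution<float> sampler(mean, std_dev);
    for (float &z : noise) z = sampler(engine);
#endif
}

// The parallel loop only reads the precomputed noise, so no RNG state is shared.
void add_noise(std::vector<float> &values, const std::vector<float> &noise) {
    float *v = values.data();
    const float *z = noise.data();
    const std::size_t n = values.size();
    #pragma acc parallel loop copy(v[0:n]) copyin(z[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        v[i] += z[i];
    }
}
```

In the original program, fill_vector_with_random_nums() would call something like fill_normal_on_host() once and then run the loop over the resulting array. Error checking of the cuRAND return codes is omitted here for brevity.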

Recently I had the opportunity to work with Johan Carlsson on a pure OpenACC device-side RNG implementation. Like cuRAND, you'd want to use it if you're generating many random numbers per loop iteration. For cases where you're using one random number per iteration, it's still better to precompute the random values. Unlike cuRAND, though, Johan's DES PRNG implementation is much lighter weight, so it has less overhead. For full details on DES PRNG see:

FYI, we are investigating creating a device-side version of std::random, mostly for use with our C++ standard language parallelism support, but hopefully it will work with OpenACC as well. It's still early, so I don't know if/when it will be available in a release.
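
To illustrate what "each iteration maintains its own state" can look like on the device side, here is a rough sketch that derives the random bits from the loop index instead of a shared engine. To be clear about assumptions: this is neither cuRAND nor the DES PRNG mentioned above. It swaps in a SplitMix64-style bit mixer plus a Box-Muller transform, all names are hypothetical, and its statistical quality has not been vetted for serious use.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// SplitMix64-style mixer: scrambles a 64-bit counter into pseudo-random bits.
#pragma acc routine seq
std::uint64_t mix64(std::uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// Map the top 24 bits to a float in (0, 1]; never 0, so log() stays finite.
#pragma acc routine seq
float to_unit_float(std::uint64_t bits) {
    return (static_cast<float>(bits >> 40) + 1.0f) * (1.0f / 16777216.0f);
}

// Normal sample for loop index i. The "state" is just (seed, i), so nothing is shared.
#pragma acc routine seq
float normal_at(std::uint64_t seed, std::uint64_t i, float mean, float std_dev) {
    const float u1 = to_unit_float(mix64(seed ^ mix64(2 * i)));
    const float u2 = to_unit_float(mix64(seed ^ mix64(2 * i + 1)));
    const float radius = std::sqrt(-2.0f * std::log(u1));       // Box-Muller radius
    return mean + std_dev * radius * std::cos(6.2831853f * u2);  // Box-Muller angle
}

// Each iteration computes its own sample from its index; there is no race on RNG state.
void add_noise_in_loop(std::vector<float> &values, std::uint64_t seed) {
    float *v = values.data();
    const std::size_t n = values.size();
    #pragma acc parallel loop copy(v[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        v[i] += normal_at(seed, i, 0.0f, 10.0f);
    }
}
```

The design choice here is that the generator is a pure function of (seed, index), so the same seed always reproduces the same array regardless of how the loop is scheduled. If an iteration needs several samples, one option would be to fold a per-sample counter into the index before mixing.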