Simple way of generating normally distributed random numbers with cuRAND in parallel loop

jeffr1992 · January 11, 2021, 12:10am

I’m trying to generate normally distributed random numbers in an OpenACC parallel loop, but since the C++ STL <random> library doesn’t work with parallel code, I had a look at the cuRAND library. However, after looking around at various websites and guides, there seem to be many ways to use cuRAND, all of which look quite complicated (thread IDs, block IDs, mallocs, deviceptr directives, etc.). As such, I thought I’d give a simple example of using the <random> library, and ask if there’s an equivalently simple way of doing this with cuRAND. The following program generates a vector of random numbers using the <random> library:

#include <random>
#include <vector>
#include <iostream>
#include <chrono>

class RandomVectorGenerator {
    public:
        RandomVectorGenerator(const size_t num_elements, const float initial_value);

        float add_random_noise_to_num(const float x);
        void fill_vector_with_random_nums();

    private:
        size_t num_elements_;
        std::vector<float> random_nums_;

        std::random_device seed_;
        std::ranlux24_base random_engine_;
};

RandomVectorGenerator::RandomVectorGenerator(const size_t num_elements, const float initial_value) {
    num_elements_ = num_elements;
    random_nums_ = std::vector<float>(num_elements_, initial_value);
    
    random_engine_ = std::ranlux24_base(seed_());

    #pragma acc enter data copyin(this)
}

// #pragma acc routine seq
float RandomVectorGenerator::add_random_noise_to_num(const float x) {
    float mean = 0.0f;
    float std_dev = 10.0f;
    std::normal_distribution<float> random_num_sampler(mean, std_dev);

    return x + random_num_sampler(random_engine_);
}

void RandomVectorGenerator::fill_vector_with_random_nums() {
    float *random_nums_ptr = random_nums_.data();

    // #pragma acc parallel loop copy(random_nums_ptr)
    for(size_t i = 0; i < num_elements_; ++i) {
        random_nums_ptr[i] = add_random_noise_to_num(random_nums_ptr[i]);
    }
}

int main() {
    size_t num_elements = 10000000;
    float initial_value = 10.0f;

    RandomVectorGenerator random_vector_generator(num_elements, initial_value);
    
    auto start = std::chrono::system_clock::now();

    random_vector_generator.fill_vector_with_random_nums();

    auto end = std::chrono::system_clock::now();
    std::chrono::duration<float> diff = end - start;
    std::cout << "Random vector generation time: " << diff.count() << " seconds\n";
}

Here, the seed source and the random number engine are declared as class members, and the engine is seeded in the RandomVectorGenerator constructor:

std::random_device seed_;
std::ranlux24_base random_engine_;

Furthermore, the add_random_noise_to_num() function is used to produce random numbers using a normal distribution as follows:

std::normal_distribution<float> random_num_sampler(mean, std_dev);
return x + random_num_sampler(random_engine_);

Finally, main() generates the random numbers and times the call. Compiling and running the program produces:

$ nvc++ -O3 -acc -Minfo=accel random_gen_test.cpp 
$ ./a.out 
Random vector generation time: 1.75203 seconds

As such, I was wondering if there’s an equally simple way of generating normally distributed random numbers in a parallel loop using the cuRAND library instead of the STL <random> library. Any help would be appreciated.

MatColgrove · January 11, 2021, 5:21pm

Hi jeffr1992,

In general, running an RNG in parallel is problematic. RNGs contain state, which is often shared, making the RNG unsafe to parallelize. Instead, each iteration of the parallel loop would need to maintain its own state in order to avoid race conditions. While we can do this with cuRAND, the cost of maintaining state for each iteration is high. Plus, you need to pass in a set of randomly generated seeds so each instance of the RNG is unique. Since you’re only using one random number per iteration, you’re better off calling cuRAND from the host to generate an array of random values and then using this array in the parallel loop.
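A minimal sketch of the "generate on the host API, consume in the parallel loop" approach follows. It assumes nvc++ targeting a GPU, compiled with something like `nvc++ -acc -cudalib=curand`. The function name `add_normal_noise` and the fallback path are illustrative, not from the original post. When the file is built without OpenACC, the noise is filled with <random> instead so the same code still runs on the host. cuRAND return codes are ignored for brevity.

```cpp
#include <cstddef>
#include <random>
#include <vector>

#ifdef _OPENACC
#include <curand.h>
#include <openacc.h>
#endif

// Adds N(mean, std_dev) noise to every element of `values` (hypothetical helper).
// All noise is produced by one bulk call, then consumed by a parallel loop,
// so no RNG state is touched inside the loop.
void add_normal_noise(std::vector<float>& values, float mean, float std_dev,
                      unsigned long long seed) {
    const std::size_t n = values.size();
    // cuRAND's pseudo-random normal generation requires an even count,
    // so the noise buffer is padded by one element when n is odd.
    const std::size_t n_noise = n + (n % 2);
    std::vector<float> noise_buf(n_noise);
    float* data = values.data();
    float* noise = noise_buf.data();

#ifdef _OPENACC
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, seed);
    // Assumption: run cuRAND on the same CUDA stream OpenACC uses for
    // synchronous work, so the loop below sees the finished noise.
    curandSetStream(gen, (cudaStream_t)acc_get_cuda_stream(acc_async_sync));
#else
    // Host fallback when OpenACC is disabled: fill the buffer serially.
    std::mt19937 engine(static_cast<std::mt19937::result_type>(seed));
    std::normal_distribution<float> sampler(mean, std_dev);
    for (std::size_t i = 0; i < n_noise; ++i) noise[i] = sampler(engine);
#endif

    #pragma acc data copy(data[0:n]) create(noise[0:n_noise])
    {
#ifdef _OPENACC
        // host_data hands the device address of `noise` to the cuRAND host API.
        #pragma acc host_data use_device(noise)
        {
            curandGenerateNormal(gen, noise, n_noise, mean, std_dev);
        }
#endif
        #pragma acc parallel loop present(data[0:n], noise[0:n_noise])
        for (std::size_t i = 0; i < n; ++i) {
            data[i] += noise[i];
        }
    }

#ifdef _OPENACC
    curandDestroyGenerator(gen);
#endif
}
```

With the class from the question, `fill_vector_with_random_nums()` could call `add_normal_noise(random_nums_, 0.0f, 10.0f, seed_())` in place of the per-element loop. For repeated calls it would be cheaper to create the generator once and reuse it.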

Recently I had the opportunity to work with Johan Carlsson on a pure OpenACC device-side RNG implementation. Like cuRAND, you’d want to use it if you’re generating many random numbers per loop iteration. For cases where you’re using one random number per iteration, it’s still better to precompute the random values. Unlike cuRAND, though, Johan’s DES PRNG implementation is much lighter weight, so it has less overhead. For full details on DES PRNG see: Pseudo Random Number Generation by Lightweight Threads | OpenACC
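To illustrate the general idea of a device-side generator where each loop iteration owns its state, here is a hypothetical counter-based sketch. It is not the DES PRNG and not cuRAND's device API: it hashes (seed, iteration index) with a SplitMix64-style mixer and applies a Box-Muller transform. All function names are made up for this example, and the mixer has not been vetted for statistical quality, so treat it as a demonstration of the pattern only.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// SplitMix64-style mixer used as a stateless hash:
// the same input always yields the same output, so no shared state exists.
#pragma acc routine seq
inline std::uint64_t mix64(std::uint64_t z) {
    z += 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

// Maps the top 24 bits to a float in (0, 1], so log() below stays finite.
#pragma acc routine seq
inline float to_unit_float(std::uint64_t bits) {
    return (static_cast<float>(bits >> 40) + 1.0f) * (1.0f / 16777216.0f);
}

// One normal sample derived only from (seed, counter) via Box-Muller.
#pragma acc routine seq
inline float normal_from_counter(std::uint64_t seed, std::uint64_t counter,
                                 float mean, float std_dev) {
    const float u1 = to_unit_float(mix64(seed ^ mix64(2 * counter)));
    const float u2 = to_unit_float(mix64(seed ^ mix64(2 * counter + 1)));
    const float two_pi = 6.28318530718f;
    const float z = std::sqrt(-2.0f * std::log(u1)) * std::cos(two_pi * u2);
    return mean + std_dev * z;
}

// Each iteration uses its own index as the counter, so iterations are
// independent and the loop can run in parallel without race conditions.
void add_noise_per_iteration(std::vector<float>& values, float mean,
                             float std_dev, std::uint64_t seed) {
    float* data = values.data();
    const std::size_t n = values.size();
    #pragma acc parallel loop copy(data[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        data[i] += normal_from_counter(seed, i, mean, std_dev);
    }
}
```

Because the result depends only on the seed and the index, two runs with the same seed give identical output regardless of how the loop is scheduled. Calling it again on the same data would need a different seed (or a counter offset) to avoid reusing the same noise.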

FYI, we are investigating creating a device-side version of std::random. It’s mostly for use with our C++ standard language parallelism support, but hopefully it will work with OpenACC as well. It’s still early, so I don’t know if/when it will be available in a release.

-Mat
