curand_uniform in a function with OpenACC

I am trying to use the cuRAND functions declared in the header “openacc_curand.h” inside a function that is called from within a parallel region. My code is based on the example in “test_rand_oacc_cpp/trand8.cpp”.

Calling the function directly in the parallel region works fine, as follows:

#pragma acc parallel num_gangs(num_gangs) copyout(a[0:num_gangs]) private(state)
{
    a[__pgi_gangidx()] = curand_uniform(&state);
}

However, when I wrap the call in a function like so:

#pragma acc routine seq
float rng(curandState_t* p_state)
{
    return curand_uniform(p_state);
}
#pragma acc parallel num_gangs(num_gangs) copyout(a[0:num_gangs]) private(state)
{
    a[__pgi_gangidx()] = rng(&state);
}

I get the link error:
undefined reference to `__pgicudalib_curandUniformXORWOW'

I think I somehow need to tell the compiler that the function rng() should only be compiled for the device, but I haven’t managed to get that working.
Any help would be much appreciated.

Hi rob_v8,

How are you compiling the code?

Calling cuRAND routines from device code requires using our CUDA device code generator, i.e. “-ta=tesla:nollvm”.

-Mat

Hi Mat,

Yes, I compile with:
pgc++ --c++17 -acc -ta=tesla:nollvm -Mcudalib=curand -Minfo=accel -O3 test_rng.cpp -o test.bin

best wishes,
Rob

Hi Rob,

To be clear, you do compile with “nollvm”, but still get the error?

If so, which compiler version are you using and what GNU version do you have installed?

We did have a similar report on StackOverflow (See: https://stackoverflow.com/questions/55132501/openacc-curand-cmake-undefined-reference-to-pgicudalib-curanduniformxorwow) where our back-end CUDA compiler wasn’t getting the proper C++ language flag passed to it when using newer GNU versions. This in turn causes the header file to not include the proper symbols.

We fixed this issue in PGI 19.4 so updating your compiler version should fix the issue. For earlier compiler versions, I include a workaround in my second answer on SO.

If you are using 19.4 or later, please post a full example and I’ll investigate.

-Mat

Hi Mat,

Yes, I still get the error when using “nollvm”.
The compiler version is “pgc++ 19.4-0”.
The GCC version is 7.3.0.

Here is the example I get the error with:

#include <iostream>
#include <openacc.h>
#include "openacc_curand.h"

#pragma acc routine seq
float rng(curandState_t* p_state)
{
    return curand_uniform(p_state);
}

int main(int argc, char* argv[])
{
    int num_gangs = 13;

    float* a = new float[num_gangs];

    curandState_t state;
    #pragma acc parallel num_gangs(num_gangs) copyout(a[0:num_gangs]) private(state)
    {
        curand_init(__pgi_gangidx()*42, 0ULL, 0ULL, &state);

        a[__pgi_gangidx()] = rng(&state); //curand_uniform(&state);
    }

    for( int i = 0; i < num_gangs; ++i )
        std::cout << a[i] <<"\n";
}

Compiled with: “pgc++ -acc -ta=tesla:nollvm -Minfo=accel test_rng2.cpp”
When I call curand_uniform() directly instead of rng(), it compiles and links fine.
I am not sure whether I am missing a clause on the “#pragma acc routine” directive or something else.
Thanks for the help.
best,
Rob

Hi Rob,

Sorry, I should have noticed this before. The problem here is that the default for “acc routine” is to create both a host and a device callable version of the routine. The undefined reference comes from the host version of rng(), which still calls curand_uniform().

There are a couple of ways to fix this:

  1. Add “nohost” to “routine” so no host version is created:
#pragma acc routine seq nohost
float rng(curandState_t* p_state)
{
    return curand_uniform(p_state);
}
  2. Inline the routine:
inline float rng(curandState_t* p_state)
{
    return curand_uniform(p_state);
}

Hope this helps,
Mat

Example:

% cat curand.cpp
#include <iostream>
#include <openacc.h>
#include "openacc_curand.h"

#pragma acc routine seq nohost
inline float rng(curandState_t* p_state)
{
    return curand_uniform(p_state);
}

int main(int argc, char* argv[])
{
    int num_gangs = 13;

    float* a = new float[num_gangs];

    curandState_t state;
    #pragma acc parallel num_gangs(num_gangs) copyout(a[0:num_gangs]) private(state)
    {
       curand_init(__pgi_gangidx()*42, 0ULL, 0ULL, &state);
       a[__pgi_gangidx()] = rng(&state); //curand_uniform(&state);
    }

    for( int i = 0; i < num_gangs; ++i )
        std::cout << a[i] <<"\n";
}

% pgc++ -ta=tesla:nollvm -Mcudalib=curand curand.cpp -V19.4 ; a.out
0.740219
0.0700209
0.742021
0.93318
0.105142
0.401238
0.798647
0.428364
0.530524
0.731388
0.274591
0.322113
0.705976
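
If the same source also has to build without OpenACC enabled (for example with a plain host compiler), one possible sketch is to guard the cuRAND path with the standard “_OPENACC” macro and fall back to a host generator otherwise. The names rng_state_t and fill_uniform and the std::mt19937 fallback are illustrative assumptions, not part of the example above, and the fallback will not reproduce the cuRAND numbers.

```cpp
#include <random>

#ifdef _OPENACC
#include <openacc.h>
#include "openacc_curand.h"

// Hypothetical alias so both build modes share one state type name.
typedef curandState_t rng_state_t;

// Device-only routine: "nohost" suppresses the host copy, so the linker
// never needs a host-side cuRAND symbol.
#pragma acc routine seq nohost
inline float rng(rng_state_t* p_state)
{
    return curand_uniform(p_state);
}
#else
// Assumed host fallback, used only when OpenACC is not enabled.
typedef std::mt19937 rng_state_t;

inline float rng(rng_state_t* p_state)
{
    // Uniform float from the standard library generator.
    return std::generate_canonical<float, 24>(*p_state);
}
#endif

// Hypothetical helper: writes one uniform random number per element of a[0:n].
void fill_uniform(float* a, int n)
{
#ifdef _OPENACC
    rng_state_t state;
    #pragma acc parallel num_gangs(n) copyout(a[0:n]) private(state)
    {
        curand_init(__pgi_gangidx()*42, 0ULL, 0ULL, &state);
        a[__pgi_gangidx()] = rng(&state);
    }
#else
    for (int i = 0; i < n; ++i) {
        rng_state_t state(i * 42u);  // one seed per element, mirroring one seed per gang
        a[i] = rng(&state);
    }
#endif
}
```

Calling fill_uniform(a, 13) from main() would then take the place of the parallel region in the example, with the OpenACC build compiled using the same flags as above.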

Works like a charm! Thanks a lot for the help.
cheers,
Rob