Undefined reference to '__pgicudalib_curandInitXORWOW'

Dear Nvidia developers:
I just tried the following code from the official guide to learn how to generate random numbers on device.
I compiled it with nvfortran -Mcuda=nollvm test_rand.cuf
and got the following error:
nvlink error   : Undefined reference to '__pgicudalib_curandInitXORWOW' in '/tmp/nvfortranuyJd8HZIXaog.o'
pgacclnk: child process exit status 2: /opt/nvidia/hpc_sdk_2020_2011/Linux_x86_64/20.11/compilers/bin/tools/nvdd
My setup is HPC SDK 20.11 with CUDA 11.2; nvidia-smi reports driver version 460.32.03.
Thanks much in advance!

        module mrand
            use curand_device
            integer, parameter :: n = 500
            contains
                attributes(global) subroutine randsub(a)
                real, device :: a(n,n,4)
                type(curandStateXORWOW) :: h
                integer(8) :: seed, seq, offset
                j = blockIdx%x; i = threadIdx%x
                seed = 12345_8 + j*n*n + i*2
                seq = 0_8
                offset = 0_8
                call curand_init(seed, seq, offset, h)
                do k = 1, 4
                    a(i,j,k) = curand_uniform(h)
                end do
                end subroutine
        end module

        program t   ! pgfortran -Mcuda=nollvm t.cuf
        use mrand
        use cudafor ! recognize maxval, minval, sum w/managed
        real, managed :: a(n,n,4)
        a = 0.0
        call randsub<<<n,n>>>(a)
        print *,maxval(a),minval(a),sum(a)/(n*n*4)
        end program

Hi Chen_yang,

The “nollvm” sub-option is no longer supported (the compiler still accepts it, but we don’t support it), and it is also no longer needed. It used to be required in order to bring a CUDA cuRAND device header into the generated device code, but we’ve since found a way to get this to compile correctly using the LLVM back-end. Dropping the sub-option fixes the link error:

% nvfortran rand.cuf -V20.11 -Mcuda=nollvm
nvlink error   : Undefined reference to '__pgicudalib_curandInitXORWOW' in '/tmp/nvfortran7GQoXCZeqk4E.o'
pgacclnk: child process exit status 2: /proj/nv/Linux_x86_64/20.11/compilers/bin/tools/nvdd
% nvfortran rand.cuf -V20.11
% a.out
   0.9999960       1.0117656E-06   0.4987832

Hope this helps,
Mat

Thanks much, Mat!
This works!

Hi Mat,
Based on this code, I have another question.
I want to initialize the generator once and then call curand_uniform again and again. I think I can split the module into two kernels: one that initializes the state and another that calls curand_uniform.
That means I have to pass the state h from one kernel call to the next. My question is: is type(curandStateXORWOW) :: h thread-private? If so, and I want to keep a state variable in global memory, what should I do? Can I just add the device attribute when declaring the variable, like type(curandStateXORWOW), device :: h ?
Any other approach that works would be fine too.
Thanks much!
Chen

No, declared with the device attribute it would be a single variable shared by all threads, and you would get collisions on the state when multiple threads access it.

The easiest thing to do is precompute a device array of random numbers, generated via the host-side cuRAND calls. You’d need to know at least an upper bound on the number of random numbers needed and have each thread index into separate elements of this array.
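
A minimal, untested sketch of this precompute approach follows. It assumes the host-side curand module that ships with the HPC SDK and linking with -cudalib=curand; the program name, the array name r, and the bound nrand are hypothetical and chosen only to mirror the sizes in the original example.

```fortran
        program precompute   ! assumed build line: nvfortran -cuda -cudalib=curand precompute.cuf
        use cudafor           ! maxval, minval, sum on managed data
        use curand            ! host-side cuRAND interfaces
        implicit none
        ! Upper bound on the random numbers needed (hypothetical: same size as a(n,n,4))
        integer, parameter :: nrand = 500*500*4
        real, managed :: r(nrand)
        type(curandGenerator) :: g
        integer :: istat

        ! Create an XORWOW generator and seed it once on the host
        istat = curandCreateGenerator(g, CURAND_RNG_PSEUDO_XORWOW)
        istat = curandSetPseudoRandomGeneratorSeed(g, 12345_8)

        ! Fill the whole array on the device in one library call
        istat = curandGenerateUniform(g, r, nrand)
        istat = cudaDeviceSynchronize()

        print *,maxval(r),minval(r),sum(r)/nrand
        istat = curandDestroyGenerator(g)
        end program
```

A kernel would then take r as a device argument and have each thread read its own, non-overlapping elements, so no generator state is needed on the device at all.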

Another thought is to create a device array of curandState variables (e.g. curandStateXORWOW), essentially privatizing them manually, so that the state persists across multiple kernels. I haven’t tried this myself, so I don’t know exactly how or if it would work, but it may be worth an experiment if you can’t precompute the random numbers.
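
One way this experiment might look is sketched below. It is untested, in keeping with the caveat above, and it assumes that a module-level device array of type(curandStateXORWOW) and plain assignment of that type in device code are both accepted by the compiler. The names mrand2, states, initsub, gensub and t2 are hypothetical.

```fortran
        module mrand2
            use curand_device
            integer, parameter :: n = 500
            ! One generator state per thread, kept in global memory between kernels
            type(curandStateXORWOW), device :: states(n,n)
            contains
                ! Kernel 1: seed every thread's state exactly once
                attributes(global) subroutine initsub()
                integer(8) :: seed, seq, offset
                integer :: i, j
                j = blockIdx%x; i = threadIdx%x
                seed = 12345_8 + j*n*n + i*2
                seq = 0_8
                offset = 0_8
                call curand_init(seed, seq, offset, states(i,j))
                end subroutine

                ! Kernel 2: draw numbers, continuing from the saved state
                attributes(global) subroutine gensub(a)
                real, device :: a(n,n,4)
                type(curandStateXORWOW) :: h
                integer :: i, j, k
                j = blockIdx%x; i = threadIdx%x
                h = states(i,j)          ! thread-local working copy
                do k = 1, 4
                    a(i,j,k) = curand_uniform(h)
                end do
                states(i,j) = h          ! save the advanced state for the next launch
                end subroutine
        end module

        program t2   ! nvfortran t2.cuf
        use mrand2
        use cudafor
        real, managed :: a(n,n,4)
        integer :: istat
        a = 0.0
        call initsub<<<n,n>>>()
        call gensub<<<n,n>>>(a)      ! first batch
        istat = cudaDeviceSynchronize()
        print *,maxval(a),minval(a),sum(a)/(n*n*4)
        call gensub<<<n,n>>>(a)      ! second batch, no re-initialization
        istat = cudaDeviceSynchronize()
        print *,maxval(a),minval(a),sum(a)/(n*n*4)
        end program
```

Because each thread only ever touches states(i,j) for its own i and j, there are no collisions on the state, and the two printed lines should differ if the saved state really carries over between launches.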

-Mat

Many thanks, Mat!