Undefined reference to '__pgicudalib_curandInitXORWOW'

Chen_Yang · January 28, 2021, 10:26pm

Dear Nvidia developers:
I just tried the following code from the official guide to learn how to generate random numbers on the device.
I compiled it with nvfortran -Mcuda=nollvm test_rand.cuf
and got the following error:
nvlink error : Undefined reference to '__pgicudalib_curandInitXORWOW' in '/tmp/nvfortranuyJd8HZIXaog.o'
pgacclnk: child process exit status 2: /opt/nvidia/hpc_sdk_2020_2011/Linux_x86_64/20.11/compilers/bin/tools/nvdd
I am using HPC SDK 20.11 with CUDA 11.2; nvidia-smi 460.32.03 reports driver version 460.32.03.
Thanks much in advance!

        module mrand
            use curand_device
            integer, parameter :: n = 500
            contains
                attributes(global) subroutine randsub(a)
                real, device :: a(n,n,4)
                type(curandStateXORWOW) :: h
                integer(8) :: seed, seq, offset
                j = blockIdx%x; i = threadIdx%x
                seed = 12345_8 + j*n*n + i*2
                seq = 0_8
                offset = 0_8
                call curand_init(seed, seq, offset, h)
                do k = 1, 4
                    a(i,j,k) = curand_uniform(h)
                end do
                end subroutine
        end module

        program t   ! pgfortran -Mcuda=nollvm t.cuf
        use mrand
        use cudafor ! recognize maxval, minval, sum w/managed
        real, managed :: a(n,n,4)
        a = 0.0
        call randsub<<<n,n>>>(a)
        print *,maxval(a),minval(a),sum(a)/(n*n*4)
        end program

MatColgrove · January 28, 2021, 10:50pm

Hi Chen_yang,

The “nollvm” sub-option is no longer supported (it is still available, just not supported), and it is also no longer needed. It used to be required in order to bring a CUDA curand device header into the generated device code, but we’ve since found a way to get it to compile correctly using the LLVM back-end.

% nvfortran rand.cuf -V20.11 -Mcuda=nollvm
nvlink error   : Undefined reference to '__pgicudalib_curandInitXORWOW' in '/tmp/nvfortran7GQoXCZeqk4E.o'
pgacclnk: child process exit status 2: /proj/nv/Linux_x86_64/20.11/compilers/bin/tools/nvdd
% nvfortran rand.cuf -V20.11
% a.out
   0.9999960       1.0117656E-06   0.4987832

Hope this helps,
Mat

Chen_Yang · January 28, 2021, 10:53pm

Thanks much, Mat!
This works!

Chen_Yang · February 8, 2021, 3:15am

Hi Mat,
So based on this code, I have another question.
I want to initialize once and then call curand_uniform again and again. So I think I can split the module into two kernels: one to initialize and another to call curand_uniform.
That means I have to pass the state h from the initialization call to the later calls of curand_uniform. My question: is type(curandStateXORWOW) :: h thread-private? If so, and I want a state variable in global memory, what should I do? Do I just add the device attribute after the type when declaring the variable, like type(curandStateXORWOW), device :: h?
Any other approach that works is fine with me too.
Thanks much!
Chen

MatColgrove · February 8, 2021, 3:18pm

No, a single state declared with the device attribute would be shared in this case, and that would cause collisions on the state variable when it is accessed by multiple threads.

The easiest thing to do is precompute a device array of random numbers, generated via the host side cuRand calls. You’d need to know at least an upper bound on the number of random numbers needed and have each thread index into separate elements of this array.
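For readers who want to try the precompute approach, here is a minimal, untested sketch. It assumes the host-side interfaces in the HPC SDK `curand` module (`curandCreateGenerator`, `curandSetPseudoRandomGeneratorSeed`, `curandGenerateUniform`, `curandDestroyGenerator`) and a link flag such as `-cudalib=curand`. The kernel name `consume`, the array `r`, and the linear indexing are made up for illustration and are not from the thread.

```fortran
! Sketch only: fill a device array from the host, then let every
! thread read its own elements. Assumed build line (not verified):
!   nvfortran -cuda -cudalib=curand precompute.cuf
module mconsume
    integer, parameter :: n = 500
contains
    attributes(global) subroutine consume(r, a)
        real :: r(n*n*4)     ! precomputed random numbers (device memory)
        real :: a(n,n,4)     ! per-thread results
        integer :: i, j, k, idx
        j = blockIdx%x; i = threadIdx%x
        do k = 1, 4
            ! hypothetical layout: one distinct element per (i,j,k)
            idx = i + (j-1)*n + (k-1)*n*n
            a(i,j,k) = r(idx)
        end do
    end subroutine consume
end module mconsume

program precompute
    use mconsume
    use cudafor
    use curand               ! host-side cuRAND interfaces (assumed)
    type(curandGenerator) :: g
    real, device, allocatable :: r(:)
    real, managed :: a(n,n,4)
    integer :: istat

    allocate(r(n*n*4))
    istat = curandCreateGenerator(g, CURAND_RNG_PSEUDO_XORWOW)
    istat = curandSetPseudoRandomGeneratorSeed(g, 12345_8)
    ! one host call generates all n*n*4 numbers on the device
    istat = curandGenerateUniform(g, r, n*n*4)
    call consume<<<n,n>>>(r, a)
    istat = cudaDeviceSynchronize()
    print *, maxval(a), minval(a), sum(a)/(n*n*4)
    istat = curandDestroyGenerator(g)
    deallocate(r)
end program precompute
```

Since no generator state lives in the kernel, nothing has to persist between launches. The cost is that the array must be sized for the upper bound on how many numbers the kernels will consume.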

Another thought is to create a device array of cuRandState variables, essentially privatizing them manually, so the state can persist across multiple kernels. I haven’t tried this myself, so I don’t know exactly how, or whether, it would work, but it may be worth an experiment if you can’t precompute the random numbers.

-Mat

Chen_Yang · February 9, 2021, 6:04pm

Many thanks, Mat!

Chen_Yang · June 5, 2021, 7:09pm

Hi Mat,
I think you are right. If I know how many random numbers I need, how can I set up a device array of cuRandState in global memory?

Do you think type(curandStateXORWOW), allocatable, device :: h(:) works? Thanks again!

Then each thread can use one element of this array as its state. I would just need to copy back and forth between local and global memory. But right now I don’t know how to create a derived-type variable in global memory.

I checked type(curandStateXORWOW) in PGI’s manual, and its member variables are not declared with device/managed. So is it OK to add device/managed only on the wrapping declaration of type(curandStateXORWOW)?

Or perhaps I could declare the cuRandState type myself instead of using the built-in type(curandStateXORWOW)? Then I could add managed to both the members and the wrapper?

Thank you very much!
Chen

MatColgrove · June 7, 2021, 4:21pm

Hi Chen,

"if I know the number of random-numbers."

If you do know the total number of random numbers, it’s still best to precompute the set by calling cuRand from the host, filling an array with the results, and then passing this array into the kernel.

However, if you want each thread to compute an arbitrary number of them on the device, then you could do something like the following. (Yes, you can allocate an array of states.) Note that I just wrote this, have only done basic testing, and have not checked whether the distribution is good.

Example:

% cat rand.cuf
         module mrand
                use curand_device
                integer, parameter :: n = 500
                type(curandStateXORWOW),device,allocatable,dimension(:,:) :: h
                contains
                    attributes(global) subroutine createRand()
                       integer(8) :: seed,seq,offset,i,j
                       !type(curandStateXORWOW),dimension(:,:) :: h
                       j = blockIdx%x; i = threadIdx%x
                       seed = 12345_8 + j*n*n + i*2
                       seq = 0_8
                       offset = 0_8
                       call curand_init(seed, seq, offset, h(i,j))
                    end subroutine createRand

                    attributes(global) subroutine randsub(a)
                    !type(curandStateXORWOW),dimension(:,:) :: h
                    real :: a(n,n,4)
                    integer(8) :: i,j,k
                    j = blockIdx%x; i = threadIdx%x
                    do k = 1, 4
                        a(i,j,k) = curand_uniform(h(i,j))
                    end do
                    end subroutine
         end module

         program t
            use mrand
            use cudafor ! recognize maxval, minval, sum w/managed
            real, managed :: a(n,n,4)
            a = 0.0
            allocate(h(n,n))
            call createRand<<<n,n>>>()
            call randsub<<<n,n>>>(a)
            print *,maxval(a),minval(a),sum(a)/(n*n*4)
            call randsub<<<n,n>>>(a)
            print *,maxval(a),minval(a),sum(a)/(n*n*4)
            call randsub<<<n,n>>>(a)
            print *,maxval(a),minval(a),sum(a)/(n*n*4)
         end program
% nvfortran rand.cuf; a.out
       0.9999960       1.0117656E-06   0.4987832
       0.9999998       3.0419324E-07   0.5002192
       0.9999997       1.4506513E-06   0.5000274

-Mat

Chen_Yang · June 7, 2021, 4:26pm

Thank you so much, Mat! This is awesome! I will get back to you after I apply it in my code. It is a long-term development effort. :)