Undefined reference to `__pgiLaunchKernel' when linking CUDA Fortran subroutine

Hello,

I’m trying to call a Fortran procedure that uses CUDA Fortran from a C++ main function that uses OpenMP directives.

For some reason, I cannot seem to get the linking order correct:

$ make main8 NVCCFLAGS=
nvfortran -O3 -cuda -static-nvidia -Minfo=all -c jacobi_kernel.mod.cuf
nvc++   -cuda -o main8 main8.cpp jacobi_kernel.mod.o 
main8.cpp:
jacobi_kernel.mod.o: In function `jacobi_kernel':
/home/ivan/jacobi_kernel.mod.cuf:38: undefined reference to `__pgiLaunchKernel'
pgacclnk: child process exit status 1: /usr/bin/ld

Is there a missing library I should append to the link line?

It works when I do it the other way round using nvfortran for linking:

$ nvc++ -mp=gpu -c main8.cpp 
$ nvfortran -mp=gpu -Mnomain main8.o jacobi_kernel.mod.o -cuda -lstdc++
$ 

(note I had to disable the hidden Fortran main symbol and link the C++ standard library explicitly)

I searched for similar threads, and the answer was always just to append -cuda, but I tried that and it didn’t work. I am using NVHPC-SDK v23.7:

nvc++ 23.7-0 64-bit target on x86-64 Linux -tp zen3 
NVIDIA Compilers and Tools

Try adding the flag “-cudaforlibs”. This will bring in the needed CUDA Fortran libraries.

-Mat

Thanks!

That was exactly what was missing:

$ make main8
nvfortran -O3 -cuda -static-nvidia -Minfo=all -c jacobi_kernel.mod.cuf
nvc++ -mp=gpu  -cuda -cudaforlibs -o main8 main8.cpp jacobi_kernel.mod.o 
main8.cpp:

I have a follow-up question: how can I synchronize OpenMP and CUDA procedures? These will implicitly go to different streams.

For instance with something like this:

// CUDA Fortran Subroutine
//   - gridA and gridB are device pointers
//   - uses the default stream
extern "C" void jacobi(double *gridA, double *gridB, int nx, int ny);

// ...
double *gridA, *gridB;
double t1, t2;

#pragma omp target data map(tofrom: gridA[:nx*ny], gridB[:nx*ny])
{
    t1 = dtime();
    for (int it = 0; it < niters; it++) {
        #pragma omp target data use_device_ptr(gridA,gridB)
        jacobi(gridA, gridB, nx, ny);

        std::swap(gridA,gridB);
    }
    //cudaDeviceSynchronize();
    t2 = dtime();
}

I noticed I was getting wrong timings until I inserted the synchronization. However, I’d like the top-level (calling) procedure to remain CUDA-agnostic. Is there a way to achieve this with OpenMP?

I’m not seeing anything here that would be non-blocking so I’m assuming you’re launching the CUDA kernel in jacobi without a sync. I’m also assuming that since there’s likely a dependency between each iteration you’re launching each jacobi CUDA kernel on the same stream.

Likely the easiest thing to do is call cudaDeviceSynchronize after the jacobi kernel. Yes this makes it block but if my assumptions are correct, the only cost would be that the launch latency isn’t hidden, which isn’t much.
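
For example, a minimal sketch of that (assuming your C++ caller is allowed to include cuda_runtime.h and call the CUDA runtime directly):

#include <cuda_runtime.h>

// ...
    for (int it = 0; it < niters; it++) {
        #pragma omp target data use_device_ptr(gridA,gridB)
        jacobi(gridA, gridB, nx, ny);

        // block until the kernel launched inside jacobi has completed
        cudaDeviceSynchronize();

        std::swap(gridA,gridB);
    }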

Though if you really do need to sync at the OpenMP level, it might be a bit more complicated. If you were using OpenACC, things would be easier since the async queues map directly to CUDA streams and you can manage multiple streams using multiple queues. But with OpenMP, async is tied to a task, and here there’s only a single task. So you’ll likely need to get the stream that the task is using, via ompx_get_cuda_stream(), and then assign the jacobi kernel to use the task’s stream. Then use “taskwait” at the end of the iteration loop to sync.

I’m not 100% sure on the details since I’ve not tried this before, but hopefully this gives you some direction.

We do use ompx_get_cuda_stream in some of our CUDA Library examples, such as “<NVHPC_DIR>/examples/CUDA-Libraries/cuFFT/test_fft_omp_c/tcufft2dompc1.c” which may provide some ideas.

Your assumptions are correct. This was mainly just a toy example, and the jacobi function looked like this:

   subroutine jacobi(gridA,gridB,width,height) bind(c)
      use iso_c_binding, only: c_int
      use cudafor, only: dim3
      ! wp (working precision kind) and the kernel jk are defined in the enclosing module
      integer(c_int), value :: width, height
      real(wp), intent(inout), device :: gridA(width,height)
      real(wp), intent(in), device :: gridB(width,height)
      type(dim3) :: grid, tblock
      tblock = dim3(128,4,1)
      grid = dim3((width + tblock%x - 1)/tblock%x, &
                  (height + tblock%y - 1)/tblock%y, 1)
      call jk<<<grid,tblock>>>(gridA,gridB,width,height)
   end subroutine

Thanks for the pointers w.r.t. the stream handling. I found some advice for mapping between target constructs and streams here: HPC Compilers User's Guide Version 24.9 for ARM, OpenPower, x86

I think I got the gist, so I’d need to do something like:

subroutine jacobi(gridA,gridB,width,height,stream) bind(c)
! ...
   integer(kind=cuda_stream_kind), optional :: stream
      ! received by reference from a (cudaStream_t *) on the C side;
      ! with optional + bind(c), passing NULL from C means "not present"

! ...
   if (present(stream)) then
      call jk<<<grid,tblock,0,stream>>>(...)
   else
      ! default stream = 0
      call jk<<<grid,tblock>>>(...)
      ! >>> alternative would be to synchronize here <<<
   end if

end subroutine

In the C code I’d then have to use:

#include <cuda_runtime.h>

extern "C" void jacobi(double *gridA, 
                       double *gridB, 
                       int nx, 
                       int ny, 
                       cudaStream_t *stream);
// ...
cudaStream_t stream = 
    (cudaStream_t) ompx_get_cuda_stream(omp_get_default_device(), 0);

// ...
    t1 = dtime();
    for (int it = 0; it < niters; it++) {
        #pragma omp target data use_device_ptr(gridA,gridB)
        jacobi(gridA, gridB, nx, ny, &stream);
        std::swap(gridA,gridB);
    }
    #pragma omp taskwait depend(in: stream)
    t2 = dtime();

I’m not entirely sure the taskwait would work like this, because omp target data is not a task-generating construct and the stream instance isn’t referenced in any previous depend clause. It could be added to the data-mapping constructs, but I’d like to measure only the computational part. I noticed the OpenMP standard has a depend object, but it is unsupported at the moment. I think this would look like this:

cudaStream_t stream = 
    (cudaStream_t) ompx_get_cuda_stream(omp_get_default_device(), 0);

// We use a user-provided dependency object, since the stream
// doesn't appear explicitly in a depend clause of a task-generating construct
omp_depend_t obj;
#pragma omp depobj(obj) depend(inout: stream)

#pragma omp target data map(tofrom: gridA[:nx*ny], gridB[:nx*ny])
{
    t1 = dtime();
    for (int it = 0; it < niters; it++) {
        #pragma omp target data use_device_ptr(gridA,gridB)
        jacobi(gridA, gridB, nx, ny, &stream);
        #pragma omp depobj(obj) update(inout)
        std::swap(gridA,gridB);
    }
    #pragma omp taskwait depend(depobj: obj)
    t2 = dtime();
}

#pragma omp depobj(obj) destroy 

Obviously, this is a lot of extra typing to achieve essentially the same thing as cudaStreamSynchronize().
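
For comparison, if the calling code were allowed to touch the CUDA runtime directly (which is what I’d like to avoid), the whole thing would collapse to a single stream synchronization, e.g. (sketch):

#include <cuda_runtime.h>

// ...
cudaStream_t stream =
    (cudaStream_t) ompx_get_cuda_stream(omp_get_default_device(), 0);

#pragma omp target data map(tofrom: gridA[:nx*ny], gridB[:nx*ny])
{
    t1 = dtime();
    for (int it = 0; it < niters; it++) {
        #pragma omp target data use_device_ptr(gridA,gridB)
        jacobi(gridA, gridB, nx, ny, &stream);
        std::swap(gridA,gridB);
    }
    cudaStreamSynchronize(stream);  // wait for all jacobi kernels queued on this stream
    t2 = dtime();
}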

An empty task-generating construct like this would probably also work (and involves less typing):

#pragma omp target depend(inout:stream)
{ /* empty */ }

In either case, it now makes sense to me that an explicit mechanism is needed to synchronize the OpenMP runtime with the CUDA runtime.
