Undefined reference to `__pgiLaunchKernel' when linking CUDA Fortran subroutine

Hello,

I’m trying to call a Fortran procedure that uses CUDA Fortran from a C++ main function that uses OpenMP directives.

For some reason, I cannot seem to get the linking order correct:

$ make main8 NVCCFLAGS=
nvfortran -O3 -cuda -static-nvidia -Minfo=all -c jacobi_kernel.mod.cuf
nvc++   -cuda -o main8 main8.cpp jacobi_kernel.mod.o 
main8.cpp:
jacobi_kernel.mod.o: In function `jacobi_kernel':
/home/ivan/jacobi_kernel.mod.cuf:38: undefined reference to `__pgiLaunchKernel'
pgacclnk: child process exit status 1: /usr/bin/ld

Is there a missing library I should append to the link line?

It works when I do it the other way round using nvfortran for linking:

$ nvc++ -mp=gpu -c main8.cpp 
$ nvfortran -mp=gpu -Mnomain main8.o jacobi_kernel.mod.o -cuda -lstdc++
$ 

(note I had to disable the hidden Fortran main symbol and link the C++ standard library explicitly)

I searched for similar threads, and the answer was always just to append -cuda, but I tried that and it didn’t work. I am using NVHPC-SDK v23.7:

nvc++ 23.7-0 64-bit target on x86-64 Linux -tp zen3 
NVIDIA Compilers and Tools

Try adding the flag “-cudaforlibs”. This will bring in the needed CUDA Fortran libraries.

-Mat

Thanks!

That was exactly what was missing:

$ make main8
nvfortran -O3 -cuda -static-nvidia -Minfo=all -c jacobi_kernel.mod.cuf
nvc++ -mp=gpu  -cuda -cudaforlibs -o main8 main8.cpp jacobi_kernel.mod.o 
main8.cpp:

I have a follow-up question: how can I synchronize OpenMP and CUDA procedures? These will implicitly go to different streams.

For instance with something like this:

// CUDA Fortran Subroutine
//   - gridA and gridB are device pointers
//   - uses the default stream
extern "C" void jacobi(double *gridA, double *gridB, int nx, int ny);

// ...
double *gridA, *gridB;
double t1, t2;

#pragma omp target data map(tofrom: gridA[:nx*ny], gridB[:nx*ny])
{
    t1 = dtime();
    for (int it = 0; it < niters; it++) {
        #pragma omp target data use_device_ptr(gridA,gridB)
        jacobi(gridA, gridB, nx, ny);

        std::swap(gridA,gridB);
    }
    //cudaDeviceSynchronize();
    t2 = dtime();
}

I noticed I was getting wrong timings until I inserted the synchronization. However, I’d like the top-level (calling) procedure to remain CUDA-agnostic. Is there a way to achieve this with OpenMP?

I’m not seeing anything here that would be non-blocking so I’m assuming you’re launching the CUDA kernel in jacobi without a sync. I’m also assuming that since there’s likely a dependency between each iteration you’re launching each jacobi CUDA kernel on the same stream.

Likely the easiest thing to do is call cudaDeviceSynchronize after the jacobi kernel. Yes this makes it block but if my assumptions are correct, the only cost would be that the launch latency isn’t hidden, which isn’t much.
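
For example, a minimal sketch of that (assuming your C++ caller is allowed to include cuda_runtime.h and call the CUDA runtime directly):

#include <cuda_runtime.h>

// ...
    for (int it = 0; it < niters; it++) {
        #pragma omp target data use_device_ptr(gridA,gridB)
        jacobi(gridA, gridB, nx, ny);

        // block until the kernel launched inside jacobi has completed
        cudaDeviceSynchronize();

        std::swap(gridA,gridB);
    }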

Though if you really do need to sync at the OpenMP level, it might be a bit more complicated. If you were using OpenACC, things would be easier since the async queues map directly to CUDA streams and you can manage multiple streams using multiple queues. But with OpenMP, async is tied to a task, and here there’s only a single task. So you’ll likely need to get the stream that the task is using, via ompx_get_cuda_stream(), and then assign the jacobi kernel to use the task’s stream. Then use “taskwait” at the end of the iteration loop to sync.

I’m not 100% sure on the details since I’ve not tried this before, but hopefully this gives you some direction.

We do use ompx_get_cuda_stream in some of our CUDA Library examples, such as “<NVHPC_DIR>/examples/CUDA-Libraries/cuFFT/test_fft_omp_c/tcufft2dompc1.c” which may provide some ideas.

Your assumptions are correct. This was mainly just a toy example, and the jacobi function looked like this:

   subroutine jacobi(gridA,gridB,width,height) bind(c)
      use iso_c_binding, only: c_int
      use cudafor, only: dim3
      ! wp (working precision kind) and the kernel jk are defined in the enclosing module
      integer(c_int), value :: width, height
      real(wp), intent(inout), device :: gridA(width,height)
      real(wp), intent(in), device :: gridB(width,height)
      type(dim3) :: grid, tblock
      tblock = dim3(128,4,1)
      grid = dim3((width + tblock%x - 1)/tblock%x, &
                  (height + tblock%y - 1)/tblock%y, 1)
      call jk<<<grid,tblock>>>(gridA,gridB,width,height)
   end subroutine

Thanks for the pointers w.r.t. the stream handling. I found some advice for mapping between target constructs and streams here: HPC Compilers User's Guide Version 24.9 for ARM, OpenPower, x86

I think I got the gist, so I’d need to do something like:

subroutine jacobi(gridA,gridB,width,height,stream) bind(c)
! ...
   integer(kind=cuda_stream_kind), optional :: stream
      ! received by reference from a (cudaStream_t *) on the C side;
      ! with optional + bind(c), passing NULL from C means "not present"

! ...
   if (present(stream)) then
      call jk<<<grid,tblock,0,stream>>>(...)
   else
      ! default stream = 0
      call jk<<<grid,tblock>>>(...)
      ! >>> alternative would be to synchronize here <<<
   end if

end subroutine

In the C code I’d then have to use:

#include <cuda_runtime.h>

extern "C" void jacobi(double *gridA, 
                       double *gridB, 
                       int nx, 
                       int ny, 
                       cudaStream_t *stream);
// ...
cudaStream_t stream = 
    (cudaStream_t) ompx_get_cuda_stream(omp_get_default_device(), 0);

// ...
    t1 = dtime();
    for (int it = 0; it < niters; it++) {
        #pragma omp target data use_device_ptr(gridA,gridB)
        jacobi(gridA, gridB, nx, ny, &stream);
        std::swap(gridA,gridB);
    }
    #pragma omp taskwait depend(in: stream)
    t2 = dtime();

I’m not entirely sure the taskwait would work like this, because omp target data is not a task-generating construct and the stream instance isn’t referenced in any previous depend clause. It could be added to the data-mapping constructs, but I’d like to measure only the computational part. I noticed the OpenMP standard has a depend object, but it is unsupported at the moment. I think this would look like this:

cudaStream_t stream = 
    (cudaStream_t) ompx_get_cuda_stream(omp_get_default_device(), 0);

// We use a user-provided dependency object, since the stream
// doesn't appear explicitly in a depend clause of a task-generating construct
omp_depend_t obj;
#pragma omp depobj(obj) depend(inout: stream)

#pragma omp target data map(tofrom: gridA[:nx*ny], gridB[:nx*ny])
{
    t1 = dtime();
    for (int it = 0; it < niters; it++) {
        #pragma omp target data use_device_ptr(gridA,gridB)
        jacobi(gridA, gridB, nx, ny, &stream);
        #pragma omp depobj(obj) update(inout)
        std::swap(gridA,gridB);
    }
    #pragma omp taskwait depend(depobj: obj)
    t2 = dtime();
}

#pragma omp depobj(obj) destroy 

Obviously, this is a lot of extra typing to achieve essentially the same thing as cudaStreamSynchronize().
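
For comparison, if the calling code were allowed to touch the CUDA runtime directly (which is what I’d like to avoid), the whole thing would collapse to a single stream synchronization, e.g. (sketch):

#include <cuda_runtime.h>

// ...
cudaStream_t stream =
    (cudaStream_t) ompx_get_cuda_stream(omp_get_default_device(), 0);

#pragma omp target data map(tofrom: gridA[:nx*ny], gridB[:nx*ny])
{
    t1 = dtime();
    for (int it = 0; it < niters; it++) {
        #pragma omp target data use_device_ptr(gridA,gridB)
        jacobi(gridA, gridB, nx, ny, &stream);
        std::swap(gridA,gridB);
    }
    cudaStreamSynchronize(stream);  // wait for all jacobi kernels queued on this stream
    t2 = dtime();
}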

An empty task-generating construct like this would probably also work (and involves less typing):

#pragma omp target depend(inout:stream)
{ /* empty */ }

In either case, it now makes sense to me that an explicit mechanism is needed to synchronize the OpenMP runtime with the CUDA runtime.
