Issues with migrating OpenACC codes to a newer card and HPC SDK

Hello,

I have an existing OpenACC code that works fine on V100 and A100 GPUs; it was developed with an older version of the HPC SDK. The code is a mix of CUDA, OpenACC, and MPI. I am now developing the code on an RTX A5000, using HPC SDK 24.9.

The code compiles fine, but when I run it I get the following error message:

Accelerator Fatal Error: This file was compiled: -acc=gpu -gpu=cc80 -gpu=cc86
Rebuild this file with -gpu=cc86 to use NVIDIA Tesla GPU 0
Rebuild this file with -gpu=cc86 to use NVIDIA Tesla GPU 1
Rebuild this file with -gpu=cc86 to use NVIDIA Tesla GPU 2
 File: /home/gpu/xcode/lib/sys/inc/utils/proto.h
 Function: _Z7setvgpuIdEviiiT_PS0_i:163
 Line: 168

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[32479,1],0]
  Exit code:    1

I am compiling the code with the following options:

CCMP=nvc++
PCCMP=mpicxx
COPT=-fast -acc -Minfo=accel -gpu=cc86,nordc,cuda12.6 -Mcuda $(DEFS) -fPIC -Wall

The error points to a function in a header file “proto.h” at line 168. The function is defined as:

   template < typename type >
   void setvgpu( Int ist, Int ien, Int n, type val, type *sdata, Int nq )
   {
       Int i, iq;
       #pragma acc parallel loop gang vector \
        present(sdata[0:n*nq]) \
        default(none)
       for( iq = ist; iq < ien; iq++ )
       {
           for( i = 0; i < n; i++ )
           {
               sdata[ADDR(i,iq,nq)] = val;
           }
       }
   }

What really confuses me is that I am compiling with “-gpu=cc86”, yet the error says “Accelerator Fatal Error: This file was compiled: -acc=gpu -gpu=cc80 -gpu=cc86”.

I would be very grateful if anyone could offer some advice.

Many thanks,
Feng

Hi Feng,

Normally this error means that the binary wasn’t compiled for the target device, but here I think you’re encountering a regression in 24.9 when using the “nordc” option. It took a bit of effort, but I was able to reproduce the error in a few of our OpenACC C++ unit tests.

The error doesn’t occur in 24.7 or in the pre-release 24.11 (which will be released in the near future), so it looks like engineering has already found and fixed the problem.

So assuming that I’m reproducing your error correctly, you can either downgrade to 24.7, wait for 24.11, or remove “nordc” from your compiler flags.
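For the last option, based on the COPT line you posted, the change is just dropping “nordc” from the -gpu sub-options (a sketch; adjust to your actual build, and note RDC becomes the default):

```make
# With nordc (current):
# COPT=-fast -acc -Minfo=accel -gpu=cc86,nordc,cuda12.6 -Mcuda $(DEFS) -fPIC -Wall

# Without nordc (RDC, the default):
COPT=-fast -acc -Minfo=accel -gpu=cc86,cuda12.6 -Mcuda $(DEFS) -fPIC -Wall
```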

If you provide a reproducing example, I can test it against 24.11 to ensure your case is fixed as well.

-Mat

Hi Mat,

Thanks for your reply.

I have downgraded to 24.7 and my code is running, so there is something dodgy with 24.9. My code is a large C++ library, and I am not sure how I could make a simple reproducing example…

Also, when I compile with “-Mcuda”, I get the following warning message:

nvc++-Warning-The flag -Mcuda has been deprecated, please use -cuda and -gpu instead.

If I use “-cuda” instead, my code crashes. The error message is:

[feng:41560:0:41560] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10)
BFD: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
BFD: DWARF error: section .debug_info is larger than its filesize! (0x4dfa27 vs 0x3ed8f8)
BFD: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
BFD: DWARF error: section .debug_info is larger than its filesize! (0x4dfa27 vs 0x3ed8f8)
BFD: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
BFD: DWARF error: section .debug_info is larger than its filesize! (0x4dfa27 vs 0x3ed8f8)
BFD: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
BFD: DWARF error: section .debug_info is larger than its filesize! (0x4dfa27 vs 0x3ed8f8)
BFD: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
BFD: DWARF error: section .debug_info is larger than its filesize! (0x4dfa27 vs 0x3ed8f8)
BFD: /lib/x86_64-linux-gnu/libc.so.6: unknown type [0x13] section `.relr.dyn'
BFD: DWARF error: section .debug_info is larger than its filesize! (0x4dfa27 vs 0x3ed8f8)
==== backtrace (tid:  41560) ====
 0 0x0000000000045320 __sigaction()  ???:0
 1 0x000000000000f446 __pgi_uacc_cuda_enter()  /proj/build/24A/Linux_x86_64/rte/accel-uni/build/Linux_x86_64/../../src/cuda_enter.c:221
 2 0x00000000000c10a2 cFdDomain::resd()  /home/gpu/xcode/lib/org/src/domain/cfd/resd.cpp:34
 3 0x00000000000dc220 cFdDomain::smooth3()  /home/gpu/xcode/lib/org/src/domain/cfd/smooth.cpp:116
 4 0x00000000000bb90c cFdDomain::comp3()  /home/gpu/xcode/lib/org/src/domain/cfd/comp.cpp:155
 5 0x000000000004a429 cDom::compute()  /home/gpu/xcode/lib/org/src/device/worker/dom/dom.cpp:150
 6 0x0000000000042d2c cDevice::compute()  /home/gpu/xcode/lib/org/src/device/device.cpp:270
 7 0x000000000040af64 compute()  /home/gpu/xcode/dev/src/compute.cpp:88
 8 0x0000000000405d40 main()  /home/gpu/xcode/dev/src/main.cpp:54
 9 0x0000000000405d40 main()  /home/gpu/xcode/dev/src/main.cpp:57
10 0x000000000002a1ca __libc_init_first()  ???:0
11 0x000000000002a28b __libc_start_main()  ???:0
12 0x00000000004055d5 _start()  ???:0

The code it points to is:

  for( iv = 0; iv < nv; iv++ )
  {
      tmp = 0;
      #pragma acc parallel loop \
       present(srhs[0:nv*nq],this) \
       reduction(+:tmp) \
       default(none)
      for( iq = iqs; iq < iqe; iq++ )
      {
          //r1[iv] += abs( r[iv][iq] );
          tmp += abs( srhs[ADDR(iv,iq,nq)] );
      }
      r1[iv] = tmp;
  }

Is there a way to get rid of this warning message, and will “-Mcuda” be removed in a future release?

Many thanks,
Feng

No, the warning can’t be disabled. Engineering wants to make sure users know about these kinds of changes so that build systems aren’t broken once the flag is removed. I’m not sure exactly when they’ll remove it, but typically these removals happen in the first release of the year.

“__pgi_uacc_cuda_enter” is an initialization routine, so the error doesn’t have anything to do with your OpenACC code itself. It, along with other init routines, gets called when the first OpenACC construct is encountered. It does things like creating or attaching to a CUDA context, registering kernels, and initializing the device data.

The particular area where the error occurs is specific to nordc, where it loads the JIT-compiled PTX files. Why it’s failing, and why the error handler there isn’t catching it, I’m not sure. It might be related to the earlier issue.

Assuming final testing goes well over the weekend, 24.11 will be out soon, so we can wait until then and try again. If it still fails, I’ll need a reproducer to determine what’s wrong and to report it. The full application is fine, assuming you’re allowed to share it. If you don’t want to post it publicly, feel free to message me directly and we can arrange a way for you to get it to me.

Is there a reason you need to use nordc, such as using shared objects or interoperating with CUDA? If not, try compiling without it.
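For reference, the usual distinction looks something like this (a sketch with hypothetical file names; the point is that shared objects are what force nordc):

```shell
# Static archive: RDC (the default) works, because nvc++ performs the
# device link step when it links the final executable.
nvc++ -fast -acc -gpu=cc86,cuda12.6 -c foo.cpp -o foo.o
ar rcs libfoo.a foo.o
nvc++ -acc -gpu=cc86,cuda12.6 main.o libfoo.a -o app

# Shared object: nordc is needed, since there is no device link step
# for device code loaded from a .so at runtime.
nvc++ -fast -acc -gpu=cc86,nordc,cuda12.6 -fPIC -shared foo.cpp -o libfoo.so
```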

Hi Mat,

Thanks for your reply.

My code is organized as a collection of “.so” libraries, so I would need to use “nordc”. Based on your comments, if I reorganize the libraries as static ones (i.e. *.a), could that fix the issue?

The code is not small, but I think I could trim it down a bit and share it with you. How could I pass it to you?

Many thanks,
Feng

Coming back to this thread in the hope that it is useful to people having similar issues.

For my first problem, I needed to downgrade to HPC SDK 24.7. 24.11 might also work, but I have not tried it yet.

For my second problem, related to “-Mcuda” and “-cuda”: I only need to pass “-cuda” at the last stage, when I link the libraries and create the executable. This removes the warning about “-Mcuda”, and the code compiles with “-cuda” and runs with no issues.
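Concretely, the working arrangement looks something like this (file names are made up for illustration, and this reflects my build; yours may need -cuda at compile time too if a translation unit contains CUDA code):

```shell
# Compile stage: no -cuda / -Mcuda needed here
nvc++ -fast -acc -Minfo=accel -gpu=cc86,nordc,cuda12.6 -fPIC -c solver.cpp -o solver.o

# Link stage: add -cuda only here, when creating the executable
mpicxx -acc -gpu=cc86,nordc,cuda12.6 -cuda solver.o -o app
```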

Thanks again,
Feng
