OpenACC: command exited with non-zero status 1

Hello,

I have been using the nvfortran compiler for my code for a while. However, after a system update I ran yesterday from the terminal (sudo apt-get update), the codes that are compiled with OpenACC flags started giving errors. Also, when I run nvidia-smi in the terminal, it gives the message “NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.” (I do not know whether it is relevant.) I could not figure out a way to solve the problem.

Here is a sample code that had been working until yesterday:

module generator

implicit none

contains
subroutine init_diag_dom_mat(A)
    
    real*8, intent(out), dimension(:,:) :: A
    integer :: i,j,nsize
    real*8 :: sum, x
    
    nsize = ubound(A,1)
    
    do i = 1, nsize
        sum = 0
        do j = 1, nsize
            call random_number(x)
            x = mod(x, 23.0d0) / 1000.0d0
            A(j,i) = x
            sum = sum + x
        end do
        
        A(i,i) = A(i,i) + sum
        
        ! scale the column so that the matrix resembles the identity matrix
        do j = 1, nsize
            A(j,i) = A(j,i) / sum
        end do
    end do
end subroutine

end module generator

program main
use generator
use omp_lib
implicit none

integer :: nsize, i, j, iters, max_iters, riter
real*8, allocatable :: A(:,:), b(:)
real*8, allocatable, target :: x1(:), x2(:)
real*8, pointer, contiguous :: xnew(:), xold(:), xtmp(:)
real*8 :: r, residual, rsum, dif, err, chksum

real*8, parameter :: TOLERANCE = 1.0d-8
real*8 :: start_time, elapsed_time

nsize = 10
max_iters = 1000000000
riter = 10000000

allocate(A(nsize,nsize))
allocate(b(nsize), x1(nsize), x2(nsize))

! configuration of the matrix A
call init_diag_dom_mat(A)

! configuration of the vectors x1, x2, b
x1 = 0
x2 = 0
do i = 1, nsize
    call random_number(r)
    b(i) = mod(r, 51.0d0) / 100.0d0
end do 

start_time = omp_get_wtime()

residual = TOLERANCE + 1.0d0         ! start above TOLERANCE so the while loop runs at least once
iters = 0

! swap these in each iteration
xnew => x1
xold => x2

!$acc data copyin(A(:,:), b(:)) copy(x1(:), x2(:))
do while(residual > TOLERANCE .and. iters < max_iters)
    iters = iters + 1
    
    ! swap of input and output vectors
    xtmp => xnew
    xnew => xold
    xold => xtmp
    
    !$acc parallel loop private(rsum) async
    do i = 1, nsize
        rsum = 0
        !$acc loop reduction(+:rsum)
        do j = 1, nsize
            if ( i /= j ) rsum = rsum + A(j,i) * xold(j)
        end do
        xnew(i) = (b(i) - rsum) / A(i,i)
    end do
    
    residual = 0
    !$acc parallel loop reduction(+:residual) private(dif) async
    do i = 1, nsize
        dif = xnew(i) - xold(i)
        residual = residual + dif * dif
    end do
    !$acc wait
    residual = sqrt(residual)
    if( mod(iters, riter) == 0) write (*,*) "Iteration", iters, ", residual is", residual
end do
!$acc end data
elapsed_time = omp_get_wtime() - start_time
write (*,*) "Converged after ", iters, " iterations"
write (*,*) "            and ", elapsed_time, " seconds"
write (*,*) "    residual is ", residual

deallocate(A, b, x1, x2)

end program main

This is the Makefile:

FC=nvfortran
TIMER=/usr/bin/time
OPT=
NOPT=-fast -Minfo=opt $(OPT)

jacobi_acc: jacobi_acc.o
	$(TIMER) ./jacobi_acc.o $(STEPS)
jacobi_acc.o: jacobi_acc.f90
	$(FC) -o $@ $< $(NOPT) -ta:tesla:cc75 -Minfo=accel

clean:
	rm -f *.o *.exe *.s *.mod a.out

When I compile and run it, it gives the error:

/usr/bin/time ./jacobi_acc.o
Current file: /home/yunus/Desktop/Parallel/jacobi_acc/jacobi_acc.f90
function: main
line: 84
This file was compiled: -ta=tesla:cc70,cc75
Command exited with non-zero status 1
0.03user 0.03system 0:00.09elapsed 77%CPU (0avgtext+0avgdata 10008maxresident)k
0inputs+0outputs (0major+3330minor)pagefaults 0swaps
make: *** [Makefile:10: jacobi_acc] Error 1

Line 84 is !$acc parallel loop private(rsum) async.

In another code that works fine with nvfortran without OpenACC flags, the errors with OpenACC flags are:

/usr/bin/ld: /home/yunus/Desktop/000QAGE/OBJ/Modules/mod_fem.o: in function `mod_fem_copy2device3_':
/home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:29: undefined reference to `__pgi_uacc_dataenterstart2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:29: undefined reference to `__pgi_uacc_dataonb'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:29: undefined reference to `__pgi_uacc_dataenterdone'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:30: undefined reference to `__pgi_uacc_dataenterstart2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:30: undefined reference to `__pgi_uacc_dataonb'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:30: undefined reference to `__pgi_uacc_dataenterdone'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:32: undefined reference to `__pgi_uacc_dataenterstart2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:32: undefined reference to `__pgi_uacc_dataonb'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:32: undefined reference to `__pgi_uacc_dataenterdone'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/OBJ/Modules/mod_fem.o: in function `mod_fem_copy2host3_':
/home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:41: undefined reference to `__pgi_uacc_dataexitstart2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:41: undefined reference to `__pgi_uacc_dataoffb2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:41: undefined reference to `__pgi_uacc_dataexitdone'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:43: undefined reference to `__pgi_uacc_dataexitstart2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:43: undefined reference to `__pgi_uacc_dataoffb2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:43: undefined reference to `__pgi_uacc_dataexitdone'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:44: undefined reference to `__pgi_uacc_dataexitstart2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:44: undefined reference to `__pgi_uacc_dataoffb2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:44: undefined reference to `__pgi_uacc_dataexitdone'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/OBJ/Modules/mod_fem.o: in function `mod_fem_copy2device2_':
/home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:51: undefined reference to `__pgi_uacc_dataenterstart2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:51: undefined reference to `__pgi_uacc_dataonb'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:51: undefined reference to `__pgi_uacc_dataenterdone'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:52: undefined reference to `__pgi_uacc_dataenterstart2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:52: undefined reference to `__pgi_uacc_dataonb'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:52: undefined reference to `__pgi_uacc_dataenterdone'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/OBJ/Modules/mod_fem.o: in function `mod_fem_copy2host2_':
/home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:59: undefined reference to `__pgi_uacc_dataexitstart2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:59: undefined reference to `__pgi_uacc_dataoffb2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:59: undefined reference to `__pgi_uacc_dataexitdone'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:60: undefined reference to `__pgi_uacc_dataexitstart2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:60: undefined reference to `__pgi_uacc_dataoffb2'
/usr/bin/ld: /home/yunus/Desktop/000QAGE/SRC/Modules/mod_fem.f90:60: undefined reference to `__pgi_uacc_dataexitdone'
make[1]: *** [Makefile:28: all] Error 2

Also, pgaccelinfo | less says “No accelerators found.”

Thanks,
Y

You have two different issues here. The first one has to do with the NVIDIA kernel driver not getting loaded (so you can’t run the OpenACC version on GPU), and the second one has to do with the compilation using nvfortran.

I’ve run into the second one in the past. The linker error messages that begin with

undefined reference to __pgi_uacc

are due to the OpenACC runtime functions being renamed from pgi to nv, which happened around the release of either NVHPC 20.9 or 20.11. Object files built before the rename still reference the old names, so the solution was surprisingly simple: make sure to run make clean before recompiling!
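
If you want to see which object files are stale before cleaning, here is a rough sketch. It assumes bash and GNU binutils (nm) are available; the function name find_stale_objects is made up for illustration and is not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list object files under a build tree that still
# expect the old __pgi_uacc_* OpenACC runtime symbols.
find_stale_objects() {
    local root="${1:-.}"
    find "$root" -type f -name '*.o' -print0 |
        while IFS= read -r -d '' obj; do
            # nm -u prints the undefined symbols the linker must resolve.
            if nm -u "$obj" 2>/dev/null | grep -q '__pgi_uacc_'; then
                printf '%s\n' "$obj"
            fi
        done
}

# Example (path is a placeholder): find_stale_objects ./OBJ
```

Anything it prints was compiled before the rename and needs to be rebuilt; make clean followed by a full rebuild covers all of them at once.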

Should I reinstall or update the NVIDIA driver in order to solve the first problem?

For the first problem, I’m guessing that the NVIDIA driver package somehow got uninstalled the last time you updated your system. Since you mentioned apt… which distro are you using (Ubuntu? Debian? Linux Mint?), and what version?

It is Ubuntu 20.04.2 LTS.

Can you check how the NVIDIA graphics driver is installed on your system: is it through the Ubuntu repos or through the NVIDIA repos? There are several ways to check this (opening “Additional Drivers”, running dpkg -l nvidia-driver-* in a terminal window, checking the contents of /etc/apt/sources.list to see if the NVIDIA repos are listed there, etc.).

If you installed the CUDA Toolkit from NVIDIA’s packages, then the driver most likely came the second way (NVIDIA repos).

Also, which NVIDIA GPU do you have? Different driver versions support different ranges of devices, and this might be important (you can’t go past a certain driver version if you happen to have an older GPU, such as the GeForce 8 or 9 series… I forgot the exact number).

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-====================-==================-============-=================================
ii nvidia-driver-455 455.32.00-0ubuntu1 amd64 NVIDIA driver metapackage
un nvidia-driver-binary (no description available)

The version of CUDA is:

nvfortran 20.9-0 LLVM 64-bit target on x86-64 Linux -tp haswell
NVIDIA Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.

I have NVIDIA Corporation TU116M [GeForce GTX 1660 Ti Mobile]

Based on the info you posted, it looks like the NVIDIA driver version 455 is installed from Ubuntu repos (notice the version string having ubuntu in it). That’s totally fine for Turing (GTX 16 series).

Is the package nvidia-dkms-455 also installed? That might be the missing link if it failed to compile for your kernel.

20.9 is the version of nvfortran, not the CUDA toolkit, but if I remember correctly, NVHPC 20.9 comes with CUDA 10.1. But let’s not worry about that until you can be sure that the kernel module is there (so that nvidia-smi and pgaccelinfo work again).

nvidia-dkms-455 is installed:

Package: nvidia-dkms-455
Status: install ok installed
Priority: optional
Section: non-free/libs
Installed-Size: 129
Maintainer: Ubuntu Core Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Multi-Arch: foreign
Source: nvidia-graphics-drivers-455
Version: 455.32.00-0ubuntu1
Replaces: nvidia-384 (<< 390.25), nvidia-dkms-kernel, nvidia-kernel-source-455 (<< 390.25-0ubuntu2~)
Provides: nvidia-dkms-kernel
Depends: dkms, nvidia-kernel-source-455, nvidia-kernel-common-455
Breaks: nvidia-kernel-source-455 (<< 390.25-0ubuntu2~)
Conflicts: nvidia-dkms-kernel
Description: NVIDIA DKMS package
This package builds the NVIDIA kernel module needed by the userspace
driver, using DKMS.
Provided that you have the kernel header packages installed, the kernel
module will be built for your running kernel, and automatically rebuilt for
any new kernel headers that are installed.

Hmm, so the files are there. Can you post the output of lsmod | grep nvidia to verify that the kernel modules are loaded? If not, we’ve hit the jackpot – you need to re-install the NVIDIA driver, ensure the kernel headers are installed, ensure the kernel modules are compiled properly, and nothing is preventing them from being loaded (e.g. check for blacklisting in /etc/modprobe.d/; proper usage/avoidance of nomodeset).

It is i2c_nvidia_gpu 16384 0

There is no additional info. That is all that lsmod | grep nvidia prints.

That confirms that the kernel modules necessary for CUDA and OpenACC operations are not loaded. If they were loaded, you would see more entries, such as nvidia_uvm or nvidia_drm.
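
For a scriptable version of the same check, here is a minimal sketch. It assumes a POSIX shell with awk; the function name nvidia_module_loaded is hypothetical. It reads lsmod-style text on stdin and succeeds only if a module named exactly nvidia appears in the first column.

```shell
# Hypothetical helper: succeed only if the core "nvidia" kernel module
# is listed in the first column of lsmod-style input on stdin.
nvidia_module_loaded() {
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# Fed the single line quoted above, this takes the "not loaded" branch,
# because i2c_nvidia_gpu is a different module.
if printf 'i2c_nvidia_gpu 16384 0\n' | nvidia_module_loaded; then
    echo "nvidia module loaded"
else
    echo "nvidia module not loaded"
fi
```

On a live system you would pipe the real list in instead: lsmod | nvidia_module_loaded.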

Now, can you check:

  • if you have the kernel headers installed? If not, sudo apt install linux-headers-generic should fix that.
  • the output of dkms status? If it’s empty, then you can try sudo dpkg-reconfigure nvidia-dkms-455 with kernel headers installed to try to rebuild the kernel modules.
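
As a quick way to do the header check, here is a sketch. It assumes the usual layout in which /lib/modules/<release>/build points at the installed headers; the function name headers_present and its optional second argument are made up for illustration.

```shell
# Hypothetical helper: succeed if the kernel build directory (where DKMS
# compiles modules) exists for the given kernel release.
# $1: kernel release (defaults to the running kernel)
# $2: modules root (defaults to /lib/modules; overridable for testing)
headers_present() {
    release="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"
    # A missing headers package typically leaves "build" absent or dangling.
    [ -e "$root/$release/build/Makefile" ]
}

if headers_present; then
    echo "kernel headers found for $(uname -r)"
else
    echo "kernel headers missing for $(uname -r)"
fi
```

If it reports the headers as missing, install them first and only then retry the dpkg-reconfigure step.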

I installed the headers, but reconfiguring the package gave this error:

Removing all DKMS Modules
Done.
update-initramfs: deferring update (trigger activated)

A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau
from loading. This can be reverted by deleting the following file:
/etc/modprobe.d/nvidia-graphics-drivers.conf

A new initrd image has also been created. To revert, please regenerate your
initrd by running the following command after deleting the modprobe.d file:
/usr/sbin/initramfs -u


*** Reboot your computer and verify that the NVIDIA graphics driver can ***
*** be loaded. ***


INFO:Enable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
Loading new nvidia-455.32.00 DKMS files…
Building for 5.11.0-25-generic
Building for architecture x86_64
Building initial module for 5.11.0-25-generic
ERROR: Cannot create report: [Errno 17] File exists: ‘/var/crash/nvidia-dkms-455.0.crash’
Error! Bad return status for module build on kernel: 5.11.0-25-generic (x86_64)
Consult /var/lib/dkms/nvidia/455.32.00/build/make.log for more information.

I see you’re using the Hardware Enablement (HWE) kernel, which was recently upgraded to version 5.11. The first NVIDIA driver version that is compatible with this kernel version is 460, according to this Phoronix article. Try upgrading to that driver version using sudo apt install nvidia-driver-460.

To summarize, what happened was the HWE kernel update from 5.8 to 5.11 broke the NVIDIA kernel driver, version 455.
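
The mismatch can be written down as a small check. This is only a sketch: the one rule it encodes (kernel 5.11 or newer needs driver 460 or newer) comes from this thread and is not a complete compatibility table, the function names are hypothetical, and it assumes GNU sort with -V support.

```shell
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumption from this thread only: kernels >= 5.11 need driver >= 460.
# $1: kernel release (e.g. from uname -r), $2: NVIDIA driver version.
driver_ok_for_kernel() {
    if version_ge "$1" "5.11" && ! version_ge "$2" "460"; then
        return 1
    fi
    return 0
}

# The combination from this thread: kernel 5.11.0-25-generic, driver 455.32.00.
if driver_ok_for_kernel "5.11.0-25-generic" "455.32.00"; then
    echo "driver should build for this kernel"
else
    echo "driver too old for this kernel"
fi
```

With the older 5.4 kernel, the same driver version passes the check, which matches the setup that worked before the update.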

It gives back:

Reading package lists… Done
Building dependency tree
Reading state information… Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
nvidia-driver-460 : Depends: libnvidia-gl-460 (= 460.91.03-0ubuntu0.20.04.1) but it is not going to be installed
Depends: libnvidia-extra-460 (= 460.91.03-0ubuntu0.20.04.1) but it is not going to be installed
Depends: libnvidia-decode-460 (= 460.91.03-0ubuntu0.20.04.1) but it is not going to be installed
Depends: libnvidia-encode-460 (= 460.91.03-0ubuntu0.20.04.1) but it is not going to be installed
Depends: xserver-xorg-video-nvidia-460 (= 460.91.03-0ubuntu0.20.04.1) but it is not going to be installed
Depends: libnvidia-cfg1-460 (= 460.91.03-0ubuntu0.20.04.1) but it is not going to be installed
Depends: libnvidia-ifr1-460 (= 460.91.03-0ubuntu0.20.04.1) but it is not going to be installed
Recommends: libnvidia-decode-460:i386 (= 460.91.03-0ubuntu0.20.04.1)
Recommends: libnvidia-encode-460:i386 (= 460.91.03-0ubuntu0.20.04.1)
Recommends: libnvidia-ifr1-460:i386 (= 460.91.03-0ubuntu0.20.04.1)
Recommends: libnvidia-fbc1-460:i386 (= 460.91.03-0ubuntu0.20.04.1)
Recommends: libnvidia-gl-460:i386 (= 460.91.03-0ubuntu0.20.04.1)
E: Unable to correct problems, you have held broken packages.

Also in the installation of the headers, it says that:

The following packages were automatically installed and are no longer required:
libllvm11 linux-headers-5.4.0-58 linux-headers-5.4.0-58-generic shim

Looks like manual installation with apt is gonna get real messy. You can revert the broken packages with sudo apt -f install.

Since now we know what’s going on, I think you have two choices:

  • Keep the HWE Linux kernel version at 5.11, and upgrade the NVIDIA driver to 460 or newer. Try using “Additional Drivers” for this approach.
  • Keep the NVIDIA driver version at 455, and downgrade the Linux kernel to non-HWE (also called GA or general availability) version 5.4. Use the instructions at the Ubuntu wiki.

Good luck!

Thanks a lot! I will go with the first. Have a nice day.

Hi wyphan, I am having the second problem. Specifically, I am getting undefined symbol: __pgi_uacc_dataoffb2 when importing my f2py module in Python. Could you specify where I should perform make clean? Thank you!

Most likely you need to reinstall the f2py package in Python. If using pip, use the commands posted here, and if using conda, use this, substituting f2py for numpy.