OpenACC Region: Command exited with non-zero status 1

Hi,

When I compile and run a program on my computer with a GeForce 1660Ti, it works fine. Yet when I compile the same code on a remote computer with a Quadro GV100 (I only changed -ta=tesla:cc75 to cc70), it gives the following error:

Current file: /home/yunus/openacc/jacobi_acc.f90
function: main
line: 89
This file was compiled: -ta=tesla:cc70
Command exited with non-zero status 1
0.02user 0.00system 0:00.05elapsed 64%CPU (0avgtext+0avgdata 14184maxresident)k
0inputs+0outputs (0major+999minor)pagefaults 0swaps
make: *** [Makefile:10: jacobi_acc] Error 1

The code is

module generator

implicit none

contains
subroutine init_diag_dom_mat(A)
    
    real*4, intent(out), dimension(:,:) :: A
    integer :: i,j,nsize
    real*4 :: sum, x
    
    nsize = ubound(A,1)
    
    do i = 1, nsize
        sum = 0
        do j = 1, nsize
            call random_number(x)
            x = mod(x, 23.0) / 1000.0
            A(j,i) = x
            sum = sum + x
        end do
        
        A(i,i) = A(i,i) + sum
        
        ! in order make it like identity matrix 
        do j = 1, nsize
            A(j,i) = A(j,i) / sum
        end do
    end do
end subroutine
end module generator

program main

use generator
use omp_lib
implicit none

integer :: nsize, i, j, iters, max_iters, riter
real*4, allocatable :: A(:,:), b(:)
real*4, allocatable, target :: x1(:), x2(:)
real*4, pointer, contiguous :: xnew(:), xold(:), xtmp(:)
real*4 :: r, residual, rsum, dif, err, chksum

real*4, parameter :: TOLERANCE = 0.00000000000000000000000001
real*8 :: start_time, elapsed_time

nsize = 600

write(*,*) "nsize", nsize

! CONSTANTS--------------------------------------------------------
max_iters = 100000
riter = 10000000
! -----------------------------------------------------------------

allocate(A(nsize,nsize))
allocate(b(nsize), x1(nsize), x2(nsize))

! configuration of the matrix A
call init_diag_dom_mat(A)

! configuration of the vectors x1, x2, b
x1 = 0
x2 = 0
do i = 1, nsize
    call random_number(r)
    b(i) = mod(r, 51.0) / 100.0
end do 

residual = TOLERANCE + 1.0         ! + 1.0d0 is put to meet the while condition at the first step
iters = 0

! swap these in each iteration
xnew => x1
xold => x2

start_time = omp_get_wtime()

!$acc data copyin(A(:,:), b(:)) copy(x1(:), x2(:))
do while(residual > TOLERANCE .and. iters < max_iters)
    iters = iters + 1
    
    ! swap of input and output vectors
    xtmp => xnew
    xnew => xold
    xold => xtmp
    
    !$acc parallel loop private(rsum) async
    do i = 1, nsize
        rsum = 0
        !$acc loop reduction(+:rsum)
        do j = 1, nsize
            if ( i /= j ) rsum = rsum + A(j,i) * xold(j)
        end do
        xnew(i) = (b(i) - rsum) / A(i,i)
    end do
    
    residual = 0
    !$acc parallel loop reduction(+:residual) private(dif) async
    do i = 1, nsize
        dif = xnew(i) - xold(i)
        residual = residual + dif * dif
    end do
    !$acc wait
    residual = sqrt(residual)
    if( mod(iters, riter) == 0) write (*,*) "Iteration", iters, ", & residual is", residual
end do
!$acc end data
elapsed_time = omp_get_wtime() - start_time
write (*,*) "Converged after ", iters, " iterations"
write (*,*) "            and ", elapsed_time, " seconds"
write (*,*) "    residual is ", residual

deallocate(A, b, x1, x2)

end program main

and the Makefile is

FC=nvfortran
TIMER=/usr/bin/time
OPT=
NOPT=-fast -Minfo=opt $(OPT)

jacobi_acc: jacobi_acc.o
	$(TIMER) ./jacobi_acc.o $(STEPS)
jacobi_acc.o: jacobi_acc.f90
	$(FC) -o $@ $< $(NOPT) -ta:tesla:cc70 -Minfo=accel -acc

clean:
	rm -f *.o *.exe *.s *.mod a.out

Should I add any other command to run the code?

Thanks

Hi yunus.altintop.2,

I think the code is fine; this is more likely an issue with the remote system.

The error occurs at line 89, which is the first compute region, and seems to indicate that either the code was not compiled for this device or the runtime could not open libcuda.so. (It’s been a while, but I’ve seen cases where only the OpenCL driver was installed and not the CUDA driver.)

Can you post the output from the command “nvaccelinfo” (or “pgaccelinfo” if you’re using a PGI branded compiler)?

This will tell us if the compiler runtime can attach to the device, if there’s another device on the system, and what the CUDA driver version is.
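To check the libcuda.so possibility directly, one option is to look in the dynamic loader's cache. The sketch below assumes a glibc-based Linux system where `ldconfig -p` is available; `list_libcuda` is a hypothetical helper name, not part of any NVIDIA tool. The loader cache does not reflect LD_LIBRARY_PATH, so an empty result is a hint rather than proof.

```shell
#!/bin/sh
# Hypothetical helper: filter `ldconfig -p`-style text (on stdin) down to
# the libcuda.so entries and print the path each entry resolves to.
list_libcuda() {
    # Each cache line has the shape: "<name> (<flags>) => <path>"
    grep 'libcuda\.so' | sed 's/.*=> //'
}

# Usage on a real system (prints nothing if the CUDA driver library is
# not in the loader cache):
#   ldconfig -p | list_libcuda
```

If nothing is printed, the CUDA driver library is probably not installed in a standard location.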

-Mat

CUDA Driver Version: 11040
NVRM version: NVIDIA UNIX x86_64 Kernel Module
470.57.02 Tue Jul 13 16:14:05 UTC 2021

Device Number: 0
Device Name: Quadro GV100
Device Revision Number: 7.0
Global Memory Size: 34087305216
Number of Multiprocessors: 80
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1627 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 850 MHz
Memory Bus Width: 4096 bits
L2 Cache Size: 6291456 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device: Yes
Default Target: cc70

Device Number: 1
Device Name: Quadro P620
Device Revision Number: 6.1
Global Memory Size: 2095382528
Number of Multiprocessors: 4
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1354 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 2505 MHz
Memory Bus Width: 128 bits
L2 Cache Size: 524288 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device: Yes
Default Target: cc61

The installed tools are below. I am not sure whether this is helpful, but this is what I installed on the computer.



-Yunus

Hi Yunus,

Since you got output from nvaccelinfo, this means the CUDA driver installation is fine. So most likely the issue is that the device code is being run on the P620 rather than the GV100. Although the device enumeration is correct in the nvaccelinfo output, the CUDA driver may be reversing these on launch.

To test this theory, try setting the environment variable “CUDA_VISIBLE_DEVICES=1” to see if that fixes the issue.
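For reference, the variable read by the CUDA driver is spelled CUDA_VISIBLE_DEVICES (plural); a misspelled name is silently ignored. A minimal sketch of setting it for a single run follows. `run_on_device` is a hypothetical wrapper name, and the binary name is taken from the Makefile earlier in the thread.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with only one CUDA device visible.
# Note the variable name ends in "S": CUDA_VISIBLE_DEVICES.
run_on_device() {
    dev=$1
    shift
    # For an external program such as the compiled binary, the assignment
    # prefix applies to that one command only; the calling shell's
    # environment is left unchanged.
    CUDA_VISIBLE_DEVICES=$dev "$@"
}

# Usage:
#   run_on_device 0 ./jacobi_acc.o
#   run_on_device 1 ./jacobi_acc.o
```

Setting the variable per command avoids leaving a stale value exported in the shell between experiments.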

-Mat

I set the environment variable “CUDA_VISIBLE_DEVICE=1” but it still does not work; it gives the same error.

/usr/bin/time ./jacobi_acc.o
nsize 600
Current file: /home/yunus/openacc/jacobi_acc.f90
function: main
line: 89
This file was compiled: -ta=tesla:cc70
Command exited with non-zero status 1
0.02user 0.00system 0:00.03elapsed 96%CPU (0avgtext+0avgdata 14432maxresident)k
0inputs+0outputs (0major+1002minor)pagefaults 0swaps
make: *** [Makefile:10: jacobi_acc] Error 1

-Yunus

Hmm, ok, let’s try “CUDA_VISIBLE_DEVICES=0”. Also, maybe try compiling with “-ta=tesla:cc60,cc70”?

The error message means that the device binary wasn’t built for the device’s architecture, which is why I’m still leaning towards the problem being that the code is run on the wrong device. Also, you may want to look at the output from “nvidia-smi” so we can see what the CUDA driver device enumeration is.
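One way to compare enumerations is `nvidia-smi -L`, which prints one line per GPU. nvidia-smi orders devices by PCI bus ID, while the CUDA runtime by default orders them fastest-first, so the two lists can differ. The sketch below assumes the usual `GPU <n>: <name> (UUID: ...)` line shape; `gpu_index_names` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reduce `nvidia-smi -L`-style lines (on stdin) to
# "index name" pairs. Expected input shape: "GPU 0: <name> (UUID: ...)".
gpu_index_names() {
    sed -n 's/^GPU \([0-9][0-9]*\): \(.*\) (UUID: .*$/\1 \2/p'
}

# Usage on a real system:
#   nvidia-smi -L | gpu_index_names
# To ask the CUDA runtime to use the same PCI bus order for one run:
#   CUDA_DEVICE_ORDER=PCI_BUS_ID ./jacobi_acc.o
```

With CUDA_DEVICE_ORDER=PCI_BUS_ID set, the index passed to CUDA_VISIBLE_DEVICES should line up with the index shown by nvidia-smi.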

I tried with both CUDA_VISIBLE_DEVICE=1 and =0 for cc70, cc61, and cc60 but it gave the same error:

nvfortran -o jacobi_acc.o jacobi_acc.f90 -fast -Minfo=opt -ta:tesla:cc60 -Minfo=accel -acc
init_diag_dom_mat:
27, Zero trip check eliminated
main:
64, Memory zero idiom, loop replaced by call to __c_mzero4
65, Memory zero idiom, loop replaced by call to __c_mzero4
80, Generating copyin(a(z_b_0:z_b_1,z_b_3:z_b_4),b(:)) [if not already present]
Generating copy(x2(z_b_15:z_b_16),x1(z_b_11:z_b_12)) [if not already present]
89, Generating Tesla code
90, !$acc loop gang ! blockidx%x
93, !$acc loop vector(128) ! threadidx%x
Generating reduction(+:rsum)
89, Generating implicit copyin(xold(1:600)) [if not already present]
Generating implicit copyout(xnew(1:600)) [if not already present]
93, Loop is parallelizable
100, Generating Tesla code
101, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
Generating reduction(+:residual)
100, Generating implicit copyin(xold(1:600),xnew(1:600)) [if not already present]
Generating implicit copy(residual) [if not already present]
/usr/bin/time ./jacobi_acc.o
nsize 600
Current file: /home/yunus/openacc/jacobi_acc.f90
function: main
line: 89
This file was compiled: -ta=tesla:cc60
Command exited with non-zero status 1
0.02user 0.00system 0:00.03elapsed 96%CPU (0avgtext+0avgdata 14428maxresident)k
0inputs+0outputs (0major+1000minor)pagefaults 0swaps
make: *** [Makefile:10: jacobi_acc] Error 1

The result of NVIDIA-SMI is

Thanks for helping by the way,

-Yunus

Sorry, I’m not sure what’s happening then. Maybe try setting the environment variable “NVCOMPILER_ACC_DEBUG=1”. I’m not sure it will help determine the issue, but it’s worth a look to see what the runtime debug info says.

When I built and ran it after setting the environment variable, the output was:

ACC: device[1] is PGI native
pinitialize (threadid=1)
curr_devid for threadid=1 is 0
pgi_uacc_dataenterstart( file=/home/yunus/openacc/jacobi_acc.f90, function=main, line=34:34, line=80, devid=0,threadid=1 )
curr_devid for threadid=1 is 0
pgi_uacc_dataon(hostptr=0x7fb8dd45d020,stride=1,600,size=600x600,eltsize=4,lineno=80,name=a,flags=0x700=present+create+copyin,async=-1,threadid=1)
curr_devid for threadid=1 is 0
dataon - running on the host
pgi_uacc_dataon(hostptr=0xa50520,stride=1,size=600,eltsize=4,lineno=80,name=b,flags=0x700=present+create+copyin,async=-1,threadid=1)
curr_devid for threadid=1 is 0
dataon - running on the host
pgi_uacc_dataon(hostptr=0xa50eb0,stride=1,size=600,eltsize=4,lineno=80,name=x1,flags=0xf00=present+create+copyin+copyout,async=-1,threadid=1)
curr_devid for threadid=1 is 0
dataon - running on the host
pgi_uacc_dataon(hostptr=0xa51840,stride=1,size=600,eltsize=4,lineno=80,name=x2,flags=0xf00=present+create+copyin+copyout,async=-1,threadid=1)
curr_devid for threadid=1 is 0
dataon - running on the host
pgi_uacc_dataenterdone(devid=0,threadid=1)
pgi_uacc_enter( devid=0 )
curr_devid for threadid=1 is 0

Actually after setting “CUDA_VISIBLE_DEVICE=0”, the “ACC: device[1] is PGI native” does not change to device[0]. Is it a problem or something normal?

-Yunus

That’s normal, since this is Fortran, where devices are enumerated from 1 to N. The “devid=0” is the CUDA device enumeration.

The output is interesting, not for what’s shown, but for what’s missing. It should show the device initialization. Something like:

ACC: detected 4 CUDA devices
cuda_initdev thread:0 data.default_device_num:0 pdata.cuda.default_device_num:0
ACC: device[1] is NVIDIA CUDA device 0 compute capability 7.0
ACC: device[2] is NVIDIA CUDA device 1 compute capability 7.0
ACC: device[3] is NVIDIA CUDA device 2 compute capability 7.0
ACC: device[4] is NVIDIA CUDA device 3 compute capability 7.0
ACC: initialized 4 CUDA devices
ACC: device[5] is PGI native
pinitialize (threadid=1)
cuda_init_device thread:1 data.default_device_num:1 pdata.cuda.default_device_num:1
cuda_init_device(threadid=1, device 0) dindex=1, api_context=(nil)
cuda_init_device(threadid=1, device 0) dindex=1, setting api_context=(nil)
cuda_init_device(threadid=1, device 0) dindex=1, new api_context=0xf91450
argument memory for queue 32 device:0x155237a00000 host:0x155237c00000

For some reason, the runtime isn’t detecting the CUDA devices and is instead using the host (i.e. “PGI native”). The same code used by nvaccelinfo is used here, though, so I can’t explain why. If nvaccelinfo works, this should as well. Very odd.
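The check described above (are there any "NVIDIA CUDA device" lines, or only "PGI native"?) can be scripted. This is a minimal sketch: `acc_log_has_cuda_device` is a hypothetical helper name, and the search string is taken from the sample debug output quoted above, so it may differ between compiler versions.

```shell
#!/bin/sh
# Hypothetical helper: given NVCOMPILER_ACC_DEBUG output on stdin, succeed
# only if the runtime reported at least one NVIDIA CUDA device.
acc_log_has_cuda_device() {
    grep -q 'is NVIDIA CUDA device'
}

# Usage: the variable is read when the binary runs, not when it is
# compiled. Both stdout and stderr are captured here because it is not
# assumed which stream the runtime writes its debug text to.
#   NVCOMPILER_ACC_DEBUG=1 ./jacobi_acc.o > acc_debug.log 2>&1
#   acc_log_has_cuda_device < acc_debug.log || echo "no CUDA device detected"
```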

I think at this point, I’ll need to ask one of our compiler engineers for ideas, but he’s out today so it may not be till next week.

My only other thought is to run your binary through strace, see which libcuda.so (i.e. the CUDA driver) is being used, and compare it to what’s used by nvaccelinfo. Maybe it’s picking up a different one?

% strace a.out >& log
% grep libcuda.so log
openat(AT_FDCWD, "/proj/nv/Linux_x86_64/dev/compilers/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/proj/nv/Linux_x86_64/dev/compilers/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/proj/nv/Linux_x86_64/dev/comm_libs/openmpi4/openmpi-4.0.5/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libcuda.so", O_RDONLY|O_CLOEXEC) = 3

% strace nvaccelinfo >& log
% grep libcuda.so log
openat(AT_FDCWD, "/proj/nv/Linux_x86_64/dev/compilers/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/proj/nv/Linux_x86_64/dev/comm_libs/openmpi4/openmpi-4.0.5/lib/libcuda.so", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/libcuda.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "ib/x86_64-linux-gnu/libcuda.so.4"..., 1024) = 1024
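The commands above use csh redirection. In bash or sh, an equivalent comparison could look like the sketch below. `opened_libcuda` is a hypothetical helper name; it keeps only the openat calls on libcuda.so that returned a file descriptor rather than -1.

```shell
#!/bin/sh
# Hypothetical helper: from strace output on stdin, print the libcuda.so
# paths that were opened successfully (the call did not return -1).
opened_libcuda() {
    grep 'openat(.*libcuda\.so' \
        | grep -v '= -1 ' \
        | sed -n 's/^[^"]*"\([^"]*\)".*/\1/p'
}

# Usage (strace writes its trace to stderr, hence 2>&1; -f follows child
# processes and -e trace=openat keeps only openat calls):
#   strace -f -e trace=openat ./jacobi_acc.o 2>&1 | opened_libcuda
#   strace -f -e trace=openat nvaccelinfo    2>&1 | opened_libcuda
```

If the two commands print different paths, the binary and nvaccelinfo are loading different driver libraries.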

Hi,

When I compile and run the code as root instead of as a regular user, it gives the correct result.

/usr/bin/time ./jacobi_acc.o
nsize 600
Converged after 100000 iterations
and 2.369217157363892 seconds
residual is 1.0496981E-07
2.15user 0.19system 0:02.52elapsed 93%CPU (0avgtext+0avgdata 233680maxresident)k
10080inputs+0outputs (7major+39543minor)pagefaults 0swaps

So it seems to be a permissions issue. Is there some setting that controls which users are allowed to use the GPUs?

-Yunus

Interesting. I’ll need to remember this if others encounter the same issue.

What’s the permission on the Linux device files, i.e. “ls -l /dev/nvidia*”?

As a regular user, it is:

crw-rw----+ 1 root video 195, 0 Sep 23 11:15 /dev/nvidia0
crw-rw----+ 1 root video 195, 1 Sep 23 11:15 /dev/nvidia1
crw-rw----+ 1 root video 195, 255 Sep 23 11:15 /dev/nvidiactl
crw-rw----+ 1 root video 195, 254 Sep 23 11:15 /dev/nvidia-modeset
crw-rw-rw-+ 1 root root 237, 0 Sep 23 11:15 /dev/nvidia-uvm
crw-rw-rw-+ 1 root root 237, 1 Sep 23 11:15 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
cr-------- 1 root root 240, 1 Sep 23 11:29 nvidia-cap1
cr--r--r-- 1 root root 240, 2 Sep 23 11:29 nvidia-cap2

Actually, I get the same result as root.

-Yunus

This could be the issue, since users not in the ‘video’ group don’t have access to the devices. Try setting (as root) the permissions to include world read-write access (e.g. “chmod 666 /dev/nvidia0”).

Note that this will most likely get clobbered after a reboot, so you may need to add this to an rc file.
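Before and after changing permissions, it can help to confirm what a given user can actually access. The sketch below assumes GNU coreutils (`id`, `stat -c`); `user_in_group` and `world_rw` are hypothetical helper names. The trailing `+` in the `ls -l` output above also indicates that ACLs are set on the device files, which `getfacl /dev/nvidia0` would list if the acl tools are installed.

```shell
#!/bin/sh
# Hypothetical helpers for checking access to a device node.

# Succeeds if user $1 is a member of group $2.
user_in_group() {
    id -nG "$1" | tr ' ' '\n' | grep -qx -- "$2"
}

# Succeeds if "other" users have both read and write permission on $1.
# Uses GNU stat; %a prints the octal mode, whose last digit covers "other".
world_rw() {
    mode=$(stat -c '%a' "$1") || return 1
    case "$mode" in
        *6|*7) return 0 ;;
        *)     return 1 ;;
    esac
}

# Usage:
#   user_in_group "$USER" video || echo "$USER is not in the video group"
#   world_rw /dev/nvidia0       || echo "/dev/nvidia0 is not world read-write"
# A narrower alternative to chmod 666 (run as root; takes effect at the
# user's next login):
#   usermod -aG video <username>
```

Adding the user to the group that owns the device files keeps the devices closed to everyone else, at the cost of needing a fresh login.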

Okay I will try it. It seems it is the issue.
Thanks.

-Yunus

It did not solve the problem either; it still gives the same error.

Ok, though this is starting to get beyond my area of expertise. Let me ask our IT folks for ideas.

Thanks