Catching signals in nvfortran applications

I have a large legacy codebase for which I’m trying to enable signal catching, for fault tolerance in the cloud. I have found several examples that compile and work with gfortran, but I repeatedly get segmentation faults when I send a SIGINT (Ctrl-C) to a binary compiled with nvfortran.

I found an older example in the old PGI section, and it almost works. However, it seems to have a race condition. If I hit Ctrl-C within the first few seconds, I get a segmentation fault, the same as with my legacy codebase. If I send Ctrl-C after it has been running for a few seconds, the signal is caught and the program exits as expected.

!nvfortran signal_test.f

      PROGRAM SIGTEST
      IMPLICIT NONE
      INTEGER IRETW, SIGINT, SIGNAL
      EXTERNAL PROCEDURE

      SIGINT=2
      IRETW = SIGNAL(SIGINT, PROCEDURE)
      WRITE(*,*) 'Press Ctrl-C to exit'
      
      DO WHILE(.true.)
      END DO

      END

      SUBROUTINE PROCEDURE()
      WRITE(*,*) 'Closing Programs'
      STOP
      RETURN

      END SUBROUTINE

I’ve also found that if I add “implicit none” as the second line of the subroutine, it segfaults 100% of the time. When it does segfault, I get the following:

# ./a.out
 Press Ctrl-C to exit
^CError: segmentation violation, address not mapped to object
Attaching to process 863
ptrace: Operation not permitted

And if I run it within gdb:

(cuda-gdb) signal 2
Continuing with signal SIGINT.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000401394 in procedure () at signal_test.f:18
18	      WRITE(*,*) 'Signal ', SIGNUM, ' received'
(cuda-gdb) bt
#0  0x0000000000401394 in procedure () at signal_test.f:18
#1  <signal handler called>
#2  sigtest () at signal_test.f:11

This is similar (albeit with different addresses) to the backtrace from the legacy code. Perhaps the compiler is missing a link to the external subroutine for “Signal()”?

Thanks in advance.

Hi Eric20,

Keep in mind that “signal” is a GNU extension and, as far as I’m aware, there’s no standard way to perform signal handling in Fortran. We do support “signal” via our old F77 “lib3f” interface, which is basically just a wrapper around the C “signal” call.

The key here is that you need to set the environment variable “PGI_TERM” to “signal”. PGI_TERM tells the runtime how it should handle exceptions; the value “signal” says to use the user’s signal handler rather than the runtime’s default.

For example:

% nvfortran signal_test.f ; a.out
 Press Ctrl-C to exit
^CSegmentation fault (core dumped)
% export PGI_TERM=signal
% a.out
 Press Ctrl-C to exit
^C Closing Programs
FORTRAN STOP

Hope this helps,
Mat

Mat,

Thanks for the reply. Unfortunately, the PGI_SIGNAL env variable didn’t seem to affect behavior. I did more experimentation and also found some old documentation that helped figure out the behavior.

Example 1

CALL SIGNAL(SIGINT, PROC, -1)

or Example 2

INTEGER IRETW, SIGINT, SIGNAL
SIGINT=2
IRETW = SIGNAL(SIGINT, PROC,-1)

Key points:

  • “SIGNAL” in the function format (second example) needs to be declared an INTEGER.
  • The function format requires a return value, even if you don’t need it.
  • Most importantly, both need a “-1” flag passed to the call/function. It appears not to be optional in nvfortran, and omitting it leads to strange behavior and segmentation faults.

Support for the flag seems to vary by compiler: gfortran does not compile it, and other compilers seem to make it optional.

I hope this helps someone in the future.

Eric

Hi Eric,

I’m not sure if this is a typo, but the env var is “PGI_TERM=signal” not “PGI_SIGNAL”.

As shown above, your example works correctly for me when using PGI_TERM.

For the implementation of signal, the third “flag” argument is optional. Passing “-1” is the same as not passing a flag argument, since the three-argument call to the C signal routine is only made when the flag is greater than or equal to 0. Setting PGI_TERM=signal is still required even when passing a flag, since this changes the runtime behavior from using the runtime’s signal handler to using the program’s signal handler.

-Mat

Mat,

Yes, that variable was a typo. To rule out something on my end, I created a new Ubuntu 20.04 VM in GCP, installed an NVIDIA HPC SDK 23.11 Docker container, and compiled there. I copied and pasted the code above verbatim. I still get faults.

NVIDIA HPC SDK version 23.11
 
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
Use 'docker run --gpus all' to start this container; see
https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(Native-GPU-Support)

root@2792fa96b793:/# nvfortran signal_test.f
root@2792fa96b793:/# ./a.out
 Press Ctrl-C to exit
^CSegmentation fault (core dumped)
root@2792fa96b793:/# export PGI_TERM=signal
root@2792fa96b793:/# ./a.out
 Press Ctrl-C to exit
^CError: segmentation violation, address not mapped to object
root@2792fa96b793:/# 

If I add the “-1” flag, it works every time, even without the PGI_TERM variable.

root@2792fa96b793:/# ./a.out
 Press Ctrl-C to exit
^C Closing Programs
FORTRAN STOP
root@2792fa96b793:/# unset PGI_TERM
root@2792fa96b793:/# ./a.out
 Press Ctrl-C to exit
^C Closing Programs
FORTRAN STOP

Edit: For good measure, I tried an Intel VM, as the other VM and my local machine are both AMD. This one doesn’t segfault, but it also doesn’t call the PROCEDURE subroutine unless I add the “-1” flag.

root@4e35539e538c:/# nvfortran signal_test.f
root@4e35539e538c:/# ./a.out
 Press Ctrl-C to exit
^C
root@4e35539e538c:/# export PGI_TERM=signal
root@4e35539e538c:/# ./a.out
 Press Ctrl-C to exit
^C

With the “-1” flag added:

root@4e35539e538c:/# vi signal_test.f
root@4e35539e538c:/# nvfortran signal_test.f
root@4e35539e538c:/# ./a.out
 Press Ctrl-C to exit
^C Closing Programs
FORTRAN STOP
root@4e35539e538c:/# 

PS Edit:
The code works on arm64 without the -1 flag, with or without PGI_TERM set.