Suggestions needed to debug a weird segfault

Hi, I recently upgraded to pgc++ under nvhpc 20.9. A large code base that used to (and still does with 19.10 on Linux) run fine had a very weird segfault error. Sadly I tried to pull this piece out of the code base and I cannot reproduce the segfault like what I did to report compiler bugs in the past. So I’ll try my best to describe what I’ve seen.

There is only one CPU thread running, and OpenACC kernels are running on async queue 0. The segfault happened on the second time the program was at line “h (dt, a)” – can’t step in h (dt, a).

void g (double dt, double& a) {
   f (a);
   h (dt, a);
}

void f (double& a) {
   f_wrapper2 (a, A_FEW_MORE_ARGS);
}

void f_wrapper2 (double& a, A_FEW_MORE_ARGS) {
   f_acc (a, A_FEW_MORE_ARGS);
}

where f_acc is extremely simple – almost does nothing.

void f_acc (double& a, A_FEW_MORE_ARGS) {
#pragma acc wait(0)
}

One more thing I noticed was when I put “watch *(double*)0x7fffffffd2c8” which was the address of variable “dt” in gdb, its value was changed (should have been 0.001) immediately after the second time f_acc was called. I later tracked down to this symbol

0x7fffdc6c09dd callq 0x7fffdc67e5d0 <__pgi_uacc_cuda_wait@plt>__pgi_uacc_wait+423

that the value of dt was changed inside this call. The final message given by gdb was something like

received signal SIGSEGV, Segmentation fault.
g (dt=0.4816228645377123, a=<error reading variable>)

This is what I can provide at this time and I would appreciate your suggestions. Thanks in advance.

-stw

Hi stw,

Difficult to tell what going on, but I do have some questions.

Doesn’t quite make sense why the segv would occur in the wait call, but this may be a red-herring. Since you have a wait, I presume you’re using async? Maybe the segv is occurring on a different thread and you’re only seeing it in the debugger as occurring in wait? What happens if you disable asynchronous behavior by setting the environment variable " NVCOMPILER_ACC_SYNCHRONOUS=1"?

If that happens to fix the segv, then there may be a race condition in the code. If not, rerun in the debugger to see if the segv shows up in a different section.

I do find it odd that you say that the “dt” variable’s value changes. Since “dt” is being passed by value, it’s stored on the stack. Could the stack be getting corrupted? Are you getting a stack overflow? What happens if you set your shell’s stack size to “unlimited”?

-Mat

Thank you Mat.

Let me clarify a few points first.

  1. The program has both CUDA and OpenACC code, where in CUDA we use non-zero CUDA stream so that at least in my understanding, setting CUDA_LAUNCH_BLOCKING would not make the kernels synchronous, therefore setting NVCOMPILER_ACC_SYNCHRONOUS may not be very useful either.

  2. We only use one CPU thread here and only one OpenACC queue – async(0). The corresponding CUDA stream was set by the following code:

int queue = acc_get_default_async();
cudaStream_t stream = (cudaStream_t)acc_get_cuda_stream(queue);

thus should there any be race condition like you mentioned, it’s out of our hands.

====================================================================

The followings are my new observations. I hope they will give you more clues.

  1. I added a print statement in function g and now I understood why calling function h gave me a segfault.
void g (double dt, double& a) {
   f (a); // call f_acc(double& a, ...) inside.
   printf("dt=%lf %p\n", dt, &a);
   h (dt, a);
}

/* output
dt=0.001000 0x7ffeff52d868     <== 1st time program called g, normal 48-bit address.
dt=0.121278 0xbfc3b4e909454000 <== 2nd time program called g, weird &a value causing segfault in h (dt, a).
                               <== 64-bit address?
*/
  1. I set “ulimit -s unlimit” like you suggested. No difference.

  2. I replaced “acc wait(0)” with this in “f_acc”:

#pragma acc serial
{ a = 100.; }

It gave me normal output and no segfault.

dt=0.001000 0x7ffc9b2e85d8
dt=0.001000 0x7ffc9b2e85d8
  1. I also tried “#pragma acc wait”, (not wait(0)) in f_acc function, same erroneous output and segfault.

  2. Changing back to “#pragma acc wait(0)”, I managed to added the following break points in gdb.

b *__pgi_uacc_wait+423              # break point a
b *__pgi_uacc_cuda_wait+1164        # break point b
b *__pgi_uacc_cuda_drain_down+643   # break point c

For the first normal output, I found program didn’t have a chance to reach break point c. As for the second broken output, the program crashed not far after stepping into break point c. I have also received these debugging messages.

Thread 1 hit Breakpoint 11, 0x00007fffdc3456ac in __pgi_uacc_cuda_wait (lineno=46, async=0, dindex=1) at ../../src/cuda_wait.c:80
80  ../../src/cuda_wait.c: No such file or directory.
[Switching to thread 4 (Thread 0x7fff425a9700 (LWP 8460))](running)

Thread 1 hit Breakpoint 15, 0x00007fffdc33b3ad in __pgi_uacc_cuda_drain_down (devnum=1, qq=0, test=0, tag=0) at ../../src/cuda_drain.c:74

74  ../../src/cuda_drain.c: No such file or directory.
[Switching to thread 4 (Thread 0x7fff425a9700 (LWP 8460))](running)
[Switching to thread 4 (Thread 0x7fff425a9700 (LWP 8460))](running)
[Switching to thread 4 (Thread 0x7fff425a9700 (LWP 8460))](running)
[Switching to thread 4 (Thread 0x7fff425a9700 (LWP 8460))](running)
[Switching to thread 4 (Thread 0x7fff425a9700 (LWP 8460))](running)

Thread 1 received signal SIGILL, Illegal instruction.
0x00007fffdc33b11d in __pgi_uacc_cuda_download_return (de=0x7fffda69db80, devnum=0, qq=0) at ../../src/cuda_download_event.c:143
143  ../../src/cuda_download_event.c: No such file or directory.

I don’t know how useful these messages are to be honest…

Thanks again.

I’m still perplexed by this one and not sure if I have any good ideas on how to solve this. Though, it still feels like there’s some type of synchronization issue or possibility a corrupted stack.

I’m afraid I’ll probably need to try debugging it myself to get a better sense on what’s going on. Are you able to share the code? (If so, I can email you directly since I presume it’s not code you can share publicly)

Are you using the “-Mcuda” flag during linking? Without the option, the OpenACC does not use the same default stream as CUDA.

It wasn’t clear to me, did you try setting “CUDA_LAUNCH_BLOCKING=1” and it still failed in the same way?

The Illegal instruction is very odd and I’m not sure what to make of it. Though, since it only occurs in the debugger, I’ll assume it’s just a oddity. Though just in case, what CPU are you using?

Hi Mat,

Our code is put on a public GitHub repo. It needs some prerequisites to build though. If you’d prefer, I can also set up an account for you on a Linux box that has a public IP address. I’ll prebuild the prerequisites for you. I will send you another message directly with more details.

I really appreciate your help.

Ps, my laptop is using Intel Core i7-9750H CPU @ 2.60GHz x 12, GeForce RTX 2070 with Max-Q Design.

I did set CUDA_LAUNCH_BLOCKING=1 and it didn’t help.

I believe I didn’t use “-Mcuda” but I didn’t use the default CUDA NULL stream either.

Well, I’ve finally found a pure OpenACC example here, and it’s quite small. You probably don’t need to compile the entire original software from GitHub. The symptoms are a little different from the original post, but I think they still suggested a corrupted stack. This example didn’t crash with segfault, but I think it was an accident. In the process of rearranging the example, I had seen normal exit, weird outputs, segfault, and bus error…

Thanks,


Header file: h.h

#ifndef H_H
#define H_H
extern int n;
extern double *vx, *vy, *vz; // device pointers

void init(); // set values in vx, vy, vz
void end();  // free device pointers
// these functions really don't do anything
void f_acc(double&);
void g(double, double&);
void h_acc(int);
#endif

main.cpp: doesn’t have any OpenACC pragma or function.

#include "h.h"
#include <cstdio>

int main() {
   init();
   double a = 3.14, dt = 0.001;
   for (int i = 0; i < 2; ++i) {
      g(dt, a);
      h_acc(i);
   }
   end();
   return 0;
}

void g(double dt, double& a) {
   printf(" 1 address %p %p\n", &dt, &a);
   f_acc(a);
   printf(" 2 address %p %p\n", &dt, &a);
}

int n;
double *vx, *vy, *vz;

m1.cpp: OpenACC-related code

#include <openacc.h>
#include "h.h"

static const int N = 3;
void init() { // set values in vx, vy, vz
   double cx[N], cy[N], cz[N];
   for (int i = 0; i < N; ++i) {
      cx[i] = i; cy[i] = -i; cz[i] = 1;
   }

   vx = (double*)acc_malloc(sizeof(double)*N);
   vy = (double*)acc_malloc(sizeof(double)*N);
   vz = (double*)acc_malloc(sizeof(double)*N);
   acc_memcpy_to_device(vx, cx, sizeof(double)*N);
   acc_memcpy_to_device(vy, cy, sizeof(double)*N);
   acc_memcpy_to_device(vz, cz, sizeof(double)*N);
   // acc_memcpy_to_device() is probably blocking...
   // I still added acc wait here...
   #pragma acc wait
   
   n = N;
}

void end() { // free device pointers
   acc_free(vx); acc_free(vy); acc_free(vz);
}

void f_acc(double&) { // does nothing
   #pragma acc parallel loop independent async(0)
   for (int i = 0; i < n; ++i) {}
   #pragma acc wait(0)
}

// does not change vx, vy, vz; no output; no return value
void h_acc(int) {
   double vtot1=0, vtot2=0, vtot3=0;
   #pragma acc parallel loop independent async(0) deviceptr(vx,vy,vz)
   for (int i = 0; i < n; ++i) {
      vtot1 += vx[i];
      vtot2 += vy[i];
      vtot3 += vz[i];
   }

   vtot1 /= N;
   vtot2 /= N;
   vtot3 /= N;
}

Makefile

OUT=a20.out
CXX=/opt/nvidia/hpc_sdk/Linux_x86_64/2020/compilers/bin/pgc++

#OUT=a19.out
#CXX=/opt/pgi/linux86-64-llvm/19.10/bin/pgc++

$(OUT): main.cpp m1.cpp h.h
	$(CXX) -g -O0 -std=c++11 -ta=tesla:cc75 -acc verystrict -Minfo=accel -c m1.cpp
	$(CXX) -g -O0 -std=c++11 -c main.cpp
	$(CXX) -g -O0            -ta=tesla:cc75 -acc -o $(OUT) main.o m1.o

Results:

$ ./a19.out 
 1 address 0x7fff1012b948 0x7fff1012b978
 2 address 0x7fff1012b948 0x7fff1012b978
 1 address 0x7fff1012b948 0x7fff1012b978
 2 address 0x7fff1012b948 0x7fff1012b978
===> seemed normal

$ ./a20.out
 1 address 0x7ffc4d4829d8 0x7ffc4d482a08
 2 address 0x7ffc4d4829d8 0x7ffc4d482a08
 1 address 0x7ffc4d4829d8 0x7ffc4d482a08
 2 address 0x4008000000000000 0x7ffc4d482a08
                 ||
                 \/
           &dt has a 64-bit memory address?

Thanks! The reproducer is great and I’m able to reproduce the issue here. Looks like the issue first began with the 20.1 release. Still unclear exactly what’s causing it, so I’ll need a compiler engineer to dig into it, but do think it’s a compiler issue and not an issue with your code. I’ve added a problem report (TPR #29171) and sent it engineer for further investigation.

Setting “NVCOMPILER_ACC_SYNCHRONOUS=1” did work around the problem for me. Otherwise, the only other work around you might try is explicitly removing the async clauses and wait directives, or continue using 19.10 until we get this fixed.

1 Like

Using OpenACC managed data for “vtot1/2/3” variables in “h_acc()” function seemed to be a workaround for my application.

Hi oukore,

Sorry for taking so long to get to the bottom of this. So basically, it is a problem with your code in that the reduction loop in the “h_acc” does need to be blocking since you use the results directly after the loop. So you’ll need to remove the async clause or add a “#pragma acc wait” directly after it. Note that pre-20.1, reductions were implicitly blocking but we were able then to implement the feature of delaying return of the reduction variables until the first wait is encountered.

What was perplexing to me what was causing the stack corruption. Here’s the explanation from engineering:

Without the wait directive, what we have is an ALDD: asynchronous local data download problem. The #pragma acc parallel loop independent deviceptr(vx,vy,vz) async directive has three implicit reduction operations on vtot1, vtot2, vtot3, and hence three implicit copy() operations. Because of the ‘async’ clause, the data operations are also asynchronous. Without the wait directive, the procedure returns before the asynchronous data operations are complete. The subsequent call to ‘g’ eventually calls f_acc which includes a correct ‘wait’ directive, at which point the downloads complete. But those stack locations now belong to other variables, and hence those variables get overwritten.

-Mat

1 Like