OpenACC runtime writes to out-of-scope lambdas

olupton · May 3, 2022, 1:39pm

Hello,

I have been working on refactoring some OpenACC code, which has involved the use of more C++ lambdas. This led to some surprising runtime errors and (apparent) stack corruption. I believe I have captured the relevant behaviour in the example below.

It appears that when I asynchronously launch a kernel that calls a lambda, the runtime remembers the address of the (stack-allocated, local) lambda and writes to that address at the next synchronisation (#pragma acc wait(0)) point. In the example, the object has gone out of scope by that point and its (stack) address is now part of buf.

Even without the use of alloca and buf it is possible to see in a debugger stacktraces like

#0 __c_mcopy (dst=0x7fffffffcd18, src=0x7fff86000000, len=1) at ../../src/map2mcopy.h:35
#1 0x00007fffed6cc68f in __pgi_uacc_move_buffer (dd=0x7fff86000010) at ../../src/move_buffer.c:62
#2 0x00007fffed13afa0 in __pgi_uacc_cuda_drain_down (devnum=1, qq=0, test=0, tag=0) at ../../src/cuda_drain.c:131
#3 0x00007fffed145b40 in __pgi_uacc_cuda_wait (lineno=20, async=0, dindex=1) at ../../src/cuda_wait.c:80
...

with dst addresses that are no longer in scope.

In the OpenACC 3.2 spec (l.1242-3) I see

If a C++ lambda is called in a compute region and does not appear in a data clause, then it is treated as if it appears in a copyin clause on the current construct.

But in the example below, adding copyin(lambda) to the acc kernels directive removes the error. Also, if I pass -Minfo=accel then I see

Generating implicit copy(lambda) [if not already present]

Which seems consistent with the problematic behaviour I see, but not with the part of the spec quoted above.

Some other observations:

I was not able to reproduce the problem using a plain int/float object, or a custom struct; only a lambda.
The lambda is stateless, so it seems unnecessary for there to be any data transfer at all.
Returning &lambda is just to allow the assertions on old_ptr, it does not affect the behaviour.

I assume that the example can be made shorter, but this serves to illustrate the point.

#include <alloca.h>
#include <cassert>
constexpr int magic_number = 42424242;
constexpr std::size_t block_size = 1000;
void const* launch(double* p) {
 auto const lambda = [](double) { return 42.0; };
 #pragma acc kernels present(p[0:1]) async(0)
 {
  *p = lambda(*p);
 }
 return &lambda;
}
void wait(void const* old_ptr) {
 // allocate a big chunk of the stack
 int* buf = static_cast<int*>(alloca(block_size * sizeof(int)));
 for(auto i = 0; i < block_size; ++i) {
  buf[i] = magic_number;
 }
 assert(old_ptr >= &buf[0]);
 assert(old_ptr < &buf[block_size]);
 // the nvhpc runtime will clobber `buf`
 #pragma acc wait(0)
 for(auto i = 0; i < block_size; ++i) {
  assert(buf[i] == magic_number);
 }
}
int main() {
 double p{1.};
 #pragma acc enter data copyin(p)
 auto* dead_ptr = launch(&p);
 wait(dead_ptr);
 return 0;
}

This is compiled with nvc++ -V22.3 -acc -O test.cpp and produces

$ ./a.out
a.out: test.cpp:24: void wait(const void *): Assertion `buf[i] == magic_number' failed.
Aborted

on my system at optimisation levels up to -O; there is no error at -O2 (although in the original application, I saw problems with -O2 -g). I also tried 21.11 and 22.2, and the test fails there too (at -O).

This behaviour is surprising, and superficially seems not to match the spec (although I do not claim expertise there!), and led to rather difficult-to-diagnose crashes later in execution. Naïvely it seems that the compiler should know that the captured lambda’s lifetime may have ended before the next synchronisation occurs, but I did not see any diagnostics. It would be great if this could be improved in an upcoming compiler release.

MatColgrove · May 3, 2022, 4:04pm

Hi Olli,

The core problem is that the code is returning a pointer to a local variable which is not recommended. I can reproduce the error with or without OpenACC:

 % nvc++ test.cpp -O ; a.out
"test.cpp", line 11: warning: returning pointer to local variable
   return &lambda;
          ^

a.out: test.cpp:19: void wait(const void *): Assertion `old_ptr >= &buf[0]' failed.
Abort

-Mat

olupton · May 3, 2022, 4:26pm

Hi Mat,

Sorry, the assertions in the example may not have been the most helpful.
When I run with OpenACC, the assertion that fails is

a.out: test.cpp:24: void wait(const void *): Assertion `buf[i] == magic_number' failed.

i.e. the #pragma acc wait(0) in between the two loops over block_size clobbers one of the values (buf[978] on my system).

I am only returning a local variable so the test case can assert that my stack-allocated buffer (buf) does include the stack address that lambda used to occupy.

On my system then the test passes without OpenACC at -O1 and below:

$ nvc++ -V22.3 -O1 test.cpp
"test.cpp", line 12: warning: returning pointer to local variable
    return &lambda;
           ^

$ ./a.out
$

at higher optimisation levels and without OpenACC (which I had not tested), I confirm your result:

$ nvc++ -V22.3 -O test.cpp
"test.cpp", line 12: warning: returning pointer to local variable
    return &lambda;
           ^

$ ./a.out
a.out: test.cpp:21: void wait(const void *): Assertion `old_ptr < &buf[block_size]' failed.

but I do not think this is important; I guess in this case the optimiser is defeating my contrived example, which was only really targeting OpenACC. I hope this is clear(er)!

Cheers, Olli

MatColgrove · May 3, 2022, 7:52pm

I still assert that the program is in error. Returning an address from the stack will cause issues if it’s accessed since what ever that it pointed to can be different in another call.

Worse here because of the async. Since the copy back from the device is delayed until the wait is encountered, the copy back can overwrite the now different stack causing corruption.

Though, I’m going to assume that this is only an artifact of the example, and not what you’re doing in your full program. So the base question is why is lambda being implicitly copied back?

From the compiler view, a lambda is an object (an unnamed class type that has a function call operator). While we can see that it doesn’t need to be copied back, the compiler isn’t able to and must assume has internal state that’s been updated. So when it’s used back on the host, the compiler must assume that it needs to be copied back.

The work around is to add “copyin(lambda)” on the kernels directive so no copy back is performed.

-Mat

olupton · May 4, 2022, 9:10am

Hi Mat,

OK; for reference I have attached a version of the test that compiles cleanly with -Wall and still fails for -acc -O{0,1,}.

I think we agree about what’s going on, and that explicitly passing copyin avoids the issue.

My reading of the spec is that copyin is supposed to be the default here (because it’s a lambda), while it seems that nvc++ is using copy by default. It also seems that, putting aside the spec for the moment, the compiler should know (e.g. std::is_empty_v<decltype(lambda)>) that the device-to-host copy is not needed; similar logic already seems to be present for scalars and aggregates.

In any case, I just wanted to raise this because it seems that the default/implicit behaviour is quite dangerous, and there were no helpful diagnostics. Maybe in future the compiler could diagnose this pattern, as it should(?) know the lifetime of the local variables and whether or not there is a synchronisation in that scope.

Cheers, Olli

#include <alloca.h>
#include <cassert>
#include <iostream>
constexpr int magic_number = 42424242;
constexpr std::size_t block_size = 64;
void launch(double* p) {
  auto const lambda = [](double) { return 42.0; };
  std::cout << "lambda: " << &lambda << std::endl;
  #pragma acc kernels present(p[0:1]) async(0)
  {
    *p = lambda(*p);
  }
}
void wait() {
  // allocate a big chunk of the stack
  int* buf = static_cast<int*>(alloca(block_size * sizeof(int)));
  std::cout << "buf: " << &buf[0] << ' ' << &buf[block_size-1] << std::endl;
  for(auto i = 0; i < block_size; ++i) {
    buf[i] = magic_number;
  }
  // the nvhpc runtime will clobber `buf`
  #pragma acc wait(0)
  for(auto i = 0; i < block_size; ++i) {
    assert(buf[i] == magic_number);
  }
}
int main() {
  double p{1.};
  #pragma acc enter data copyin(p)
  launch(&p);
  wait();
  return 0;
}

MatColgrove · May 4, 2022, 6:25pm

Thanks Olli. I appreciate you reworking the example. I got focused on the returning of the stack pointer.

The lambda section was added in the OpenACC 3.0 Spec to clarify these types of situations but we haven’t fully implemented the newer specs. But the behavior is not in line with the standard so I added an issue report (TPR #31777) and asked our engineers to investigate.

-Mat