Hello,
I have been working on refactoring some OpenACC code, which has involved the use of more C++ lambdas. This led to some surprising runtime errors and (apparent) stack corruption. I believe I have captured the relevant behaviour in the example below.
It appears that when I asynchronously launch a kernel that calls a lambda, the runtime remembers the address of the (stack-allocated, local) lambda and writes to that address at the next synchronisation (#pragma acc wait(0)
) point. In the example, the object has gone out of scope by that point and its (stack) address is now part of buf
.
Even without the use of alloca
and buf
it is possible to see in a debugger stacktraces like
#0 __c_mcopy (dst=0x7fffffffcd18, src=0x7fff86000000, len=1) at ../../src/map2mcopy.h:35
#1 0x00007fffed6cc68f in __pgi_uacc_move_buffer (dd=0x7fff86000010) at ../../src/move_buffer.c:62
#2 0x00007fffed13afa0 in __pgi_uacc_cuda_drain_down (devnum=1, qq=0, test=0, tag=0) at ../../src/cuda_drain.c:131
#3 0x00007fffed145b40 in __pgi_uacc_cuda_wait (lineno=20, async=0, dindex=1) at ../../src/cuda_wait.c:80
...
with dst
addresses that are no longer in scope.
In the OpenACC 3.2 spec (l.1242-3) I see
If a C++ lambda is called in a compute region and does not appear in a data clause, then it is treated as if it appears in a copyin clause on the current construct.
But in the example below, adding copyin(lambda)
to the acc kernels
directive removes the error. Also, if I pass -Minfo=accel
then I see
Generating implicit copy(lambda) [if not already present]
Which seems consistent with the problematic behaviour I see, but not with the part of the spec quoted above.
Some other observations:
- I was not able to reproduce the problem using a plain
int
/float
object, or a custom struct; only a lambda. - The lambda is stateless, so it seems unnecessary for there to be any data transfer at all.
- Returning
&lambda
is just to allow the assertions onold_ptr
, it does not affect the behaviour.
I assume that the example can be made shorter, but this serves to illustrate the point.
#include <alloca.h>
#include <cassert>
constexpr int magic_number = 42424242;
constexpr std::size_t block_size = 1000;
void const* launch(double* p) {
auto const lambda = [](double) { return 42.0; };
#pragma acc kernels present(p[0:1]) async(0)
{
*p = lambda(*p);
}
return λ
}
void wait(void const* old_ptr) {
// allocate a big chunk of the stack
int* buf = static_cast<int*>(alloca(block_size * sizeof(int)));
for(auto i = 0; i < block_size; ++i) {
buf[i] = magic_number;
}
assert(old_ptr >= &buf[0]);
assert(old_ptr < &buf[block_size]);
// the nvhpc runtime will clobber `buf`
#pragma acc wait(0)
for(auto i = 0; i < block_size; ++i) {
assert(buf[i] == magic_number);
}
}
int main() {
double p{1.};
#pragma acc enter data copyin(p)
auto* dead_ptr = launch(&p);
wait(dead_ptr);
return 0;
}
This is compiled with nvc++ -V22.3 -acc -O test.cpp
and produces
$ ./a.out
a.out: test.cpp:24: void wait(const void *): Assertion `buf[i] == magic_number' failed.
Aborted
on my system at optimisation levels up to -O
; there is no error at -O2
(although in the original application, I saw problems with -O2 -g
). I also tried 21.11 and 22.2, and the test fails there too (at -O
).
This behaviour is surprising, and superficially seems not to match the spec (although I do not claim expertise there!), and led to rather difficult-to-diagnose crashes later in execution. Naïvely it seems that the compiler should know that the captured lambda’s lifetime may have ended before the next synchronisation occurs, but I did not see any diagnostics. It would be great if this could be improved in an upcoming compiler release.