Good morning,
The following (reduced) example of using #pragma acc atomic capture
in device code does not give correct results:
#include <algorithm>
#include <array>
#include <cassert>
#ifdef __NVCOMPILER
#include <openacc.h>
#endif
#include <string>
#include <vector>
int increment(int &counter, int always_zero) {
int i;
// This seems to be the problematic part:
#pragma acc atomic capture
i = counter++;
// This condition is never satisfied. Replacing it with `always_zero` changes
// which (wrong) numbers are returned from this function (on our system).
if ( i > 1000000 ) {
assert(0); // replacing with printf makes issue go away
}
return i;
}
struct Indirector {
// We could not reproduce the issue without this indirection.
int *counter;
};
void kernel(bool compute_gpu, int size, bool always_true, int *record,
Indirector *indirector, int always_zero) {
#pragma acc parallel loop copy(always_true, always_zero, record [0:2 * size]) \
present(indirector [0:1]) if (compute_gpu)
for (int i = 0; i < size; ++i) {
if (always_true) {
// record values from the first `increment` in the first half of the array
record[i] = increment(*indirector->counter, always_zero);
} else {
record[i] = -2;
}
// record values from the second `increment` in the second half of the array
record[size + i] = increment(*indirector->counter, always_zero);
}
}
int main(int argc, char **argv) {
std::size_t capacity{7};
if ( argc > 1 ) {
capacity = std::stoul(argv[1]);
}
bool global_pass{true};
int counter;
Indirector indirector;
indirector.counter = &counter;
#ifdef __NVCOMPILER
auto *d_indirector = static_cast<Indirector*>(acc_copyin(&indirector, sizeof(Indirector)));
auto *d_nsb = acc_copyin(indirector.counter, sizeof(int));
acc_memcpy_to_device(&d_indirector->counter, &d_nsb, sizeof(int*));
#endif
for (int compute_gpu = 0; compute_gpu < 2; ++compute_gpu) {
std::vector<int> record(2 * capacity, -1);
counter = 0;
#pragma acc update device(counter) if (compute_gpu)
kernel(compute_gpu, capacity, true, record.data(), &indirector, 0);
#pragma acc update self(counter) if (compute_gpu)
assert(counter == int(record.size()));
std::sort(record.begin(), record.end());
printf("%s sorted values:", compute_gpu ? "GPU" : "CPU");
bool pass = true;
for (auto i = 0; i < int(record.size()); ++i) {
auto val = record[i];
if (val != i) {
global_pass = pass = false;
}
printf(" %d", val);
}
printf(" %s\n", pass ? "[correct]" : "[incorrect]");
}
printf("%s\n", global_pass ? "PASS" : "FAIL");
return !global_pass;
}
When compiled with nvc++ -acc -Wall -pedantic -o example example.cpp
and executed with ./example N
the test fails consistently for every N > 1
that I have tried.
Apologies that the example code is a little long, but it seemed worth including the result checking and parametrisation on N
; the offloaded kernel is quite brief.
This fails at least with NVHPC 21.2 and 21.7, and at every optimisation level that I tried. The code compiles cleanly with GCC and Clang (-Wall -pedantic
, CPU-only) and runs cleanly under cuda-memcheck
and valgrind
. The reproducer was tested on a RHEL7.6 system with the HPC-SDK installed via Spack
In the real-world problem where we hit this issue then [the equivalent of] removing the assert
from the increment method was enough to produce correct results, but it was not clear to us whether there is really anything special about assert
, or if we were just lucky that this avoided the problem. Debugging this issue in our complex, legacy code was not trivial and took quite some effort. We referred to OpenACC standard but didn’t find anything obviously wrong with the code.
Please let me know if you need any other information, or have trouble reproducing the issue. It would be great to have it fixed for the next release. Can you give an idea of when that is planned?
Best regards, Olli