Incorrect GPU results with #pragma acc atomic capture

Good morning,

The following (reduced) example of using #pragma acc atomic capture in device code does not give correct results:

#include <algorithm>
#include <array>
#include <cassert>
#ifdef __NVCOMPILER
#include <openacc.h>
#endif
#include <string>
#include <vector>
int increment(int &counter, int always_zero) {
  int i;
  // This seems to be the problematic part:
  #pragma acc atomic capture
  i = counter++;
  // This condition is never satisfied. Replacing it with `always_zero` changes
  // which (wrong)  numbers are returned from this function (on our system).
  if ( i > 1000000 ) {
    assert(0); // replacing with printf makes issue go away
  }
  return i;
}
struct Indirector {
  // We could not reproduce the issue without this indirection.
  int *counter;
};
void kernel(bool compute_gpu, int size, bool always_true, int *record,
            Indirector *indirector, int always_zero) {
#pragma acc parallel loop copy(always_true, always_zero, record [0:2 * size])               \
    present(indirector [0:1]) if (compute_gpu)
  for (int i = 0; i < size; ++i) {
    if (always_true) {
      // record values from the first `increment` in the first half of the array
      record[i] = increment(*indirector->counter, always_zero);
    } else {
      record[i] = -2;
    }
    // record values from the second `increment` in the second half of the array
    record[size + i] = increment(*indirector->counter, always_zero);
  }
}

int main(int argc, char **argv) {
  std::size_t capacity{7};
  if ( argc > 1 ) {
    capacity = std::stoul(argv[1]);
  }
  bool global_pass{true};

  int counter;
  Indirector indirector;
  indirector.counter = &counter;
#ifdef __NVCOMPILER
  auto *d_indirector = static_cast<Indirector*>(acc_copyin(&indirector, sizeof(Indirector)));
  auto *d_nsb = acc_copyin(indirector.counter, sizeof(int));
  acc_memcpy_to_device(&d_indirector->counter, &d_nsb, sizeof(int*));
#endif
  for (int compute_gpu = 0; compute_gpu < 2; ++compute_gpu) {
    std::vector<int> record(2 * capacity, -1);
    counter = 0;
#pragma acc update device(counter) if (compute_gpu)
    kernel(compute_gpu, capacity, true, record.data(), &indirector, 0);
#pragma acc update self(counter) if (compute_gpu)
    assert(counter == int(record.size()));
    std::sort(record.begin(), record.end());
    printf("%s sorted values:", compute_gpu ? "GPU" : "CPU");
    bool pass = true;
    for (auto i = 0; i < int(record.size()); ++i) {
      auto val = record[i];
      if (val != i) {
        global_pass = pass = false;
      }
      printf(" %d", val);
    }
    printf(" %s\n", pass ? "[correct]" : "[incorrect]");
  }
  printf("%s\n", global_pass ? "PASS" : "FAIL");
  return !global_pass;
}

When compiled with nvc++ -acc -Wall -pedantic -o example example.cpp and executed with ./example N the test fails consistently for every N > 1 that I have tried.
Apologies that the example code is a little long, but it seemed worth including the result checking and parametrisation on N; the offloaded kernel is quite brief.

This fails at least with NVHPC 21.2 and 21.7, and at every optimisation level that I tried. The code compiles cleanly with GCC and Clang (-Wall -pedantic, CPU-only) and runs cleanly under cuda-memcheck and valgrind. The reproducer was tested on a RHEL7.6 system with the HPC-SDK installed via Spack

In the real-world problem where we hit this issue then [the equivalent of] removing the assert from the increment method was enough to produce correct results, but it was not clear to us whether there is really anything special about assert, or if we were just lucky that this avoided the problem. Debugging this issue in our complex, legacy code was not trivial and took quite some effort. We referred to OpenACC standard but didn’t find anything obviously wrong with the code.

Please let me know if you need any other information, or have trouble reproducing the issue. It would be great to have it fixed for the next release. Can you give an idea of when that is planned?

Best regards, Olli

1 Like

Hi Olli,

With the 21.3 release, we did have some major updated to our atomic support which introduced a few issues. But those have been largely fixed by 21.7.

For your code, it seems to get correct answers for me with 21.2 and 21.7, but the following error with 21.3 and 21.5. Though I’m able to work around this by adding the internal compiler flag “-Mx,231,0x1” which reverts to the older atomic support.

I’m not sure what accounts for the differences in what you’re seeing (i.e. fails with 21.2 and 21.7) and what I’m seeing.

-Mat

Works with 21.2 and 21.7

% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.2 ; ./example
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
PASS
% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.7 ; ./example
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
PASS

Fails with 21.3 and 21.5

% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.3 ; ./example
example: example.cpp:62: int main(int, char **): Assertion `counter == int(record.size())' failed.
Abort
% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.5 ; ./example
example: example.cpp:62: int main(int, char **): Assertion `counter == int(record.size())' failed.
Abort

Ok with 21.3 if I revert to the old atomics:

% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.3 -Mx,231,0x1 ; ./example
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
PASS

Hi Mat,

Thanks for the quick reply. I realise that I should have included the output that we see:

$ nvc++ -acc -Wall -pedantic -o example example.cpp  -V21.7
$ ./example
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
GPU sorted values: 0 0 1 1 2 2 3 3 4 4 5 5 6 13 [incorrect]
FAIL

this pattern is typical for small N, but for larger numbers it changes a little:

$ ./example 54
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 [correct]
GPU sorted values: 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17 17 18 18 18 19 19 19 20 20 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 85 107 [incorrect]
FAIL

I tried -Mx,231,0x1 with 21.7, but it did not seem to make any difference. In case it helps, this is on a system with driver version 450.102.04, CUDA 11.0 and four V100-SXM2-16GB cards.

Please let me know if there are other tests we can try on our side.

I was confused as to why I didn’t see the error but determined that it only occurs on systems with GNU 4.8.5 installed. Seems ok with newer versions. In order to maintain object compatibility with g++, we need to use the STL that comes with the GNU installation.

With 4.8.5 STL, assert post-processes as:

((0) ? static_cast<void> (0) : __assert_fail ("0", "example.cpp", 17, __PRETTY_FUNCTION__));

Wiht GNU 7.4 giving:

(static_cast <bool> (0) ? void (0) : __assert_fail ("0", "example.cpp", 17, __PRETTY_FUNCTION__));

No idea why this then causes the atomic capture to fail. I went ahead and issued a problem report, TPR #30557, and sent it to engineering for review.

The code does get correct answers if I add “-DNDEBUG” to disable the asserts. Given asserts are mainly used for debugging, are they necessary? If so, the other work around would be to update the GNU version being used.

% g++ --version
g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.7 ; ./example 54
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 [correct]
GPU sorted values: 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17 17 18 18 18 19 19 19 20 20 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 75 107 [incorrect]
FAIL

% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.7 -DNDEBUG; ./example 54
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 [correct]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 [correct]
PASS
% g++ --version
g++ (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

%  nvc++ -acc -Wall -pedantic -o example example.cpp -V21.7 ; ./example 54
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 [correct]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 [correct]
PASS

Hi Mat,

Thanks for the investigation. A few follow up observations:

  • I tried (for the first time) forcing a different CUDA version, and this seems to make a difference. nvc++ -acc -Wall -pedantic -o example example.cpp -V21.7 -gpu=cuda11.4 and nvc++ -acc -Wall -pedantic -o example example.cpp -V21.2 -gpu=cuda11.2 both produce binaries that pass the tests. Both 21.2 and 21.7 still fail if I pass -gpu=cuda11.0, which IIUC is the default on this system.
  • We nominally use GCC 9.3.0 on this system, when installing NVHPC we run makelocalrc pointing to that installation. I have attached the localrc file from my 21.7 installation. But as far as I can see assert.h ultimately comes from /usr/include and the system glibc-headers-2.17-260.el7_6.6.x86_64 package. /usr/bin/g++ is indeed version 4.8.5. (In the localrc file set PREOPTIONS=-D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 is something our PGI/NVHPC deployment script has appended for a while. I checked that commenting out this line does not affect the test case we are discussing.)

Hope this helps.

Best, Olli