Incorrect GPU results with #pragma acc atomic capture

Good morning,

The following (reduced) example of using #pragma acc atomic capture in device code does not give correct results:

#include <algorithm>
#include <array>
#include <cassert>
#ifdef __NVCOMPILER
#include <openacc.h>
#endif
#include <string>
#include <vector>
int increment(int &counter, int always_zero) {
  int i;
  // This seems to be the problematic part:
  #pragma acc atomic capture
  i = counter++;
  // This condition is never satisfied. Replacing it with `always_zero` changes
  // which (wrong) numbers are returned from this function (on our system).
  if ( i > 1000000 ) {
    assert(0); // replacing with printf makes issue go away
  }
  return i;
}
struct Indirector {
  // We could not reproduce the issue without this indirection.
  int *counter;
};
void kernel(bool compute_gpu, int size, bool always_true, int *record,
            Indirector *indirector, int always_zero) {
#pragma acc parallel loop copy(always_true, always_zero, record [0:2 * size])               \
    present(indirector [0:1]) if (compute_gpu)
  for (int i = 0; i < size; ++i) {
    if (always_true) {
      // record values from the first `increment` in the first half of the array
      record[i] = increment(*indirector->counter, always_zero);
    } else {
      record[i] = -2;
    }
    // record values from the second `increment` in the second half of the array
    record[size + i] = increment(*indirector->counter, always_zero);
  }
}

int main(int argc, char **argv) {
  std::size_t capacity{7};
  if ( argc > 1 ) {
    capacity = std::stoul(argv[1]);
  }
  bool global_pass{true};

  int counter;
  Indirector indirector;
  indirector.counter = &counter;
#ifdef __NVCOMPILER
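  // Deep copy by hand: create device copies of the Indirector and the counter,
  // then point the device Indirector's counter member at the device counter
  // (rather than the host one) with a small memcpy.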
  auto *d_indirector = static_cast<Indirector*>(acc_copyin(&indirector, sizeof(Indirector)));
  auto *d_nsb = acc_copyin(indirector.counter, sizeof(int));
  acc_memcpy_to_device(&d_indirector->counter, &d_nsb, sizeof(int*));
#endif
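  // Run the kernel twice: once on the CPU (compute_gpu == 0) and once on the
  // GPU (compute_gpu == 1).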
  for (int compute_gpu = 0; compute_gpu < 2; ++compute_gpu) {
    std::vector<int> record(2 * capacity, -1);
    counter = 0;
#pragma acc update device(counter) if (compute_gpu)
    kernel(compute_gpu, capacity, true, record.data(), &indirector, 0);
#pragma acc update self(counter) if (compute_gpu)
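    // each loop iteration calls increment twice, so counter should be 2 * size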
    assert(counter == int(record.size()));
    std::sort(record.begin(), record.end());
    printf("%s sorted values:", compute_gpu ? "GPU" : "CPU");
    bool pass = true;
    for (auto i = 0; i < int(record.size()); ++i) {
      auto val = record[i];
      if (val != i) {
        global_pass = pass = false;
      }
      printf(" %d", val);
    }
    printf(" %s\n", pass ? "[correct]" : "[incorrect]");
  }
  printf("%s\n", global_pass ? "PASS" : "FAIL");
  return !global_pass;
}

When compiled with nvc++ -acc -Wall -pedantic -o example example.cpp and executed as ./example N, the test fails consistently for every N > 1 that I have tried.
Apologies that the example code is a little long, but it seemed worth including the result checking and parametrisation on N; the offloaded kernel is quite brief.

This fails at least with NVHPC 21.2 and 21.7, at every optimisation level that I tried. The code compiles cleanly with GCC and Clang (-Wall -pedantic, CPU-only) and runs cleanly under cuda-memcheck and valgrind. The reproducer was tested on a RHEL 7.6 system with the HPC SDK installed via Spack.

In the real-world problem where we hit this issue, [the equivalent of] removing the assert from the increment method was enough to produce correct results, but it was not clear to us whether there is really anything special about assert, or whether we were just lucky that this avoided the problem. Debugging this issue in our complex, legacy code was not trivial and took quite some effort. We referred to the OpenACC standard but did not find anything obviously wrong with the code.
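
For concreteness, [the equivalent of] the workaround reduced increment to the following sketch (the check and its assert are simply dropped; our real code is of course more involved):

int increment(int &counter, int always_zero) {
  int i;
  #pragma acc atomic capture
  i = counter++;
  return i;
}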

Please let me know if you need any other information, or have trouble reproducing the issue. It would be great to have it fixed for the next release. Can you give an idea of when that is planned?

Best regards, Olli

Hi Olli,

With the 21.3 release we made some major updates to our atomic support, which introduced a few issues, but those have largely been fixed by 21.7.

For your code, I get correct answers with 21.2 and 21.7, but the following error with 21.3 and 21.5. I’m able to work around this by adding the internal compiler flag “-Mx,231,0x1”, which reverts to the older atomic support.

I’m not sure what accounts for the difference between what you’re seeing (failures with 21.2 and 21.7) and what I’m seeing.

-Mat

Works with 21.2 and 21.7:

% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.2 ; ./example
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
PASS
% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.7 ; ./example
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
PASS

Fails with 21.3 and 21.5:

% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.3 ; ./example
example: example.cpp:62: int main(int, char **): Assertion `counter == int(record.size())' failed.
Abort
% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.5 ; ./example
example: example.cpp:62: int main(int, char **): Assertion `counter == int(record.size())' failed.
Abort

Ok with 21.3 if I revert to the old atomics:

% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.3 -Mx,231,0x1 ; ./example
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
PASS

Hi Mat,

Thanks for the quick reply. I realise that I should have included the output that we see:

$ nvc++ -acc -Wall -pedantic -o example example.cpp  -V21.7
$ ./example
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
GPU sorted values: 0 0 1 1 2 2 3 3 4 4 5 5 6 13 [incorrect]
FAIL

This pattern is typical for small N, but for larger values it changes a little:

$ ./example 54
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 [correct]
GPU sorted values: 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17 17 18 18 18 19 19 19 20 20 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 85 107 [incorrect]
FAIL

I tried -Mx,231,0x1 with 21.7, but it did not seem to make any difference. In case it helps, this is on a system with driver version 450.102.04, CUDA 11.0 and four V100-SXM2-16GB cards.

Please let me know if there are other tests we can try on our side.

I was confused as to why I didn’t see the error, but determined that it only occurs on systems with GNU 4.8.5 installed; it seems OK with newer versions. In order to maintain object compatibility with g++, we need to use the STL that comes with the GNU installation.

With the 4.8.5 STL, assert expands after preprocessing to:

((0) ? static_cast<void> (0) : __assert_fail ("0", "example.cpp", 17, __PRETTY_FUNCTION__));

With the GNU 7.4 STL it expands to:

(static_cast <bool> (0) ? void (0) : __assert_fail ("0", "example.cpp", 17, __PRETTY_FUNCTION__));
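
For reference, you can check which expansion your own STL produces by running just the preprocessor and searching for __assert_fail:

% nvc++ -E example.cpp | grep __assert_fail

The exact spelling will depend on which assert.h gets picked up.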

No idea why this then causes the atomic capture to fail. I went ahead and issued a problem report, TPR #30557, and sent it to engineering for review.

The code does get correct answers if I add “-DNDEBUG” to disable the asserts. Given that asserts are mainly used for debugging, are they necessary? If so, the other workaround would be to update the GNU version being used.

% g++ --version
g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.7 ; ./example 54
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 [correct]
GPU sorted values: 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17 17 18 18 18 19 19 19 20 20 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 75 107 [incorrect]
FAIL

% nvc++ -acc -Wall -pedantic -o example example.cpp -V21.7 -DNDEBUG; ./example 54
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 [correct]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 [correct]
PASS
% g++ --version
g++ (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

%  nvc++ -acc -Wall -pedantic -o example example.cpp -V21.7 ; ./example 54
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 [correct]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 [correct]
PASS

Hi Mat,

Thanks for the investigation. A few follow-up observations:

  • I tried (for the first time) forcing a different CUDA version, and this seems to make a difference: nvc++ -acc -Wall -pedantic -o example example.cpp -V21.7 -gpu=cuda11.4 and nvc++ -acc -Wall -pedantic -o example example.cpp -V21.2 -gpu=cuda11.2 both produce binaries that pass the tests. Both 21.2 and 21.7 still fail if I pass -gpu=cuda11.0, which IIUC is the default on this system (see the note below this list).
  • We nominally use GCC 9.3.0 on this system; when installing NVHPC we run makelocalrc pointing to that installation, and I have attached the localrc file from my 21.7 installation. However, as far as I can see, assert.h ultimately comes from /usr/include and the system glibc-headers-2.17-260.el7_6.6.x86_64 package, and /usr/bin/g++ is indeed version 4.8.5. (The line set PREOPTIONS=-D__GCC_ATOMIC_TEST_AND_SET_TRUEVAL=1 in the localrc file is something our PGI/NVHPC deployment script has appended for a while; I checked that commenting it out does not affect the test case we are discussing.)
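
A note on the default mentioned above (my understanding, so please correct me): nvc++ seems to default to the CUDA version matching the installed driver, so the driver’s CUDA version, as reported by nvaccelinfo, would explain why cuda11.0 is chosen here:

$ nvaccelinfo | grep -i 'cuda driver'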

Hope this helps.

Best, Olli

A brief update here: I installed NVHPC 21.9 on our cluster, still using CUDA 11.0, and was still able to reproduce this issue (with both 21.2 and 21.9).

Fortunately the workaround we developed for 21.2 still seems to work with 21.9.

Hope this helps.

Best, Olli

Thanks for the update, Olli. I should have updated you: after investigating more, we don’t think the issue has to do with the GNU version; rather, it may be a problem with the CUDA toolset that seems to have been fixed in CUDA 11.4. Though engineering is still looking at it to be sure, so the problem report hasn’t been closed.

Thanks for the update. I tried with NVHPC 21.9 and CUDA 11.4.2 (*) and I could not reproduce the issue anymore.

(*) I have CUDA 11.4.2 installed on the system, and which nvcc shows that nvcc comes from that standalone installation, but I am not sure whether nvc++ uses this CUDA or the bundled version, which I understand is 11.4.1. Hopefully it isn’t important.

Thanks for the confirmation. Unless you’ve changed your installation configuration, it’s likely nvc++ is using the CUDA 11.4.1 that we ship with the compilers.
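
If you want to double check, compiling with -dryrun prints the commands the compiler driver would run, including the paths of the CUDA components it would use, without executing them:

% nvc++ -acc -dryrun example.cpp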

Hello Mat,

Revisiting this issue with NVHPC 23.1 and CUDA 12.0, we noticed an issue with the CPU execution of this test when we use OpenMP instead of OpenACC.
Here is the original code transformed to use OpenMP instead of OpenACC:

// example_omp.cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <iostream>
#ifdef __NVCOMPILER
#include <omp.h>
#endif
#include <string>
#include <vector>
// Not having the always_zero argument changes the generated results from ones to zeros
int increment(bool compute_gpu, int &counter, int always_zero) {
  int i;
// Attempted fix enabled with NMODL_FIX
#ifdef NMODL_FIX
  // This seems to be the problematic part:
  if (compute_gpu) {
    #pragma omp atomic capture
    i = counter++;
  } else {
    i = counter++;
  }
#else
  #pragma omp atomic capture
  i = counter++;
#endif
  return i;
}
struct Indirector {
  // We could not reproduce the issue without this indirection.
  int *counter;
};
void kernel(bool compute_gpu, int size, bool always_true, int *record,
            Indirector *indirector, int always_zero) {
#pragma omp target teams distribute parallel for map(tofrom : always_true, always_zero, record [0:2 * size]) if (compute_gpu)
  for (int i = 0; i < size; ++i) {
    if (always_true) {
      // record values from the first `increment` in the first half of the array
      record[i] = increment(compute_gpu, *indirector->counter, always_zero);
    } else {
      record[i] = -2;
    }
    // record values from the second `increment` in the second half of the array
    record[size + i] = increment(compute_gpu, *indirector->counter, always_zero);
  }
}

int main(int argc, char **argv) {
  std::size_t capacity{7};
  if ( argc > 1 ) {
    capacity = std::stoul(argv[1]);
  }
  bool global_pass{true};

  int counter;
  Indirector indirector;
  indirector.counter = &counter;
#ifdef __NVCOMPILER
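  // OpenMP equivalent of the OpenACC deep copy above: map indirector and
  // counter to the device, look up their device addresses, then patch the
  // device copy's counter member to point at the device counter.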
  Indirector* d_indirector;
  #pragma omp target enter data map(to : indirector)
  #pragma omp target data use_device_ptr(indirector)
  {
    d_indirector = static_cast<Indirector*>(&indirector);
  }
  int* d_nsb;
  #pragma omp target enter data map(to : counter)
  #pragma omp target data use_device_ptr(counter)
  {
    d_nsb = static_cast<int*>(&counter);
  }
  omp_target_memcpy(&d_indirector->counter,
                    &d_nsb,
                    sizeof(int*),
                    0,
                    0,
                    omp_get_default_device(),
                    omp_get_initial_device());
#endif
  for (int compute_gpu = 0; compute_gpu < 2; ++compute_gpu) {
    std::vector<int> record(2 * capacity, -1);
    counter = 0;
    if (compute_gpu) {
#pragma omp target update to(counter)
    }
    kernel(compute_gpu, capacity, true, record.data(), &indirector, 0);
    if (compute_gpu) {
#pragma omp target update from(counter)
    }
    assert(counter == int(record.size()));
    std::sort(record.begin(), record.end());
    printf("%s sorted values:", compute_gpu ? "GPU" : "CPU");
    bool pass = true;
    for (auto i = 0; i < int(record.size()); ++i) {
      auto val = record[i];
      if (val != i) {
        global_pass = pass = false;
      }
      printf(" %d", val);
    }
    printf(" %s\n", pass ? "[correct]" : "[incorrect]");
  }
  printf("%s\n", global_pass ? "PASS" : "FAIL");
  return !global_pass;
}

Here is the execution output for the different scenarios:
Compilation and execution of the simple OpenACC-to-OpenMP transformation, without the NMODL_FIX code we introduced:

rm -f example example_omp
nvc++ -mp=gpu -Wall -pedantic -o example_omp example_omp.cpp
./example_omp
CPU sorted values: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [incorrect]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
FAIL

Compilation and execution of the OpenACC-to-OpenMP transformation, with the NMODL_FIX code we introduced:

rm -f example_omp
nvc++ -mp=gpu -Wall -pedantic -DNMODL_FIX -o example_omp example_omp.cpp
./example_omp
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
PASS

Compilation and execution of the original OpenACC code:

nvc++ -acc -Wall -pedantic -o example example.cpp
./example
CPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
GPU sorted values: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 [correct]
PASS

Notice that the CPU results with the OpenMP backend are now all 1s, while the OpenMP GPU backend executes correctly without the additional NMODL_FIX code.
I have also tried changing

#pragma omp target teams distribute parallel for map(tofrom : always_true, always_zero, record [0:2 * size]) if (compute_gpu)

to one using metadirective:

#pragma omp metadirective \
        when( user={condition(compute_gpu)}: target teams distribute parallel for map(tofrom : always_true, always_zero, record [0:2 * size]))  \
        default( parallel for )

but it didn’t help without the NMODL_FIX code. I have also reproduced the same issue with NVHPC 22.3 and CUDA 11.6.1.
Thank you very much in advance for looking into this.

Kind regards,
Ioannis

Hi Ioannis,

I’m not sure what’s going on here, so I’ll need to have engineering take a look. It’s likely a compiler issue, so I filed a problem report, TPR #33177.

-Mat

Hello Mat,

Thank you very much for letting us know.

Kind regards,
Ioannis