Update C++ object in parallel loop using OpenACC

I am trying to parallelize a loop with a C++ object but the code is around 10x slower on the GPU than on the CPU.

This is an oversimplified version of my code:

#include <cstdlib>
#include <ctime>
#include <vector>
class Test {
public:
  Test() = default;
  ~Test() = default;

  void init() {
    // set member variables
  }

#pragma acc routine seq
  void update(int a, double b, double c, bool d //, and so on...
  ) {
    // perform a lot of calculations and update member variables
  }

  std::vector<float> a;
  std::vector<double> b;
};

Test t;
int main() {
  t.init();

  // read large data from file
  // ...

  // iterate over each row of data
#pragma acc parallel loop
  for (int i = 1; i < data.size(); ++i) {
    t.update(data[i][0], data[i][1] //, and so on...
    );
  }

  return EXIT_SUCCESS;
}

I think I know the reason why this is slow (correct me if I am wrong). Each iteration of the loop is parallel but it tries to update the same Test object: t (and its members). Because of this, all the threads need to synchronize which slows the GPU down. But the issue is, I don’t know how to fix this. And the data I am reading is very large (which is why a GPU would be really useful here).

That's doubtful, given that threads are only implicitly synchronized at the end of the vector loop. Since “update” is a sequential routine, no synchronization would occur inside it.

I’d first double check that the compiler is able to parallelize the code by reviewing the compiler feedback messages (enabled via the “-Minfo=accel” flag).

Next, I’d run the code using the Nsight-Systems profiler to see where the slow performance is coming from. The most common cause is data movement between the device and host.

I thought the reason it may be slowing down was that, in my example code, I have one object Test t which has several data members. Since the iterations of the loop run in parallel, all threads/vectors will perform the computation (i.e. the update() method) and then write the results into that common object. Because of this, wouldn’t it be slowed down, since there would have to be synchronization for each write operation? As opposed to threads/vectors writing to their own object in an array (which I believe doesn’t require synchronization).

I am compiling my code using these flags (in CMake):

set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -w -fast -acc -gpu=managed -Minfo=accel")

The output of -Minfo=accel is a bit verbose because of STL but I believe this is the main part:

58, Generating implicit private(i)
         Generating NVIDIA GPU code
         63, #pragma acc loop seq
     58, Generating implicit copy(...<VARIABLES>) [if not already present]
     63, Loop without integer trip count will be executed in sequential mode
         Complex loop carried dependence of ->__b_St12_Vector_baseISt6vectorIdSaIdEESaIS2_EE._M_impl._M_start,data.__b_St12_Vector_baseISt6vectorIdSaIdEESaIS2_EE._M_impl._M_start+(i*24)->__b_St12_Vector_baseIdSaIdEE._M_impl._M_start,->__b_St12_Vector_baseISt6vectorIdSaIdEESaIS2_EE._M_impl._M_finish prevents parallelization
         Loop carried dependence of -> prevents parallelization
         Loop carried backward dependence of -> prevents vectorization
         Generating implicit firstprivate(...<VARIABLES>)
         Loop carried dependence of -> prevents parallelization

It says: 63, #pragma acc loop seq but I used #pragma acc parallel loop. Why is this different here?

Also, I am new to Nsight and OpenAcc, so I may be wrong, but I don’t see any excessive data movements in the profile. I profiled the code using two commands:

nsys profile -f true -o full_profile ./output
nsys profile -t cuda,openacc -f true -o acc_profile ./output

I have attached the profiles here as well.

Could you also tell me a good resource for profiling OpenAcc code using Nsight Systems? A lot of the tutorials/videos are old, short, or just briefly go over the profile with screenshots.
Profiles.zip (2.1 MB)

It’s saying that the loop is being run sequentially since it can’t determine the loop trip count. This is likely due to the use of “data.size()”, which is a function call. Try assigning the value of “data.size()” to a scalar, then use this scalar for the upper bound.
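For illustration, the hoisting suggested above might look like the sketch below. The function `scale_rows` and its arguments are hypothetical stand-ins, not taken from the original code. The explicit data clauses are an assumption as well; they should be optional when building with -gpu=managed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-row work. Each iteration writes only
// to its own output slot, so the iterations do not depend on each other.
void scale_rows(const std::vector<double>& in, std::vector<double>& out,
                double factor) {
  assert(in.size() == out.size());

  // Hoist the trip count into a plain integer so the compiler sees a
  // countable loop instead of a call to size() in the loop condition.
  const int n = static_cast<int>(in.size());

  // Raw pointers avoid calling std::vector member functions in device code.
  const double* src = in.data();
  double* dst = out.data();

#pragma acc parallel loop copyin(src[0:n]) copyout(dst[0:n])
  for (int i = 0; i < n; ++i) {
    dst[i] = factor * src[i];
  }
}
```

Without the OpenACC flags the pragma is ignored and the loop runs on the host, which is a convenient way to check the results before profiling on the GPU.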

The other messages about the dependencies only mean that the compiler analysis can’t determine if the loop is independent or not. So it’s up to the programmer to ensure the loops can be parallelized.
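On a “parallel loop”, the directive itself is the programmer's assertion that the iterations are independent. If update() really does accumulate into shared state, one common way to make that safe is a reduction. The sketch below assumes the shared state is a single running sum, which the original post does not say; `sum_of_squares` is a hypothetical name.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical example: every iteration contributes to one shared total.
// The reduction clause gives each thread a private partial sum and
// combines the partial sums when the loop finishes.
double sum_of_squares(const std::vector<double>& v) {
  assert(v.size() <= static_cast<std::size_t>(INT_MAX));
  const int n = static_cast<int>(v.size());
  const double* p = v.data();
  double total = 0.0;

#pragma acc parallel loop reduction(+:total) copyin(p[0:n])
  for (int i = 0; i < n; ++i) {
    total += p[i] * p[i];
  }
  return total;
}
```

If the shared state is more than a few scalars, giving each iteration its own output element and combining afterwards is usually the simpler route.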

OpenACC gets compiled to the same device code as CUDA, so any tutorial should apply. That said, I think Mozhgan uses OpenACC in this tutorial, so it may be the most relevant: https://www.youtube.com/watch?v=kKANP0kL_hk

The generated kernel names from our OpenACC implementation are “function name” + the line number from the source file.

Thank you, I will follow this video! Is it possible for you to take a look at the profiles and see what is wrong? I checked the profile again and I only see the data transfers happening at the beginning and end of the program

Also, this is a separate question, but I just wanted to confirm: is the OpenAcc routine directive applied where the function is defined or where it is declared? I have a header file with a class declaration that contains only the method signatures/declarations, and the definitions are in another cpp file. When I apply the routine directive to the function definition in the cpp file, I get errors, but when I apply it to the declarations, the errors go away.

You should fix the code to get the loop to parallelize (see above) and then profile it again.

the OpenAcc routine directive is applied where the function is defined or declared?

It needs to be visible to both the caller and the callee. Putting the routine pragma in the header file next to the prototype makes it visible in both spots. It doesn’t hurt to have it at the definition as well, since it helps with self-documentation, but it isn’t necessary.
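A minimal sketch of that layout, shown in one listing for brevity. The class `Scaler`, the function `apply_all`, and the file names in the comments are all hypothetical, and passing the object by value so it is copied to the device implicitly is an assumption rather than something from the original code.

```cpp
#include <cassert>

// ---- would live in the header (hypothetical scaler.h) ----
class Scaler {
public:
  explicit Scaler(double factor) : factor_(factor) {}

  // Directive on the declaration: every file that includes the header,
  // callers and the defining source file alike, sees it.
#pragma acc routine seq
  double apply(double x) const;

private:
  double factor_;
};

// ---- would live in the source file (hypothetical scaler.cpp) ----
double Scaler::apply(double x) const { return factor_ * x; }

// ---- a caller, possibly in yet another file ----
void apply_all(Scaler s, const double* in, double* out, int n) {
  assert(n >= 0);
#pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
  for (int i = 0; i < n; ++i) {
    out[i] = s.apply(in[i]);  // device call to the seq routine
  }
}
```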

Note that the compiler can auto-generate device routines when the definition is visible in the same compile unit as the caller. It won’t help you here since the definition is in a separate file, but it is useful for things like templates and lambdas.


You should fix the code to get the loop to parallelize (see above) and then profile it again.

Wow. That worked perfectly. Thank you!

