PGCC-S-0000-Internal compiler error. Call in OpenACC region

Hi,

When I add a ‘#pragma acc routine seq’ to the constructor of a templated class, the following error occurs in a different source file:

PGCC-S-0000-Internal compiler error. Call in OpenACC region to support routine - strlen (someSourceFile.cpp: 259)
PGCC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages)  (someSourceFile.cpp: 259)
std::char_traits<char>::length(const char *):
      5, include "someSourceFile.h"
           5, include "map"
               61, include "stl_map.h"
                    63, include "tuple"
                         39, include "array"
                              38, include "stdexcept"
                                   39, include "string"
                                        40, include "char_traits.h"
                                            259, Generating implicit acc routine seq
                                                 Accelerator restriction: unsupported call to support routine 'strlen'

Why does the compiler add the implicit acc routine statement at this point?

I’m using the pgc++ compiler version 15.4.

Thank you
Marco

Hi Marco,

Why does the compiler add the implicit acc routine statement at this point?

All subroutine calls made from within an OpenACC compute region must have a device version available. This means that the OpenACC “routine” directive must be added to all of these subroutines so the compiler knows to generate the device version.

In C++, adding “routine” to all subroutines would be very laborious. Also, there may be cases where the user doesn’t have access to the subroutine’s definition so can’t explicitly add “routine”. Hence, pgc++ will implicitly add “routine” to all visible subroutines called from within a compute region (as well as all routines that they call).

In this case, it looks like you have a call to “std::char_traits<char>::length”, which in turn calls “strlen”. Since strlen’s source isn’t visible to the compiler, it can’t automatically create the device code.
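To illustrate the point about source visibility, here is a minimal sketch of a string-length routine whose definition the compiler can see, so a device version can be generated for it. The names `device_strlen` and `total_length` are hypothetical and not part of any library, and this only helps when the strlen call is in your own code; it does not change what `std::char_traits<char>::length` calls internally.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical replacement for strlen. Because the definition is visible,
// the compiler can build a sequential device version of it.
#pragma acc routine seq
inline std::size_t device_strlen(const char *s)
{
    std::size_t n = 0;
    while (s[n] != '\0') {
        ++n;
    }
    return n;
}

// Hypothetical caller: sums the lengths of `count` NUL-terminated records,
// each stored in a fixed-width slot of `stride` bytes inside `buf`.
long total_length(const char *buf, int count, int stride)
{
    assert(buf && count >= 0 && stride > 0);  // host-side sanity check
    long total = 0;
    #pragma acc parallel loop copyin(buf[0:count*stride]) reduction(+:total)
    for (int i = 0; i < count; i++) {
        total += static_cast<long>(device_strlen(buf + i * stride));
    }
    return total;
}
```

Without OpenACC enabled the pragmas are ignored and the same code runs on the host, which makes it easy to check the logic before offloading.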

  • Mat

I understand this, but as far as I know, the constructor where the ‘routine’ was added does not call anything in someSourceFile.cpp/h, where the compile error is reported. So I don’t understand why the compiler thinks that the call on line 259 of someSourceFile.cpp has to run on the GPU.

I just tried removing that line or moving it to another position, but the error still occurs on the same line, which now contains a comment. How can this happen?

Is there a workaround to run strlen on the GPU? How can I solve this problem?

Marco

Hi Marco,

I just tried removing that line or moving it to another position, but the error still occurs on the same line, which now contains a comment. How can this happen?

My guess is that this isn’t the line where it’s coming from. The compiler wouldn’t try adding it if it wasn’t called. Though, without a reproducing example I can’t tell for sure.

Can you either post or send to PGI Customer Service (trs@pgroup.com) an example?

Thanks,
Mat

Hi Mat,

Posting or sending code (or an example) is complicated, since I am working on a large university code base for my bachelor thesis.

Maybe I can explain the problem in a different way. I have code that looks like this:

#pragma acc parallel loop gang ...
for (int i=0; i<n; i++)
{
  // some Code
  SomeClass x;  // default-constructed ('SomeClass x();' would declare a function)
  // some more code
  #pragma acc loop vector(128) collapse(3)
  for (int j=0; j<n2; j++)
  {
    // some more loops, x used inside here
  }
} // end of this block, so x will be destructed

To make this work, the constructor of SomeClass has to be offloaded to the GPU. Adding

#pragma acc routine seq

results in the error described above at a completely different position in the code.

As a workaround I created a new constructor taking one more argument. This new constructor does the same, but runs on the GPU and will only be called within GPU regions. This looks like this:

class SomeClass {
public:
  SomeClass();            // host-only constructor
  #pragma acc routine seq
  SomeClass(bool gpu);    // same work, but only called inside GPU regions
};

This compiles, but now the destructor causes similar problems (EDIT: was ‘same’ before). The workaround I used for the constructor doesn’t work here, since a destructor can’t take arguments.

What is the right way to use C++ classes in OpenACC parallel regions?

I hope this describes my problem better.

Marco

Hi Marco,

I tend to avoid having classes instantiated within a compute region. While it can be done, besides the issues you’re encountering where the constructor/destructor are more complex in structure, most likely calls to “new/delete” will be made. Dynamic memory allocation from device code can be costly to performance.

How about allocating an array of “SomeClass” on the host, one for each “n”, so you don’t need to execute the constructor/destructor on the device?
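A minimal sketch of that idea, assuming a simple stand-in for “SomeClass” with plain data members and no dynamic allocation (the `Scratch` struct and the `run` function are hypothetical, not taken from Marco’s code). The objects are constructed and destructed on the host, and the compute region only reads and writes their members:

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for SomeClass: plain data, no new/delete.
struct Scratch {
    double sum;
    int count;
    Scratch() : sum(0.0), count(0) {}
};

// Builds one Scratch object per outer iteration on the host, then fills
// them inside the compute region. `in` holds n rows of n2 values each.
std::vector<Scratch> run(const double *in, int n, int n2)
{
    assert(in && n > 0 && n2 > 0);       // host-side sanity check
    std::vector<Scratch> objs(n);        // constructors run here, on the host
    Scratch *p = &objs[0];

    #pragma acc parallel loop gang copy(p[0:n]) copyin(in[0:n*n2])
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        #pragma acc loop vector reduction(+:s)
        for (int j = 0; j < n2; j++) {
            s += in[i * n2 + j];
        }
        p[i].sum = s;                    // object i belongs to iteration i
        p[i].count = n2;
    }
    return objs;                         // destructors also run on the host
}
```

Since no constructor or destructor is called inside the region, no “routine” directive is needed for either of them.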

  • Mat

Hi Mat,

The class constructed in the parallel region provides multidimensional tensor access to a flat 1D array, so operator() is overloaded. What is the right way to use this operator in the parallel region? Do I need to create an array of these tensors on the host and then copy this tensor array to the GPU and then the data array for each tensor?

Sometimes I get the error ‘Unsupported union data type’. I posted a question about this in this forum some days ago, but I got no answer. Can you tell me what this error means and how I can solve it (or even reply to the post: https://forums.developer.nvidia.com/t/unsupported-union-data-type/134468/1)?

Thank you
Marco

P.S.
Jiri Kraus advised me to mention that I work on the ZFS (RWTH Aachen University), so you know which code I’m talking about.

What is the right way to use this operator in the parallel region?

I hesitate to say what is the “right way” because it is situational. However in general I advise instantiating classes on the host.

Do I need to create an array of these tensors on the host and then copy this tensor array to the GPU and then the data array for each tensor?

Given what you describe, my suggestion would be to create an array of these tensors on the host. I would then use the CUDA Unified Memory Beta features (-ta=tesla:managed) of the 15.x compilers to have the CUDA runtime manage the dynamic memory. This will get you running more quickly at the cost of some performance. Once it is working, you can go back and optimize the data movement. Note that the UM beta is currently a separate download package.
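As a rough sketch of what this could look like, assume the tensor is a non-owning view over a flat array that was allocated on the host. The `Tensor3` class and the `fill` function below are hypothetical, not the ZFS classes. With -ta=tesla:managed the intent is that the flat array lives in managed memory, so only the small view object needs to reach the device and the pointer it holds stays valid there:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning 3D view over a flat, row-major 1D array.
class Tensor3 {
public:
    // Constructed on the host; the view does not allocate or free memory.
    Tensor3(double *data, int n0, int n1, int n2)
        : data_(data), n0_(n0), n1_(n1), n2_(n2)
    {
        assert(data && n0 > 0 && n1 > 0 && n2 > 0);
    }

    // Element access; the definition is visible, so a device version
    // can be generated.
    #pragma acc routine seq
    double &operator()(int i, int j, int k) const
    {
        return data_[(static_cast<std::size_t>(i) * n1_ + j) * n2_ + k];
    }

    int n0() const { return n0_; }
    int n1() const { return n1_; }
    int n2() const { return n2_; }

private:
    double *data_;
    int n0_, n1_, n2_;
};

// Writes 100*i + 10*j + k into every element through operator().
void fill(Tensor3 t)
{
    const int n0 = t.n0(), n1 = t.n1(), n2 = t.n2();
    #pragma acc parallel loop collapse(3)
    for (int i = 0; i < n0; i++)
        for (int j = 0; j < n1; j++)
            for (int k = 0; k < n2; k++)
                t(i, j, k) = 100.0 * i + 10.0 * j + k;
}
```

Without managed memory, the flat array would need explicit data clauses and the device copy of the view would need its pointer fixed up, which is the data movement you can optimize later.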

Sometimes I get the error ‘Unsupported union data type’. I posted a question about this in this forum some days ago, but I got no answer. Can you tell me what this error means and how I can solve it (or even reply to the post:

Apologies for not getting you an answer on this one. Basically it’s a generic compiler error where it’s generating bad GPU code. However without a reproducing example I can’t tell what’s causing it nor how to work around it.

Jiri Kraus advised me to mention that I work on the ZFS (RWTH Aachen University), so you know which code I’m talking about.

Yes, he let me know.

  • Mat