Problem using the -tp=x64 flag with OpenACC


I have source code with OpenACC directives that works well on GPUs not coupled with x64 CPU architectures. I use the following options:

-acc -c -O3 -Minfo -Mcuda=cc2+

But now I want to use better GPUs, and these GPUs are coupled with x64 CPUs, so I need to add the -tp=x64 flag (as I do with CUDA) at compilation:

-acc -c -O3 -Minfo -Mcuda=cc2+ -tp=x64

But I get the following error when compiling:

nvvmCompileProgram error: 9.
Error: /tmp/pgaccbVWgdlOGjHOs.gpu (960, 24): parse invalid redefinition of function 'functionA'
PGF90-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (mod_functions.f90: 1)
PGF90/x86-64 Linux 16.10-0: compilation aborted

Did I miss something with the -tp=x64 flag? Thank you for your attention!

Hi Mr. Dark,

“-tp=x64” targets a Unified Binary (multiple x86 targets in the same binary), which is not yet supported with OpenACC.

If you need the OpenACC binary to run on multiple generations of x86 architectures, you’ll need to either compile targeting the most generic of the architectures or use “-tp=px-64”, which will run on any modern x86 system.
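For example, the compile line from above could be adjusted like this (mod_functions.f90 stands in for the actual source file; this is a sketch, not a verified command line):

```
# Unified Binary target: not yet supported with OpenACC
pgfortran -acc -c -O3 -Minfo -Mcuda=cc2+ -tp=x64    mod_functions.f90

# Generic x86-64 target: runs on any modern x86 system
pgfortran -acc -c -O3 -Minfo -Mcuda=cc2+ -tp=px-64  mod_functions.f90
```

The difference is only the -tp value: px-64 selects a single portable x86-64 target instead of bundling multiple x86 targets into one binary.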

Hope this helps,

Oh, all right, I understand now; it works well! Thank you.

I would also like to ask about managed memory: I want to declare my structures with the managed attribute, but it doesn’t seem to work (with CUDA I have no problems). Is that normal? Thanks again!

The “-ta=tesla:managed” option (i.e. use CUDA Unified Memory) basically replaces allocation calls such as malloc, new, and allocate with calls to cudaMallocManaged, so it only works with dynamically allocated data.

So if your structures are static, that is why managed isn’t being used.
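A minimal Fortran sketch of the distinction (module and variable names are hypothetical):

```fortran
module mod_data
  implicit none
  ! Dynamically allocated: with -ta=tesla:managed, the allocate()
  ! call is replaced by cudaMallocManaged, so this array lives in
  ! CUDA Unified Memory.
  real, allocatable :: a(:)

  ! Statically allocated: no allocation call to intercept, so this
  ! array is NOT placed in managed memory and still needs explicit
  ! data movement.
  real :: b(1000)
end module mod_data
```

Only `a` benefits from the managed option; `b` behaves exactly as it would without it.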


Oh, okay, so I don’t need to put any data clauses (like !$acc data present, copyin, etc.)?

I just tried this managed option on one example, removing all my data clauses, and yes, it seems to work very well with just the parallel/kernels directives. I get the same results as when I wrote my code with CUDA only.

But if I try another example (basically, it’s just the first example with more CPU functions added), the results are wrong: I get only zeros. Do you have any idea where the problem could be?

Thank you for your time.

Oh, okay, so I don’t need to put any data clauses (like !$acc data present, copyin, etc.)?

You might still need to use present if the type is a class or struct with pointers as data members, or if you want to control the copy direction for statically allocated arrays. Otherwise, no, you probably don’t need to use the data clauses.
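To illustrate the statically-allocated case, here is a small hypothetical sketch: the allocatable array is covered by managed memory and needs no clause, while the static array still needs an explicit copyin:

```fortran
program managed_example
  implicit none
  real :: coeffs(16)          ! static: not covered by -ta=tesla:managed
  real, allocatable :: a(:)   ! allocatable: handled by cudaMallocManaged
  integer :: i

  allocate(a(16))
  coeffs = 2.0

  ! copyin is still required for the statically allocated coeffs;
  ! no data clause is needed for the managed array a.
  !$acc parallel loop copyin(coeffs)
  do i = 1, 16
     a(i) = coeffs(i) * i
  end do

  print *, a(16)
end program managed_example
```

Dropping the copyin here could leave coeffs uninitialized on the device, which is one way to end up with all-zero results like those described above.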

Do you have any idea where the problem could be?

Assuming the data movement is correct (incorrect data movement being the most common cause of wrong answers), I’d start by looking at the compiler feedback messages (-Minfo=accel) to make sure that, after adding the function calls, the code is still parallelized.

Beyond that, I’m not sure. If you can share code, either here or by sending it to PGI Customer Service, it should help in determining the cause.


OK, I understand how it works.

I think I found the problem: -Minfo gives me the details, and indeed the code isn’t well parallelized, so I must try something else. If I don’t find the answer, I’ll ask you for help :)

Thank you again!