Problems with the device subprograms

When I use a subroutine with the device attribute in CUDA Fortran, I find that the device subprogram must be contained in a module and can only be called by subroutines or functions in that same module.
Is it true?
Why does it work this way?

Hi OceanCloud,

Is it true?

Yes, in older versions of the compiler and by default in the current version. The main issue is that until recently, there wasn’t a linker for device code. Hence, device routines had to be inlined by the compiler, which in turn required them to be placed in the same module as the global routines that call them. (Note that this was true for CUDA C as well, where device routines had to be in the same file scope as the global routines.)

As of CUDA 5.0, we can now link device routines found in external objects when using the “-Mcuda=rdc” flag. A PGinsider article gives a good explanation of its usage.
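To illustrate the point above, here is a minimal sketch of a device routine that lives in a different module (and a different file) from the kernel that calls it. The file names (devmod.cuf, main.cuf), the module and routine names, and the pgfortran command line in the comments are assumptions made up for this example; only the “-Mcuda=rdc” flag comes from the discussion above, and I have not tested this against any particular compiler release.

```fortran
! ---- devmod.cuf (hypothetical file name) ----
module devmod
  implicit none
contains
  ! Device routine defined outside the module that holds the kernel.
  attributes(device) subroutine scale_elem(x, a)
    real :: x
    real, value :: a
    x = a * x
  end subroutine scale_elem
end module devmod

! ---- main.cuf (hypothetical file name) ----
module kernmod
  use devmod
  implicit none
contains
  ! Global routine calling a device routine from another module.
  ! Without relocatable device code this call would have to be inlined.
  attributes(global) subroutine scale_kernel(v, n, a)
    integer, value :: n
    real, value :: a
    real :: v(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) call scale_elem(v(i), a)
  end subroutine scale_kernel
end module kernmod

program main
  use cudafor
  use kernmod
  implicit none
  integer, parameter :: n = 256
  real :: v(n)
  real, device, allocatable :: v_d(:)

  v = 1.0
  allocate(v_d(n))
  v_d = v                                   ! host to device copy
  call scale_kernel<<<1, n>>>(v_d, n, 2.0)
  v = v_d                                   ! device to host copy
  print *, 'all elements scaled: ', all(v == 2.0)
  deallocate(v_d)
end program main

! Assumed build line (devmod.cuf first, so its module file exists):
!   pgfortran -Mcuda=rdc devmod.cuf main.cuf
```

With the default options (no rdc), the same layout is what triggers the restriction asked about in the original question, since the compiler would need to inline scale_elem into the kernel.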



Hope this helps,
Mat

Hi, Mat

Thanks a lot.

I read the PGinsider article you mentioned; maybe the compile option is “-Mcuda=rdc” not “-Mcuda=rdo”. But I don’t quite understand why, when I use the “-Mcuda=rdc” flag together with “allocate” statements in device routines, the compiler gives the error below:

“error F0155 : Compiler failed to translate accelerator region (see -Minfo messages): Unexpected runtime function call”

Why does this error occur?

maybe the compile option is “-Mcuda=rdc” not “-Mcuda=rdo”.

Correct, this was a typo on my part. I’ll go back and edit the post.

“error F0155 : Compiler failed to translate accelerator region (see -Minfo messages): Unexpected runtime function call”

Why does this error occur?

This typically means that a compiler-generated host routine is being added to the device code. The one open bug (TPR#19462) I see with this failure has to do with “pow” when “-i8” is used. This will be fixed in 13.9. If that’s not the same as yours, can you send a reproducing example to PGI Customer Service (trs@pgroup.com)?

Thanks,
Mat

Thanks, Mat

I mean that when I test the code given in the PGinsider article, the three files (dgemmdynamic.cuf, dgemmdynamic_strassen.cuf, dgemmdynamic_streams.cuf) fail to compile.

Environment: PGI Visual Fortran 13.8, Visual Studio 2012, Windows 7 x64
Compile options: -Mcuda=cuda5.0,cc35,rdc
GPU card: K20C

Error message:

dgemmdynamic.cuf
C:\Users\Adiministrator\AppData\Local\Temp\pgcudafor2afqqbp1lEUFtU.gpu(1010): error: identifier "mm88" is undefined

C:\Users\Adiministrator\AppData\Local\Temp\pgcudafor2afqqbp1lEUFtU.gpu(1010): error: identifier "mm28" is undefined

2 errors detected in the compilation of "C:\Users\Adiministrator\AppData\Local\Temp\pgnvd2aGq4bGHw4zbl_.nv0".
D:\Research\Programming\Routine\CUDA Fortran\test\dgemmdynamic.cuf(1) : error F0155 : Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code
PGF90/x86-64 Windows 13.8-0: compilation aborted


dgemmdynamic_strassen.cuf
ptxas C:\Users\Adiministrator\AppData\Local\Temp\pgcudafor4c-qbc9YSk0ywp.ptx, line 2337; : error : Instruction 'kernel function address' requires .target sm_35 or higher
ptxas C:\Users\Adiministrator\AppData\Local\Temp\pgcudafor4c-qbc9YSk0ywp.ptx, line 2441; : error : Instruction 'kernel function address' requires .target sm_35 or higher
ptxas C:\Users\Adiministrator\AppData\Local\Temp\pgcudafor4c-qbc9YSk0ywp.ptx, line 2545; : error : Instruction 'kernel function address' requires .target sm_35 or higher
ptxas C:\Users\Adiministrator\AppData\Local\Temp\pgcudafor4c-qbc9YSk0ywp.ptx, line 3082; : error : Instruction 'kernel function address' requires .target sm_35 or higher
ptxas C:\Users\Adiministrator\AppData\Local\Temp\pgcudafor4c-qbc9YSk0ywp.ptx, line 3177; : error : Instruction 'kernel function address' requires .target sm_35 or higher
ptxas : fatal error : Ptx assembly aborted due to errors
pgnvd-Fatal-Could not spawn c:\program files\pgi\win64/2013/cuda/5.0/bin\ptxas.exe
D:\Research\Programming\Routine\CUDA Fortran\test\dgemmdynamic_strassen.cuf(1) : error F0155 : Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code
PGF90/x86-64 Windows 13.8-0: compilation aborted

dgemmdynamic_streams.cuf
ptxas C:\Users\Adiministrator\AppData\Local\Temp\pgcudafor4c0KubCnuvxO8w.ptx, line 2372; : error : Instruction 'kernel function address' requires .target sm_35 or higher
ptxas C:\Users\Adiministrator\AppData\Local\Temp\pgcudafor4c0KubCnuvxO8w.ptx, line 3257; : error : Instruction 'kernel function address' requires .target sm_35 or higher
ptxas : fatal error : Ptx assembly aborted due to errors
pgnvd-Fatal-Could not spawn c:\program files\pgi\win64/2013/cuda/5.0/bin\ptxas.exe
D:\Research\Programming\Routine\CUDA Fortran\test\dgemmdynamic_streams.cuf(1) : error F0155 : Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code
PGF90/x86-64 Windows 13.8-0: compilation aborted

All three files contain “allocate” statements, and dgemmdynamic_strassen.cuf and dgemmdynamic_streams.cuf also use dynamic parallelism.

Can you tell from the description above where the problem is?