About dynamic parallelism of CUDA Fortran

Hello!

I write a very simple example to test the dynamic parallelism by PGI Visual Fortran 13.10, the source codes are listed below:

Kernel.cuf:

module Kernel
	real,allocatable,device :: a_d(:,:)
	integer,parameter :: m=10,n=4

contains

	attributes(global) subroutine FatherKernel()
		implicit none

		integer :: i,istat

		i=blockDim%x*(blockIdx%x-1)+threadIdx%x
		
		if (i<=m) then
			call ChildKernel<<<2,2>>>(i)
			istat=cudaDeviceSynchronize()
		end if
	end subroutine

	attributes(global) subroutine ChildKernel(row)
		implicit none

		integer,value :: row

		integer :: i

		i=blockDim%x*(blockIdx%x-1)+threadIdx%x
		
		if (i<=n) then
			a_d(row,i)=row*10+i
		end if
	end subroutine
end module

Main.f90:

program Main
    use Kernel
    use cudafor
    implicit none
    
    real,allocatable :: a(:,:)
    integer :: istat
    
    allocate(a(m,n),a_d(m,n))
    a=0.0
    a_d=a
    
    call FatherKernel<<<2,5>>>()
    istat=cudaDeviceSynchronize()
    
    a=a_d
    
    print *,a
    deallocate(a,a_d)
    
    pause
    stop
end program

The compile command is:

-Bstatic -Mbackslash -Mcuda=cc35,ptxinfo,rdc -I"C:\Program Files\CULA\R17\include" -I"C:\Program Files (x86)\PGI\win32\2013\cuda\5.0\include" -I"C:\Program Files (x86)\PGI\win32\2013\cuda\5.5\include" -I"c:\program files (x86)\pgi\win32\13.10\include" -I"C:\Program Files\PGI\Microsoft Open Tools 11\include" -I"C:\Program Files (x86)\Windows Kits\8.0\Include\shared" -I"C:\Program Files (x86)\Windows Kits\8.0\Include\um" -fastsse -ta=nvidia,cc35 -Minform=warn -Minfo=accel

and:

-Bstatic -Mcuda=cc35,ptxinfo,rdc -ta=nvidia,cc35 -o “C:\Users\KanGuangyuan\Desktop\DynamicParallelism\DynamicParallelism\Win32\Release\DynamicParallelism.exe” cudadevrt.lib

However, the compelation failed, the error message is:

Compiling Project …
Kernel.cuf
ptxas : info : 96 bytes gmem
ptxas : info : Function properties for cudaMalloc
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Function properties for cudaFuncGetAttributes
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Compiling entry function ‘kernel_fatherkernel_’ for ‘sm_35’
ptxas : info : Function properties for kernel_fatherkernel_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 17 registers, 320 bytes cmem[0]
ptxas : info : Compiling entry function ‘kernel_childkernel_’ for ‘sm_35’
ptxas : info : Function properties for kernel_childkernel_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 7 registers, 324 bytes cmem[0]
Main.f90
Linking…
reading response file C:\Users\KANGUA~1\AppData\Local\Temp\pgfortran2apGEbTLdblpRC.lnk
child process exit with signal 2: c:\program files (x86)\pgi\win32\13.10\bin\pgnvd.exe
DynamicParallelism build failed.

and a error message box pumped up, with the prompt message:

“nvlink.exe has stopped working”

How to solve this? Can PGI Visual Fortran compiler utilize the dynamic parallelism?

Nightwish

Hi Nightwish,

This is a know issue (TPR#19604) with dynamic parallelism on Windows that was fixed in the 14.6 release. Unfortunately there is no work around other than updating your compiler version.

Best Regards,
Mat

Hi Mat,

I compiled the same simple example with PVF 14.7 and the flags were -Mcuda=cuda6.0,ptxinfo,rdc,cc35.

  • ------ Build started: Project: PVFProject38, Configuration: Debug x64 ------
    Compiling Project …
    SourceFile1.cuf
    ptxas : info : 112 bytes gmem
    ptxas : info : Compiling entry function ‘kernel_childkernel_’ for ‘sm_35’
    ptxas : info : Function properties for kernel_childkernel_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas : info : Used 10 registers, 324 bytes cmem[0]
    ptxas : info : Function properties for cudaMalloc
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas : info : Function properties for cudaFuncGetAttributes
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas : info : Compiling entry function ‘kernel_fatherkernel_’ for ‘sm_35’
    ptxas : info : Function properties for kernel_fatherkernel_
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas : info : Used 34 registers, 320 bytes cmem[0]
    ConsoleApp.f90
    Linking…
    nvlink : warning : SM Arch (‘sm_20’) not found in ‘x64\Debug\SourceFile1.obj’
    nvlink : warning : SM Arch (‘sm_30’) not found in ‘x64\Debug\SourceFile1.obj’
    PVFProject38 build succeeded.

But I got the error message while executing it on the GTX 850M.

0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)

Did I do something wrong ?
Thank you.

CY

Hi CY,

This is because you’re compiling for a CC35 device, but running on a CC50 device. Unfortunately, 14.7 didn’t support Maxwell devices so you’ll need to update the compiler release to 15.7 or later to get this to work on a GTX 850M.

  • Mat
% pgf90 -Mcuda=cc35 -V14.7 test.cuf
% a.out
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
% pgf90 -Mcuda=cc50 -V15.7 test.cuf
% a.out
    11.00000        21.00000        31.00000        41.00000
    51.00000        61.00000        71.00000        81.00000
    91.00000        101.0000        12.00000        22.00000
    32.00000        42.00000        52.00000        62.00000
    72.00000        82.00000        92.00000        102.0000
    13.00000        23.00000        33.00000        43.00000
    53.00000        63.00000        73.00000        83.00000
    93.00000        103.0000        14.00000        24.00000
    34.00000        44.00000        54.00000        64.00000
    74.00000        84.00000        94.00000        104.0000
FORTRAN PAUSE: enter <return> or <ctrl>d to continue>
Warning: ieee_inexact is signaling
FORTRAN STOP

Hi Mat,

May I compile the example just use these flags “-Mcuda=cuda6.0,ptxinfo,rdc” ?

CY

Hi Cy,

Assuming that you aren’t calling device routine from one module to another, nor using module device data from one module in another, it might work if you can compile without RDC.

Can you try “-Mcuda=cuda6.0,ptxinfo,nordc”?

  • Mat

Hi Mat,

Sure! With the flag “-Mcuda=cuda6.0,ptxinfo,nordc” on PVF 14.7, I got the following error message…

  • ------ Rebuild All started: Project: PVFProject38, Configuration: Debug x64 ------
    Deleting intermediate and output files for project ‘PVFProject38’, configuration ‘Debug’
    Compiling Project …
    SourceFile1.cuf
    C:\Users\cyfeng\AppData\Local\Temp\pgcudafor2arOPbZhL3zxFJ.gpu(31): error: calling a host function(“cudaDeviceSynchronize”) from a global function(“kernel_fatherkernel_”) is not allowed

1 error detected in the compilation of “C:\Users\cyfeng\AppData\Local\Temp\pgnvd2axKZbf3US01Xi.nv0”.
E:\PVFProject38\SourceFile1.cuf(1) : error F0155 : Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code
PGF90/x86-64 Windows 14.7-0: compilation aborted
ConsoleApp.f90
SourceFile1.cuf
C:\Users\cyfeng\AppData\Local\Temp\pgcudafor2asqYb2h51IxlF.gpu(31): error: calling a host function(“cudaDeviceSynchronize”) from a global function(“kernel_fatherkernel_”) is not allowed

1 error detected in the compilation of “C:\Users\cyfeng\AppData\Local\Temp\pgnvd2aua_826WpuV-.nv0”.
E:\PVFProject38\SourceFile1.cuf(1) : error F0155 : Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code
PGF90/x86-64 Windows 14.7-0: compilation aborted
PVFProject38 build failed.
Build log was saved at “file://E:\PVFProject38\x64\Debug\BuildLog.htm”

Hi CY,

Apologies, I missed the original post was about dynamic parallelism and included a device side call to cudaDeviceSynchronize. Without RDC, you’re not going to be able to link to external device routines.

You’ll need to upgrade your compiler to 15.7 or to get support for Maxwell devices.

  • Mat