About dynamic parallelism of CUDA Fortran

Nightwish · December 3, 2014, 1:58am

Hello!

I write a very simple example to test the dynamic parallelism by PGI Visual Fortran 13.10, the source codes are listed below:

Kernel.cuf:

module Kernel
	real,allocatable,device :: a_d(:,:)
	integer,parameter :: m=10,n=4

contains

	attributes(global) subroutine FatherKernel()
		implicit none

		integer :: i,istat

		i=blockDim%x*(blockIdx%x-1)+threadIdx%x
		
		if (i<=m) then
			call ChildKernel<<<2,2>>>(i)
			istat=cudaDeviceSynchronize()
		end if
	end subroutine

	attributes(global) subroutine ChildKernel(row)
		implicit none

		integer,value :: row

		integer :: i

		i=blockDim%x*(blockIdx%x-1)+threadIdx%x
		
		if (i<=n) then
			a_d(row,i)=row*10+i
		end if
	end subroutine
end module

Main.f90:

program Main
    use Kernel
    use cudafor
    implicit none
    
    real,allocatable :: a(:,:)
    integer :: istat
    
    allocate(a(m,n),a_d(m,n))
    a=0.0
    a_d=a
    
    call FatherKernel<<<2,5>>>()
    istat=cudaDeviceSynchronize()
    
    a=a_d
    
    print *,a
    deallocate(a,a_d)
    
    pause
    stop
end program

The compile command is:

-Bstatic -Mbackslash -Mcuda=cc35,ptxinfo,rdc -I"C:\Program Files\CULA\R17\include" -I"C:\Program Files (x86)\PGI\win32\2013\cuda\5.0\include" -I"C:\Program Files (x86)\PGI\win32\2013\cuda\5.5\include" -I"c:\program files (x86)\pgi\win32\13.10\include" -I"C:\Program Files\PGI\Microsoft Open Tools 11\include" -I"C:\Program Files (x86)\Windows Kits\8.0\Include\shared" -I"C:\Program Files (x86)\Windows Kits\8.0\Include\um" -fastsse -ta=nvidia,cc35 -Minform=warn -Minfo=accel

and:

-Bstatic -Mcuda=cc35,ptxinfo,rdc -ta=nvidia,cc35 -o “C:\Users\KanGuangyuan\Desktop\DynamicParallelism\DynamicParallelism\Win32\Release\DynamicParallelism.exe” cudadevrt.lib

However, the compelation failed, the error message is:

Compiling Project …
Kernel.cuf
ptxas : info : 96 bytes gmem
ptxas : info : Function properties for cudaMalloc
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Function properties for cudaFuncGetAttributes
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Compiling entry function ‘kernel_fatherkernel_’ for ‘sm_35’
ptxas : info : Function properties for kernel_fatherkernel_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 17 registers, 320 bytes cmem[0]
ptxas : info : Compiling entry function ‘kernel_childkernel_’ for ‘sm_35’
ptxas : info : Function properties for kernel_childkernel_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 7 registers, 324 bytes cmem[0]
Main.f90
Linking…
reading response file C:\Users\KANGUA~1\AppData\Local\Temp\pgfortran2apGEbTLdblpRC.lnk
child process exit with signal 2: c:\program files (x86)\pgi\win32\13.10\bin\pgnvd.exe
DynamicParallelism build failed.

and a error message box pumped up, with the prompt message:

“nvlink.exe has stopped working”

How to solve this? Can PGI Visual Fortran compiler utilize the dynamic parallelism?

Nightwish

MatColgrove · December 3, 2014, 6:02pm

Hi Nightwish,

This is a know issue (TPR#19604) with dynamic parallelism on Windows that was fixed in the 14.6 release. Unfortunately there is no work around other than updating your compiler version.

Best Regards,
Mat

cyfengMIT · November 28, 2016, 3:42am

Hi Mat,

I compiled the same simple example with PVF 14.7 and the flags were -Mcuda=cuda6.0,ptxinfo,rdc,cc35.

------ Build started: Project: PVFProject38, Configuration: Debug x64 ------
Compiling Project …
SourceFile1.cuf
ptxas : info : 112 bytes gmem
ptxas : info : Compiling entry function ‘kernel_childkernel_’ for ‘sm_35’
ptxas : info : Function properties for kernel_childkernel_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 10 registers, 324 bytes cmem[0]
ptxas : info : Function properties for cudaMalloc
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Function properties for cudaFuncGetAttributes
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Compiling entry function ‘kernel_fatherkernel_’ for ‘sm_35’
ptxas : info : Function properties for kernel_fatherkernel_
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 34 registers, 320 bytes cmem[0]
ConsoleApp.f90
Linking…
nvlink : warning : SM Arch (‘sm_20’) not found in ‘x64\Debug\SourceFile1.obj’
nvlink : warning : SM Arch (‘sm_30’) not found in ‘x64\Debug\SourceFile1.obj’
PVFProject38 build succeeded.

But I got the error message while executing it on the GTX 850M.

0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)

Did I do something wrong ?
Thank you.

CY

MatColgrove · November 28, 2016, 7:31pm

Hi CY,

This is because you’re compiling for a CC35 device, but running on a CC50 device. Unfortunately, 14.7 didn’t support Maxwell devices so you’ll need to update the compiler release to 15.7 or later to get this to work on a GTX 850M.

Mat

% pgf90 -Mcuda=cc35 -V14.7 test.cuf
% a.out
0: ALLOCATE: copyin Symbol Memcpy FAILED:13(invalid device symbol)
% pgf90 -Mcuda=cc50 -V15.7 test.cuf
% a.out
    11.00000        21.00000        31.00000        41.00000
    51.00000        61.00000        71.00000        81.00000
    91.00000        101.0000        12.00000        22.00000
    32.00000        42.00000        52.00000        62.00000
    72.00000        82.00000        92.00000        102.0000
    13.00000        23.00000        33.00000        43.00000
    53.00000        63.00000        73.00000        83.00000
    93.00000        103.0000        14.00000        24.00000
    34.00000        44.00000        54.00000        64.00000
    74.00000        84.00000        94.00000        104.0000
FORTRAN PAUSE: enter <return> or <ctrl>d to continue>
Warning: ieee_inexact is signaling
FORTRAN STOP

cyfengMIT · November 29, 2016, 10:12am

Hi Mat,

May I compile the example just use these flags “-Mcuda=cuda6.0,ptxinfo,rdc” ?

CY

MatColgrove · November 29, 2016, 11:39pm

Hi Cy,

Assuming that you aren’t calling device routine from one module to another, nor using module device data from one module in another, it might work if you can compile without RDC.

Can you try “-Mcuda=cuda6.0,ptxinfo,nordc”?

Mat

cyfengMIT · November 30, 2016, 1:31am

Hi Mat,

Sure! With the flag “-Mcuda=cuda6.0,ptxinfo,nordc” on PVF 14.7, I got the following error message…

------ Rebuild All started: Project: PVFProject38, Configuration: Debug x64 ------
Deleting intermediate and output files for project ‘PVFProject38’, configuration ‘Debug’
Compiling Project …
SourceFile1.cuf
C:\Users\cyfeng\AppData\Local\Temp\pgcudafor2arOPbZhL3zxFJ.gpu(31): error: calling a host function(“cudaDeviceSynchronize”) from a global function(“kernel_fatherkernel_”) is not allowed

1 error detected in the compilation of “C:\Users\cyfeng\AppData\Local\Temp\pgnvd2axKZbf3US01Xi.nv0”.
E:\PVFProject38\SourceFile1.cuf(1) : error F0155 : Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code
PGF90/x86-64 Windows 14.7-0: compilation aborted
ConsoleApp.f90
SourceFile1.cuf
C:\Users\cyfeng\AppData\Local\Temp\pgcudafor2asqYb2h51IxlF.gpu(31): error: calling a host function(“cudaDeviceSynchronize”) from a global function(“kernel_fatherkernel_”) is not allowed

1 error detected in the compilation of “C:\Users\cyfeng\AppData\Local\Temp\pgnvd2aua_826WpuV-.nv0”.
E:\PVFProject38\SourceFile1.cuf(1) : error F0155 : Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code
PGF90/x86-64 Windows 14.7-0: compilation aborted
PVFProject38 build failed.
Build log was saved at “file://E:\PVFProject38\x64\Debug\BuildLog.htm”

MatColgrove · December 2, 2016, 7:07pm

Hi CY,

Apologies, I missed the original post was about dynamic parallelism and included a device side call to cudaDeviceSynchronize. Without RDC, you’re not going to be able to link to external device routines.

You’ll need to upgrade your compiler to 15.7 or to get support for Maxwell devices.

Mat

Topic		Replies	Views
Dynamic parallelism in PVF cannot compile Legacy PGI Compilers	2	6597	March 12, 2015
CUDA Dynamic Parallelism undefined reference to __fatbinwrap Legacy PGI Compilers	5	11967	April 28, 2015
Error running simple CUDA Fortran program Legacy PGI Compilers	9	21310	February 26, 2010
Dynamic Parallelism : code: 30, reason: unknown error from cudaMalloc and cudaMemcpy CUDA Programming and Performance	1	1750	July 14, 2015
Dynamic Parallelism CUDA Programming and Performance	5	7904	June 24, 2014
CUDA Fortran can not compile Legacy PGI Compilers	5	10186	August 28, 2014
How compile the kernel subroutine containing dgetrf Legacy PGI Compilers	4	5718	December 4, 2013
Problems with the device subprograms Legacy PGI Compilers	4	5913	September 11, 2013
First try compile errors Legacy PGI Compilers	15	14344	August 29, 2013
Declaring local arrays in device code Legacy PGI Compilers	16	9067	June 8, 2012

About dynamic parallelism of CUDA Fortran

Related topics