Hello,
I am seeing performance degradation in a device kernel that uses allocatable device arrays declared in a Fortran MODULE. The problem seems to occur only on a Kepler (CC 3.5) GPU. The program test01.f90 below is an example.
#define NUM_THREADS 128

module cuda_kernel
  integer,parameter :: n0 = 128
  integer,parameter :: n  = n0**3
  ! allocatable device arrays declared at module scope
  real,allocatable,device :: d_A(:), d_B(:), d_C(:)
contains
  attributes(global) subroutine gpu_kernel(n)
    integer,value :: n
    integer :: i
    i = (blockidx%x-1)*blockdim%x + threadidx%x
    if (i < n+1) d_C(i) = d_A(i) + d_B(i)
  end subroutine
end module

program test
  use cudafor
  use cuda_kernel
  implicit none
  real :: A(n), B(n), C(n)
  type(dim3) :: dimGrid, dimBlock
  type(cudaDeviceProp) :: prop
  character*20 arg
  integer :: idevice, ilen, nargs
  integer :: i, istat
  real :: clock_start, clock_finish
  !--- parse "-device <id>" from the command line
  nargs = command_argument_count()
  idevice = 0
  do i = 1, nargs
    call get_command_argument(i,arg)
    if ((arg(1:7) .eq. "-device") .and. (i .lt. nargs)) then
      call get_command_argument(i+1,arg)
      read(arg,'(i2)') idevice
    end if
  end do
  istat = cudaSetDevice(idevice)
  istat = cudaGetDeviceProperties(prop,idevice)
  ilen = verify(prop%name, ' ', .true.)
  write (*,900) prop%name(1:ilen), &
                real(prop%clockRate)/1000.0, &
                real(prop%totalGlobalMem)/1024.0/1024.0
  dimGrid  = dim3((n-1)/NUM_THREADS+1, 1, 1)
  dimBlock = dim3(NUM_THREADS, 1, 1)
  do i = 1, n
    A(i) = 1.0
    B(i) = 2.0
  end do
  allocate(d_A(n), d_B(n), d_C(n))

  ! host-to-device copy
  istat = cudaThreadSynchronize()
  call cpu_time(clock_start)
  d_A = A
  d_B = B
  istat = cudaThreadSynchronize()
  call cpu_time(clock_finish)
  print *, "CPU Time for HtoDcopy = ", &
           (clock_finish - clock_start), " seconds"

  ! launch the kernel 100000 times and measure the total time
  istat = cudaThreadSynchronize()
  call cpu_time(clock_start)
  do i = 1, 100000
    call gpu_kernel<<<dimGrid,dimBlock>>>(n)
  end do
  istat = cudaThreadSynchronize()
  call cpu_time(clock_finish)
  print *, "GPU Time for kernel = ", &
           (clock_finish - clock_start), " seconds"

  ! device-to-host copy
  istat = cudaThreadSynchronize()
  call cpu_time(clock_start)
  C = d_C
  istat = cudaThreadSynchronize()
  call cpu_time(clock_finish)
  print *, "CPU Time for DtoHcopy = ", &
           (clock_finish - clock_start), " seconds"
! print *, "C(n) = ", C(n)

900 format('\nDevice:',a,', ',f6.1,' MHz clock, ',f6.1,' MB memory.\n')
end program
The kernel in this program runs about twice as slow as an otherwise identical program that differs only in how the device arrays are declared (their DIMENSION attribute). To isolate the effect, I prepared two more programs, test02.f90 and test03.f90, which differ only in the array declarations:
test01.f90 : using ALLOCATABLE arrays in the Fortran MODULE
test02.f90 : using explicit-shape arrays in the Fortran MODULE (array size specified statically)
test03.f90 : using arrays passed as arguments from the main program
The relevant array declarations are summarized below.
----- test01.f90 -------
(Using ALLOCATABLE arrays in the Fortran MODULE)
module cuda_kernel
  integer,parameter :: n0 = 128
  integer,parameter :: n  = n0**3
  real,allocatable,device :: d_A(:), d_B(:), d_C(:)   ! allocatable arrays
contains
  attributes(global) subroutine gpu_kernel(n)
  (snip)

----- test02.f90 ------
(Using explicit-shape arrays in the Fortran MODULE)
module cuda_kernel
  integer,parameter :: n0 = 128
  integer,parameter :: n  = n0**3
  real,device :: d_A(n), d_B(n), d_C(n)                ! explicit-shape arrays
contains
  attributes(global) subroutine gpu_kernel(n)
  (snip)

---- test03.f90 -------
(Using arrays passed as arguments from the main program)
module cuda_kernel
  integer,parameter :: n0 = 128
  integer,parameter :: n  = n0**3
contains
  attributes(global) subroutine gpu_kernel(n,d_A,d_B,d_C)  ! arrays passed as arguments
    integer,value :: n
    real,device :: d_A(n), d_B(n), d_C(n)
  (snip)
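For clarity, this is roughly how the test03 variant is driven from the host side. This is a sketch only: I am assuming the device arrays are declared allocatable in the main program rather than in the module, and nthreads stands in for the NUM_THREADS macro.

program test03_main_sketch
  use cudafor
  use cuda_kernel                       ! provides n and gpu_kernel from test03.f90
  implicit none
  integer,parameter :: nthreads = 128   ! stands in for the NUM_THREADS macro
  real :: A(n), B(n), C(n)
  real,allocatable,device :: d_A(:), d_B(:), d_C(:)   ! device arrays now owned by the main program
  type(dim3) :: dimGrid, dimBlock
  integer :: istat

  A = 1.0
  B = 2.0
  allocate(d_A(n), d_B(n), d_C(n))
  d_A = A
  d_B = B
  dimGrid  = dim3((n-1)/nthreads+1, 1, 1)
  dimBlock = dim3(nthreads, 1, 1)
  call gpu_kernel<<<dimGrid,dimBlock>>>(n, d_A, d_B, d_C)  ! arrays passed as kernel arguments
  istat = cudaThreadSynchronize()
  C = d_C
  print *, "C(n) = ", C(n)
end program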
Below are the results on a Kepler GPU (Tesla K20c) with PGI 14.7 + CUDA 6.0.
$ pgf90 -Mcuda=cuda6.0 -O3 -Mpreprocess test01.f90 -o 1.out
$ 1.out -device 0
Device:Tesla K20c, 705.5 MHz clock, 4799.6 MB memory.
CPU Time for HtoDcopy = 4.2929649E-03 seconds
GPU Time for kernel = 34.13584 seconds ***
CPU Time for DtoHcopy = 3.1127930E-03 seconds
$ pgf90 -Mcuda=cuda6.0 -O3 -Mpreprocess test02.f90 -o 2.out
$ 2.out -device 0
Device:Tesla K20c, 705.5 MHz clock, 4799.6 MB memory.
CPU Time for HtoDcopy = 4.6079159E-03 seconds
GPU Time for kernel = 17.68302 seconds ***
CPU Time for DtoHcopy = 2.7503967E-03 seconds
$ pgf90 -Mcuda=cuda6.0 -O3 -Mpreprocess test03.f90 -o 3.out
$ 3.out -device 0
Device:Tesla K20c, 705.5 MHz clock, 4799.6 MB memory.
CPU Time for HtoDcopy = 4.3530464E-03 seconds
GPU Time for kernel = 17.69505 seconds ***
CPU Time for DtoHcopy = 3.5820007E-03 seconds
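For reference, the kernel-only times could also be cross-checked with a profiler, e.g.
$ nvprof ./1.out -device 0
The figures above are from cpu_time measured around cudaThreadSynchronize calls.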
The table below summarizes the kernel times measured on three devices.
Performance Summary, kernel time in seconds / PGI 14.7 + CUDA 6.0 for Linux
================================================================================
              Tesla K20c (CC3.5)   GeForce 580 (CC2.0)   Tesla C2075 (CC2.0)
--------------------------------------------------------------------------------
test01.f90         34.1 (*)             15.6                  24.2
test02.f90         17.7                 15.0                  22.1
test03.f90         17.7                 14.9                  21.8
--------------------------------------------------------------------------------
(*) allocatable module arrays on Kepler: about 2x slower
Only on the Kepler (K20c) is test01's kernel time (34.1 seconds) roughly twice that of test02/test03.
I have looked into the behavior of test01.f90; my observations are as follows.
(1) The phenomenon occurs only when the program runs on a Kepler GPU.
(2) The slowdown is about a factor of 2 compared with the other programs.
(3) Fermi GPUs (CC 2.x) show reasonable performance.
(4) The performance issue has been present since PGI 13.1.
(5) A test01 executable built with PGI 12.10 does not show the problem, even when run on a Kepler.
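A further variation, not covered by the three tests above, would keep the allocatable arrays in the module but also pass them to the kernel as explicit arguments. A minimal, untested sketch following the test01 names:

module cuda_kernel
  integer,parameter :: n0 = 128
  integer,parameter :: n  = n0**3
  real,allocatable,device :: d_A(:), d_B(:), d_C(:)   ! still allocatable module arrays
contains
  attributes(global) subroutine gpu_kernel(n, A, B, C)
    integer,value :: n
    real,device :: A(n), B(n), C(n)                   ! dummies alias the module arrays passed at launch
    integer :: i
    i = (blockidx%x-1)*blockdim%x + threadidx%x
    if (i <= n) C(i) = A(i) + B(i)
  end subroutine
end module

The host side would then launch it as
call gpu_kernel<<<dimGrid,dimBlock>>>(n, d_A, d_B, d_C)
with everything else unchanged from test01.f90.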
I am not sure whether this performance issue is inherent to Kepler GPUs or caused by some other problem, but the degradation seems serious on Kepler. I would appreciate any advice.
Thank you in advance.
Kato