CUDA Fortran performance issue with allocatable arrays

Hello,
I am seeing performance degradation in a device kernel that uses allocatable device arrays declared in a Fortran MODULE. The problem seems to happen only on a Kepler (cc3.5) GPU. The program test01.f90 below shows an example.

#define NUM_THREADS 128
module cuda_kernel
  integer,parameter::n0=128
  integer,parameter::n=n0**3
  real,allocatable,device::d_A(:),d_B(:),d_C(:)
contains
  attributes(global) subroutine gpu_kernel(n)
    integer,value::n
    integer::i
    i=(blockidx%x-1)*blockdim%x+threadidx%x   ! global thread index (1-based)
    if(i<=n) d_C(i)=d_A(i)+d_B(i)             ! guard threads past the array end
  end subroutine
end module

program test
  use cudafor
  use cuda_kernel

  implicit none
  real::A(n),B(n),C(n)
  type(dim3)::dimGrid,dimBlock

  type(cudaDeviceProp) :: prop
  character*20 arg
  integer::idevice,ilen,nargs
  integer::i,istat
  real::clock_start,clock_finish

!---
  nargs = command_argument_count()
  idevice = 0
  do i = 1, nargs
    call get_command_argument(i,arg)
    if ((arg(1:7) .eq. "-device") .and. (i.lt.nargs)) then
      call get_command_argument(i+1,arg)
      read(arg,'(i2)') idevice
    end if
  end do
  istat = cudaSetDevice(idevice)
  istat = cudaGetDeviceProperties(prop,idevice)
  ilen = verify(prop%name, ' ', .true.)
  write (*,900) prop%name(1:ilen), &
                real(prop%clockRate)/1000.0, &
                real(prop%totalGlobalMem)/1024.0/1024.0

  dimGrid=dim3((n-1)/NUM_THREADS+1,1,1)
  dimBlock=dim3(NUM_THREADS,1,1)

  do i=1,n
    A(i)=1.0
    B(i)=2.0
  end do

  allocate(d_A(n),d_B(n),d_C(n))

  istat=cudaThreadSynchronize()
  call cpu_time(clock_start)
  d_A=A
  d_B=B
  istat=cudaThreadSynchronize()

  call cpu_time(clock_finish)
  print *,"CPU Time for HtoDcopy = ",     &
          (clock_finish - clock_start), " seconds"
  istat=cudaThreadSynchronize()

  call cpu_time(clock_start)

  do i=1,100000
    call gpu_kernel<<<dimGrid,dimBlock>>>(n)
  end do

  istat = cudaThreadSynchronize()
  call cpu_time(clock_finish)
  print *,"GPU Time for kernel   = ",     &
            (clock_finish - clock_start), " seconds"

  istat = cudaThreadSynchronize()
  call cpu_time(clock_start)

  C = d_C
  istat = cudaThreadSynchronize()
  call cpu_time(clock_finish)
  print *, "CPU Time for DtoHcopy = ",     &
            (clock_finish - clock_start), " seconds"

!  print *, "C(n) = ", C(n)

900 format('\nDevice:',a,', ',f6.1,' MHz clock, ',f6.1,' MB memory.\n')
end program

As for kernel performance, this program runs about twice as slow as an otherwise identical program that differs only in how the device arrays are declared. I have tested two variants with different array declarations, test02.f90 and test03.f90, as follows.

test01.f90 : ALLOCATABLE device arrays in the Fortran MODULE
test02.f90 : explicit-shape device arrays in the Fortran MODULE (sizes fixed at compile time)
test03.f90 : arrays passed as arguments from the main program

A summary of the array declarations:

----- test01.f90 -------
(ALLOCATABLE device arrays in the Fortran MODULE)
module cuda_kernel
  integer,parameter::n0=128
  integer,parameter::n=n0**3
  real,allocatable,device::d_A(:),d_B(:),d_C(:)  !allocatable arrays
contains
  attributes(global) subroutine gpu_kernel(n)
(snip)
----- test02.f90 ------
(explicit-shape device arrays in the Fortran MODULE)
module cuda_kernel
  integer,parameter::n0=128
  integer,parameter::n=n0**3
  real,device::d_A(n),d_B(n),d_C(n)       !explicit shape arrays
contains
  attributes(global) subroutine gpu_kernel(n)
(snip)
---- test03.f90 -------
(arrays passed as arguments from the main program)
module cuda_kernel
  integer,parameter::n0=128
  integer,parameter::n=n0**3
contains
  attributes(global) subroutine gpu_kernel(n,d_A,d_B,d_C) !arrays passed by argument
    integer,value::n
    real,device::d_A(n),d_B(n),d_C(n)
(snip)

Below are results measured on a Kepler GPU (Tesla K20c) with PGI 14.7 + CUDA 6.0.

$ pgf90 -Mcuda=cuda6.0 -O3 -Mpreprocess test01.f90 -o 1.out
$ 1.out -device 0
Device:Tesla K20c,  705.5 MHz clock, 4799.6 MB memory.
 CPU Time for HtoDcopy =    4.2929649E-03  seconds
 GPU Time for kernel   =     34.13584      seconds   ***
 CPU Time for DtoHcopy =    3.1127930E-03  seconds

$ pgf90 -Mcuda=cuda6.0 -O3 -Mpreprocess test02.f90 -o 2.out
$ 2.out -device 0
Device:Tesla K20c,  705.5 MHz clock, 4799.6 MB memory.
 CPU Time for HtoDcopy =    4.6079159E-03  seconds
 GPU Time for kernel   =     17.68302      seconds   ***
 CPU Time for DtoHcopy =    2.7503967E-03  seconds

$ pgf90 -Mcuda=cuda6.0 -O3 -Mpreprocess test03.f90 -o 3.out
$ 3.out -device 0
Device:Tesla K20c,  705.5 MHz clock, 4799.6 MB memory.
 CPU Time for HtoDcopy =    4.3530464E-03  seconds 
 GPU Time for kernel   =     17.69505      seconds   ***
 CPU Time for DtoHcopy =    3.5820007E-03  seconds

The table below summarizes the kernel times measured on three devices.

Performance Summary (seconds) / PGI 14.7 + CUDA 6.0 for Linux
===============================================================================
               Tesla K20c (CC3.5)   GeForce 580 (CC2.0)   Tesla C2075 (CC2.0)
-------------------------------------------------------------------------------
test01.f90          34.1 ***              15.6                  24.2
test02.f90          17.7                  15.0                  22.1
test03.f90          17.7                  14.9                  21.8

test01's 34.1 seconds is about twice the test02/test03 time, and only on the Kepler (K20c).
I have looked into the behavior of test01.f90; my observations are as follows.
(1) The phenomenon happens only when the program runs on a Kepler GPU.
(2) The slowdown is about 2x compared with the other programs.
(3) Fermi GPUs (CC2.x) show reasonable performance.
(4) The performance issue has been present since PGI 13.1.
(5) A test01 executable built with PGI 12.10 does not show the problem, even when run on a Kepler.

I am not sure whether this performance issue comes from a Kepler-specific hardware characteristic or from some other problem, but a 2x degradation seems serious for a Kepler. I would appreciate any advice.

Thank you in advance.
Kato

Hi Kato,

Try enabling the L1 cache for global loads (-Mcuda=loadcache:L1). This is the default for Fermi (cc20) devices; on Kepler (cc35) devices, however, global loads are cached only in L2.

% pgfortran test1.F90 -Mcuda=6.0 -V14.7 -O3 -o test_cc35.out
% test_cc35.out

Device:Tesla K40c,  745.0 MHz clock, ****** MB memory.

 CPU Time for HtoDcopy =    3.1139851E-03  seconds
 GPU Time for kernel   =     34.89708      seconds
 CPU Time for DtoHcopy =    2.3727417E-03  seconds

% pgfortran test1.F90 -Mcuda=6.0,loadcache:L1 -V14.7 -O3 -o test_cc35.out
% test_cc35.out

Device:Tesla K40c,  745.0 MHz clock, ****** MB memory.

 CPU Time for HtoDcopy =    6.8929195E-03  seconds
 GPU Time for kernel   =     14.41163      seconds
 CPU Time for DtoHcopy =    2.3527145E-03  seconds
- Mat

Thanks a lot, Mat.

I tried enabling the L1 loadcache for global loads on my Kepler GPU. Unfortunately, it has no effect on a Tesla K20 GPU.

$ pgf90 test1.F90 -Mcuda=cuda6.0,loadcache:L1 -V14.7 -O3  -o test_cc35.out
$ test_cc35.out

Device:Tesla K20c,  705.5 MHz clock, 4799.6 MB memory.

 CPU Time for HtoDcopy =    4.1198730E-03  seconds
 GPU Time for kernel   =     35.33527      seconds
 CPU Time for DtoHcopy =    3.1127930E-03  seconds

According to the CUDA documentation, the L1 cache on the K20 GPU is reserved for local memory accesses only, while the K40c GPU that you tested can opt in to the Fermi-style behavior of caching both global and local loads in L1. So your K40c result makes sense. I now understand that this problem comes from the L1 cache behavior specific to the K20 (GK110) GPU.
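
As a side note, both the K20 (GK110) and the K40 (GK110B) report compute capability 3.5, so cudaGetDeviceProperties cannot distinguish them by CC alone. Below is a minimal sketch of a runtime check, based on a device-name heuristic of my own (hypothetical, not from this thread, and covering only these two parts):

program check_l1
  use cudafor
  implicit none
  type(cudaDeviceProp)::prop
  integer::istat
  istat = cudaGetDeviceProperties(prop,0)
  print *, 'Device: ', trim(prop%name), ' CC ', prop%major, prop%minor
  ! GK110 (K20) and GK110B (K40) both report CC 3.5, so fall back on the
  ! device name to guess whether global loads can opt in to L1 caching.
  if (index(prop%name,'K40') > 0) then
    print *, 'GK110B part: -Mcuda=loadcache:L1 may cache global loads in L1.'
  else
    print *, 'L1 likely reserved for local accesses; global loads stay in L2.'
  end if
end program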

Thanks,
Kato

(Reference)

1.4.4.2. L1 Cache
L1 caching in Kepler GPUs is reserved only for local memory accesses, such as register spills and stack data. Global loads are cached in L2 only (or in the Read-Only Data Cache).

GK110B-based products such as the Tesla K40 GPU Accelerator retain this behavior by default but also allow applications to opt-in to the Fermi-style behavior of caching both global and local loads in L1. To select this mode, pass the -Xptxas -dlcm=ca flag to nvcc at compile time.
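
One more note from me (not from the original thread): the reference above mentions that global loads on cc3.5 can also go through the Read-Only Data Cache. Under the assumption, which I have not verified on a K20, that the compiler can issue LDG (read-only cache) loads for kernel array arguments declared intent(in), a possible variant of test03's kernel would be:

module cuda_kernel
  integer,parameter::n0=128
  integer,parameter::n=n0**3
contains
  ! Untested sketch: intent(in) marks d_A/d_B as read-only for the kernel's
  ! lifetime, which may let the compiler use LDG loads via the Read-Only
  ! Data Cache on cc3.5 devices.
  attributes(global) subroutine gpu_kernel(n,d_A,d_B,d_C)
    integer,value::n
    real,device,intent(in)::d_A(n),d_B(n)
    real,device,intent(out)::d_C(n)
    integer::i
    i=(blockidx%x-1)*blockdim%x+threadidx%x
    if(i<=n) d_C(i)=d_A(i)+d_B(i)
  end subroutine
end module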