Improve Kernel Performance

When I run the following kernel code the performance is not as good as I would expect. On the CPU it runs in 93.22 seconds; with CUDA Fortran 7.12 seconds; and with the PGI accelerator 2.00 seconds. I am attempting to parallelize approximately 1,054,200 loops but the results don’t seem to scale. How can I know what order of magnitude of improvement I should expect by parallelizing the code? Also is there a way to tell how may threads are actually executing the code? How could the kernel performance be improved?

Thanks for your help

module NORMALIZE_KERNEL
use cudafor

contains

attributes(global) subroutine NORMALIZE_IMAGE_KERNEL(P_RAD, M_RAD, IEND_MEMBER, END_MEMBER, DOTPR_MIN, DOTPR_MAX, DOTPR, NWL, NX, NY, NB)

real, device :: M_RAD(NWL, NB), P_RAD(NX, NY, NWL), DOTPR_MIN(0:255), DOTPR_MAX(0:255), DOTPR(NX, NY)
integer, device :: IEND_MEMBER(NX, NY), END_MEMBER(0:255,2)

integer, value :: NWL, NB, NX, NY, KMAX
integer :: tx, ibk, iwl, i, kk, iy, ix
real :: bl, ibk0, dotpr_matl, iem, dotpr0
real, parameter :: EPSMIN4 = 1.1754944E-38
real, parameter :: CLOSENESS_OF_FIT = 0.90

tx = threadidx%x

i = ( blockidx%x-1 ) * blockdim%x + tx

if ( i .le. NXNY ) then
kk = (i-1)/NX
iy = kk+1
ix = i-kk
NX

bl=0.0
do iwl = 1,NWL
if ( P_RAD(ix,iy,iwl) > 0.0 ) bl = bl + P_RAD(ix,iy,iwl)**2
end do
if ( bl < EPSMIN4 ) then
IEND_MEMBER(ix,iy) = 255
ibk0 = 255
bl = 1.0
else
IEND_MEMBER(ix,iy) = 254
ibk0=254
end if
bl = sqrt(bl)
do iwl = 1,NWL
P_RAD(ix,iy,iwl) = P_RAD(ix,iy,iwl)/bl
end do

dotpr_matl = 0.0
do ibk = 1,NB
iem = END_MEMBER(ibk,1)
dotpr0 = 0.0
do iwl = 1,NWL
dotpr0 = dotpr0 + P_RAD(ix,iy,iwl) * M_RAD(iwl,ibk)
end do

if ( dotpr0 >= CLOSENESS_OF_FIT .AND. dotpr0 > dotpr_matl .AND. iem/=39 .AND. iem/=49 .AND. iem/=17 .AND. iem/=12 .AND. iem/=13 .AND. iem/=16 .AND. iem/=18) then
dotpr_matl = dotpr0
IEND_MEMBER(ix,iy) = END_MEMBER(ibk,1)
DOTPR(ix,iy) = dotpr0
ibk0 = ibk
end if
end do

if ( IEND_MEMBER(ix,iy) == 0 ) END_MEMBER(0,2) = END_MEMBER(0,2) + 1
do ibk = 1, NB
if ( IEND_MEMBER(ix,iy) == END_MEMBER(ibk,1) ) END_MEMBER(ibk,2) = END_MEMBER(ibk,2) + 1
end do
if ( DOTPR(ix,iy) < DOTPR_MIN(ibk0) ) DOTPR_MIN(ibk0) = DOTPR(ix,iy)
if ( DOTPR(ix,iy) > DOTPR_MAX(ibk0) ) DOTPR_MAX(ibk0) = DOTPR(ix,iy)
end if

call syncthreads()

end subroutine

end module NORMALIZE_KERNEL


PROGRAM TESTPGI

USE CONSTANTSPGI
USE NORMALIZE_KERNEL

REAL , PARAMETER :: CLOSENESS_OF_FIT=0.90

INTEGER, PARAMETER :: NB = 64
INTEGER :: I, J, K, KK, IX, IY, IWL, OUTERLOOP, OUTERLOOPMAX, NX, NY, NWL, END_MEMBER(0:255,2)
INTEGER (KIND=KI4), ALLOCATABLE :: IEND_MEMBER(:,:)
REAL, ALLOCATABLE :: P_RAD(:,:,:), M_RAD(:,:), DOTPR(:,:)
REAL :: EXECTIME(5), BL, DOTPR_MATL, DOTPR0
REAL, DEVICE, ALLOCATABLE :: M_RAD_DEV(:,:), P_RAD_DEV(:,:,:), DOTPR_DEV(:,:)
REAL, DEVICE :: DOTPR_MIN(0:255), DOTPR_MAX(0:255)
INTEGER (KIND=KI4), DEVICE, ALLOCATABLE :: IEND_MEMBER_DEV(:,:)
INTEGER, DEVICE :: END_MEMBER_DEV(0:255,2)

DATA ((END_MEMBER(I,J),J=1,2),I=0,NB) / 0,0, &
1,0, 2,0, 3,0, 4,0, 5,0, 6,0, 7,0, 8,0, 9,0, 10,0, 11,0, 12,0, 13,0, 14,0, 15,0, 16,0, &
17,0, 18,0, 19,0, 20,0, 21,0, 22,0, 23,0, 24,0, 25,0, 26,0, 27,0, 28,0, 29,0, 30,0, 31,0, 32,0, 33,0, &
34,0, 35,0, 36,0, 37,0, 38,0, 39,0, 40,0, 41,0, 42,0, 43,0, 44,0, 45,0, 46,0, 47,0, 48,0, 49,0, 50,0, &
51,0, 52,0, 53,0, 54,0, 55,0, 56,0, 57,0, 58,0, 59,0, 60,0, 61,0, 62,0, 63,0, 64,0/
! 65,0, 66,0, 67,0, 68,0, 69,0, 70,0, 71,0, 72,0, 73,0, &
! 74,0, 75,0, 76,0, 77,0, 78,0, 79,0, 80,0, 81,0/

NX = 753
!NY = 1924
NY = 1400
NWL = 224

OUTERLOOPMAX = NX * NY

ALLOCATE(M_RAD(NWL,NB),STAT=IOS)
ALLOCATE(P_RAD(NX,NY,NWL),STAT=IOS)
ALLOCATE(IEND_MEMBER(NX,NY),DOTPR(NX,NY),STAT=IOS)

DO I = 1,NWL
DO J = 1,NX
DO K = 1,NY
P_RAD(J,K,I) = .005 ** 2
END DO
END DO
END DO

DO I = 1,NWL
DO J = 1,NB
M_RAD(I,J) = .003 ** 2
END DO
END DO

DO IBK = 1,NB
DOTPR_MIN(IBK) = 1.0
DOTPR_MAX(IBK) = -1.0
END DO

CALL CPU_TIME(EXECTIME(1))
ALLOCATE(M_RAD_DEV(NWL,NB),STAT=IOS)
ALLOCATE(P_RAD_DEV(NX,NY,NWL),STAT=IOS)
ALLOCATE(IEND_MEMBER_DEV(NX,NY),DOTPR_DEV(NX,NY),STAT=IOS)
IF (IOS/=0) THEN
STOP ‘ALLOCATE Error 6’

END IF

P_RAD_DEV = P_RAD(1:NX, 1:NY, 1:NWL)
END_MEMBER_DEV = END_MEMBER(0:255, 1:2)
IEND_MEMBER_DEV = IEND_MEMBER(1:NX, 1:NY)
DOTPR_DEV = DOTPR(1:NX, 1:NY)

call NORMALIZE_IMAGE_KERNEL<<<(OUTERLOOPMAX-1)/512+1,512>>>(P_RAD_DEV, M_RAD_DEV, IEND_MEMBER_DEV, END_MEMBER_DEV, DOTPR_MIN, DOTPR_MAX, DOTPR_DEV, NWL, NX, NY, NB)

END_MEMBER(1:NBCKGND, 1:2) = END_MEMBER_DEV
IEND_MEMBER(1:NX, 1:NY) = IEND_MEMBER_DEV

CALL CPU_TIME(EXECTIME(2))
print *, "Time ", EXECTIME(2) - EXECTIME(1)
DEALLOCATE(P_RAD,STAT=IOS)
DEALLOCATE(M_RAD,STAT=IOS)

END PROGRAM TESTPGI

Hi Pebbles,

How can I know what order of magnitude of improvement I should expect by parallelizing the code?

There really isn’t a good way of doing this since there are too many factors.

Also is there a way to tell how may threads are actually executing the code?

In the PGI Accelerator model, using the flag “-ta=nvidia,time” will give profile information including the kernel schedule. For CUDA Fortran, you explicitly set the number of threads when you launch your kernel.

How could the kernel performance be improved?

The only specific thing I see off hand is that you should remove the extraneous call to syncthreads.

Try running your program using the PGI profile utility “pgcollect” with the flag “-cuda=gmem” of “-cuda=branch”. This will gather Host and GPU profile information that you can then review within the PGI profiler PGPROF. (for details please refer to the PGI Tools Guide). This should give you a better idea of where your performance bottlenecks are.


While not necessarily performance relevant, I would suggest modifying your kernel schedule to be two dimensions. i.e. change your kernel launch schedule to:


...
INTEGER, DEVICE :: END_MEMBER_DEV(0:255,2)
TYPE(dim3) :: dimGrid, dimBlock
...
dimBlock = dim3(16,16,1)
dimGrid = dim3((NX+15)/16,(NY+15)/16,1)

call NORMALIZE_IMAGE_KERNEL<<<dimGrid,dimBlock>>>(P_RAD_DEV, M_RAD_DEV, IEND_MEMBER_DEV, END_MEMBER_DEV,  &
                                                  DOTPR_MIN, DOTPR_MAX, DOTPR_DEV, NWL, NX, NY, NB)
...

And then you can simply your device kernel index:

ix = ( blockidx%x-1 ) * blockdim%x + threadidx%x
iy = ( blockidx%y-1 ) * blockdim%y + threadidx%y

if ( ix .le. NX .and. iy.le.NY ) then

Hope this helps,
Mat