I have a whole collection of programs that simply do not perform on the GPU. I guess that I am making the same mistake systematically, but I cannot figure out what it is. This benchmark (see below) generates a number of arrays of “random” integers on the GPU, one for each thread. It sorts each array twice, ascending and descending, and returns the largest and smallest values to the host. I compare the speed of the sort on the GPU with the speed of the same program run in emulation mode on the CPU.
The GPU version runs more than two orders of magnitude slower for each individual sort. This makes no sense at all given that the ratio of CPU and GPU (Tesla) clock speeds is about three.
Compile with:
pgfortran -ta=nvidia out.CUF -fast -o sortGPU
Run with:
time sortGPU 1000000 32 1
Result:
n,nthread,nblock 1000000 32 1
0
0
0
-9223370653486555947 9223369301284663641
GraPU version: time 2.387E+00
[Note this time has been divided by nthread*nblock]
Time spent in user mode (CPU seconds) : 76.392s
Compile with:
pgfortran -Mcuda=emu out.CUF -fast -o sortCPU
Run with:
time sortCPU 1000000 32 1
Result:
n,nthread,nblock 1000000 32 1
0
0
0
-9223370653486555947 9223369301284663641
GraPU version: time 6.444E-02
Time spent in user mode (CPU seconds) : 12.370s
Note that the 32-fold parallel GPU run takes 76 seconds, while the CPU run, which executes the same 32 threads serially, takes 12 seconds.
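To make the comparison explicit, here is a small sketch of the arithmetic behind the figures quoted above. It only undoes the division by nthread*nblock that the program applies before printing. The program and variable names are illustrative, and the per-thread figure rests on an assumption that is marked in the comments: that the 32 GPU threads run concurrently while the emulated threads run one after another. Under that assumption the timed launch takes roughly 76 s on the GPU against roughly 2 s in emulation.

```fortran
program timing_ratio
  implicit none
  ! Values copied from the two runs quoted above.
  real, parameter :: t_gpu = 2.387e+00   ! GPU: printed time, already divided by nthread*nblock
  real, parameter :: t_emu = 6.444e-02   ! emulation: printed time, same normalisation
  integer, parameter :: nthread = 32, nblock = 1
  real :: wall_gpu, wall_emu

  ! Undo the division to recover the wall-clock time of the timed launch.
  wall_gpu = t_gpu * nthread * nblock
  wall_emu = t_emu * nthread * nblock

  write(*,'(a,f8.2,a)') 'GPU timed launch:       ', wall_gpu, ' s'
  write(*,'(a,f8.2,a)') 'emulation timed launch: ', wall_emu, ' s'

  ! ASSUMPTION: the GPU threads run concurrently, so one thread's work takes
  ! wall_gpu; the emulated threads run serially, so one thread's work takes t_emu.
  write(*,'(a,f8.1)') 'per-thread slowdown on the GPU: ', wall_gpu / t_emu

  ! Ratio of the two printed (normalised) times, with no assumption applied.
  write(*,'(a,f8.1)') 'ratio of printed times:         ', t_gpu / t_emu
end program timing_ratio
```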
Although I have not done so for this program, in other cases I have looked at the profiler output to see whether I was constantly copying between host and device, but I found nothing to indicate that this was happening.
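As a cross-check that does not depend on the profiler, the kernel time and the device-to-host copy time could be measured separately with CUDA events. What follows is a minimal sketch, not the benchmark itself: the kernel `fill`, the module `copy_demo` and the array size are hypothetical stand-ins, and it assumes the event routines provided by the `cudafor` module (`cudaEventCreate`, `cudaEventRecord`, `cudaEventSynchronize`, `cudaEventElapsedTime`, `cudaEventDestroy`).

```fortran
module copy_demo
  use cudafor
  implicit none
contains
  ! Hypothetical stand-in kernel; it is NOT the bench kernel from this post.
  attributes(global) subroutine fill(n, a)
    integer, value :: n
    integer :: a(n)
    integer :: i
    i = (blockidx%x - 1) * blockdim%x + threadidx%x
    if (i <= n) a(i) = i
  end subroutine fill
end module copy_demo

program time_kernel_and_copy
  use cudafor
  use copy_demo
  implicit none
  integer, parameter :: n = 1024        ! illustrative size only
  integer :: a(n), istat
  integer, device :: d_a(n)
  type(cudaEvent) :: e0, e1, e2
  real :: ms_kernel, ms_copy

  istat = cudaEventCreate(e0)
  istat = cudaEventCreate(e1)
  istat = cudaEventCreate(e2)

  istat = cudaEventRecord(e0, 0)        ! before the kernel
  call fill<<<(n + 255) / 256, 256>>>(n, d_a)
  istat = cudaEventRecord(e1, 0)        ! after the kernel, before the copy
  a = d_a                               ! device-to-host copy
  istat = cudaEventRecord(e2, 0)        ! after the copy
  istat = cudaEventSynchronize(e2)      ! wait until everything above has finished

  ! Elapsed times are returned in milliseconds.
  istat = cudaEventElapsedTime(ms_kernel, e0, e1)
  istat = cudaEventElapsedTime(ms_copy, e1, e2)
  write(*,'(a,es10.3)') 'kernel time (ms): ', ms_kernel
  write(*,'(a,es10.3)') 'copy time (ms):   ', ms_copy

  istat = cudaEventDestroy(e0)
  istat = cudaEventDestroy(e1)
  istat = cudaEventDestroy(e2)
end program time_kernel_and_copy
```

If the copy time came out small compared with the kernel time, host-device transfers could be ruled out as the cause of the slowdown.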
Here is the infamous program:
module types
  integer, parameter :: INTX=8   ! kind of the 8-byte integers being sorted
end module types
module sort
  use types
  use cudafor
  implicit none
contains
  attributes(device) subroutine sort_int(n,int,isign)
    ! sort_int: purpose: sort array in ascending or descending order
    ! (in-place heapsort; isign >= 0 sorts ascending, isign < 0 descending)
    use types
    implicit none
    integer :: n,isign
    integer(INTX) :: int(n)
    logical :: precedes
    integer(INTX) :: i1,i2,int1
    integer :: l,ir,i,j
    precedes(i1,i2)=i1 .le. i2   ! statement function: ordering predicate
    if(n .le. 1) return
    l=n/2+1
    ir=n
    if(isign .ge. 0) then ! ascending sort
 10   continue
      if(l.gt.1)then
        ! heap-building phase
        l=l-1
        int1=int(l)
      else
        ! extraction phase: move the heap top to the end of the array
        int1=int(ir)
        int(ir)=int(1)
        ir=ir-1
        if(ir.eq.1) then
          int(1)=int1
          goto 30
        endif
      endif
      i=l
      j=l+l
      ! sift int1 down to its place in the heap
 20   if(j.le.ir) then
        if(j.lt.ir) then
          if(precedes(int(j),int(j+1))) j=j+1
        endif
        if(precedes(int1,int(j))) then
          int(i)=int(j)
          i=j
          j=j+j
        else
          j=ir+1
        endif
        go to 20
      endif
      int(i)=int1
      go to 10
    else ! descending sort
 11   continue
      if(l.gt.1)then
        l=l-1
        int1=int(l)
      else
        int1=int(ir)
        int(ir)=int(1)
        ir=ir-1
        if(ir.eq.1) then
          int(1)=int1
          goto 30
        endif
      endif
      i=l
      j=l+l
 21   if(j.le.ir) then
        if(j.lt.ir) then
          if(.not.precedes(int(j),int(j+1))) j=j+1
        endif
        if(.not.precedes(int1,int(j))) then
          int(i)=int(j)
          i=j
          j=j+j
        else
          j=ir+1
        endif
        go to 21
      endif
      int(i)=int1
      go to 11
    endif
 30 continue
  end subroutine sort_int

  attributes(global) subroutine bench(n,mm,nthread,nblock,intg)
    integer, value :: n,nthread,nblock
    integer(INTX) :: mm(4,nthread,nblock)
    integer(INTX) :: intg(n,nthread)
    integer :: i,isign,j
    ! each thread fills its own column of intg (repeated multiplication by 3)
    j=threadidx%x
    intg(1,j)=j
    do i=2,n
      intg(i,j)=intg(i-1,j)*3
    end do
    ! ascending sort: store first and last element
    isign=1
    call sort_int(n,intg(1,threadidx%x),isign)
    mm(1,threadidx%x,blockidx%x)=intg(1,threadidx%x)
    mm(2,threadidx%x,blockidx%x)=intg(n,threadidx%x)
    ! descending sort: store first and last element
    isign=-1
    call sort_int(n,intg(1,threadidx%x),isign)
    mm(3,threadidx%x,blockidx%x)=intg(1,threadidx%x)
    mm(4,threadidx%x,blockidx%x)=intg(n,threadidx%x)
  end subroutine bench
end module sort
program main
  use types
  use sort
  implicit none
  integer :: i,n,t1,t2,ticks,nthread,nblock,istat
  real :: time
  character*20 :: input
  integer(INTX), allocatable :: mm(:,:,:)
  integer(INTX), device, allocatable :: d_mm(:,:,:)
  integer(INTX), device, allocatable :: d_intg(:,:)
  integer(INTX), allocatable :: intg(:,:)
  ! command-line arguments: n, nthread, nblock
  call getarg(1,input)
  read(input,*) n
  nthread=1
  nblock=1
  call getarg(2,input)
  read(input,*) nthread
  call getarg(3,input)
  read(input,*) nblock
  write(*,*) 'n,nthread,nblock',n,nthread,nblock
  allocate(d_intg(n,nthread),stat=istat)
  write(*,*) istat
  allocate(d_mm(4,nthread,nblock),stat=istat)
  write(*,*) istat
  allocate(mm(4,nthread,nblock),stat=istat)
  write(*,*) istat
  allocate(intg(n,nthread))
  d_mm=0
  d_intg=0
  ! first launch; t1 is overwritten below, so this launch is not timed
  call system_clock(t1)
  call bench<<<nblock,nthread>>>(n,d_mm,nthread,nblock,d_intg)
  mm=d_mm
  d_intg=0
  ! second launch; this is the one that is timed
  call system_clock(t1)
  call bench<<<nblock,nthread>>>(n,d_mm,nthread,nblock,d_intg)
  mm=d_mm
  call system_clock(t2,ticks)
  time=real(t2-t1)/ticks
  write(*,*) minval(mm),maxval(mm)
  write(*,'((a),es10.3)') 'GraPU version: time',time/(nthread*nblock)
end program main