why doesn't this dog run?

Peter_Nightingale · March 3, 2011, 4:50pm

I have a whole collection of programs that simply do not perform on the GPU. I guess that I’m doing the same thing wrong systematically, but I cannot figure out what it is. This bench mark (see below) generates a number of arrays of “random” integers on the GPU, one for each thread. Sorts them twice, ascending and descending, and returns the largest and smallest values to the host. I compare the speed of the sort on the GPU with the same when I run this program in emulation mode on the CPU.

The GPU version runs more than two orders of magnitude slower for each individual sort. This makes no sense at all given that the ratio of CPU and GPU (Tesla) clock speeds is about three.

Compile with:

pgfortran -ta=nvidia out.CUF -fast -o sortGPU

Run with:

time sortGPU 1000000 32 1

Result:

n,nthread,nblock 1000000 32 1
0
0
0
-9223370653486555947 9223369301284663641
GraPU version: time 2.387E+00 [Note this time has been divided by nthread*nblock]

Time spent in user mode (CPU seconds) : 76.392s

Compile with:

pgfortran -Mcuda=emu out.CUF -fast -o sortCPU[

Run with:

time sortCPU 1000000 32 1

Result

n,nthread,nblock 1000000 32 1
0
0
0
-9223370653486555947 9223369301284663641
GraPU version: time 6.444E-02

Time spent in user mode (CPU seconds) : 12.370s

Note that a 32-fold parallel GPU run takes 76 second, and a 32 times serial CPU run 12 seconds.

Although I have not done so for this program, I have looked at the results of the profiler in other cases to see if by any chance I was constantly copying from host to device, but I could not find anything that indicates that this was happening.

Here is the infamous program:

module types
  integer, parameter :: INTX=8
end module types
module sort
use types
  use cudafor
  use types
  implicit none
contains
attributes(device) subroutine sort_int(n,int,isign)
  ! sort_int: purpose: sort array in ascending or descending order
  use types
  implicit none
  integer :: n,isign
  integer(INTX) :: int(n)
  logical :: precedes
  integer(INTX) :: i1,i2,int1
  integer :: l,ir,i,j
  precedes(i1,i2)=i1 .le. i2
  if(n .le. 1) return
  l=n/2+1
  ir=n
  if(isign .ge. 0) then ! ascending sort
10 continue
     if(l.gt.1)then
        l=l-1
        int1=int(l)
     else
        int1=int(ir)
        int(ir)=int(1)
        ir=ir-1
        if(ir.eq.1) then
           int(1)=int1
           goto 30
        endif
     endif
     i=l
     j=l+l
20 if(j.le.ir) then
        if(j.lt.ir) then
           if(precedes(int(j),int(j+1))) j=j+1
        endif
        if(precedes(int1,int(j))) then
           int(i)=int(j)
           i=j
           j=j+j
        else
           j=ir+1
        endif
        go to 20
     endif
     int(i)=int1
     go to 10
  else ! descending sort
11 continue
     if(l.gt.1)then
        l=l-1
        int1=int(l)
     else
        int1=int(ir)
        int(ir)=int(1)
        ir=ir-1
        if(ir.eq.1) then
           int(1)=int1
           goto 30
        endif
     endif
     i=l
     j=l+l
21 if(j.le.ir) then
        if(j.lt.ir) then
           if(.not.precedes(int(j),int(j+1))) j=j+1
        endif
        if(.not.precedes(int1,int(j))) then
           int(i)=int(j)
           i=j
           j=j+j
        else
           j=ir+1
        endif
        go to 21
     endif
     int(i)=int1
     go to 11
  endif
30 continue
end subroutine sort_int
  attributes(global) subroutine bench(n,mm,nthread,nblock,intg)
    integer, value :: n,nthread,nblock
    integer(INTX) :: mm(4,nthread,nblock)
    integer(INTX) :: intg(n,nthread)
    integer :: i,isign,j
    j=threadidx%x
    intg(1,j)=j
    do i=2,n
       intg(i,j)=intg(i-1,j)*3
    end do
    isign=1
    call sort_int(n,intg(1,threadidx%x),isign)
    mm(1,threadidx%x,blockidx%x)=intg(1,threadidx%x)
    mm(2,threadidx%x,blockidx%x)=intg(n,threadidx%x)
    isign=-1
    call sort_int(n,intg(1,threadidx%x),isign)
    mm(3,threadidx%x,blockidx%x)=intg(1,threadidx%x)
    mm(4,threadidx%x,blockidx%x)=intg(n,threadidx%x)
  end subroutine bench
end module sort
program main
  use types
  use sort
  implicit none
  integer :: i,n,t1,t2,ticks,nthread,nblock,istat
  real :: time
  character*20 :: input
  integer(INTX), allocatable :: mm(:,:,:)
  integer(INTX), device, allocatable :: d_mm(:,:,:)
  integer(INTX), device, allocatable :: d_intg(:,:)
  integer(INTX), allocatable :: intg(:,:)
  call getarg(1,input)
  read(input,*) n
  nthread=1
  nblock=1
  call getarg(2,input)
  read(input,*) nthread
  call getarg(3,input)
  read(input,*) nblock
  write(*,*) 'n,nthread,nblock',n,nthread,nblock
  allocate(d_intg(n,nthread),stat=istat)
  write(*,*) istat
  allocate(d_mm(4,nthread,nblock),stat=istat)
  write(*,*) istat
  allocate(mm(4,nthread,nblock),stat=istat)
  write(*,*) istat
  allocate(intg(n,nthread))
  d_mm=0
  d_intg=0
  call system_clock(t1)
  call bench<<<nblock,nthread>>>(n,d_mm,nthread,nblock,d_intg)
  mm=d_mm
  d_intg=0
  call system_clock(t1)
  call bench<<<nblock,nthread>>>(n,d_mm,nthread,nblock,d_intg)
  mm=d_mm
  call system_clock(t2,ticks)
  time=real(t2-t1)/ticks
  write(*,*) minval(mm),maxval(mm)
  write(*,'((a),es10.3)') 'GraPU version: time',time/(nthread*nblock)
end program main

[/quote]

MatColgrove · March 3, 2011, 7:55pm

Hi Peter,

The poor performance is due to non-contiguous memory access of intg and m which leads to memory divergence. Dr. Wolfe has a good primer on understanding the CUDA data parallel model which might be helpful in understanding the problem. (Account Login | PGI)

Basically, all threads running on an Streaming Multiprocessor (i.e. a Warp) issue the same instructions at the same time. If you memory is organized so that the data is contiguous across threads (i.e. the threadidx should be used to index the first dimension) then all threads can fetch the memory at the same time in a single block. Otherwise, each thread needs to get it’s own memory while the other threads wait.

So to fix your code, you need to reorganize your data so that threadidx accesses the first dimensions of your arrays.

Something like:

  attributes(global) subroutine bench(n,mm,nthread,nblock,intg)
    integer, value :: n,nthread,nblock
    integer(INTX) :: mm(nthread,nblock,4)
    integer(INTX) :: intg(nthread,n)
    integer :: i,isign,j
    j=threadidx%x
    intg(j,1)=j
    do i=2,n
       intg(j,i)=intg(j,i-1)*3
    end do
    isign=1
    call sort_int(n,intg(threadidx%x,1),isign)
    mm(threadidx%x,blockidx%x,1)=intg(threadidx%x,1)
    mm(threadidx%x,blockidx%x,2)=intg(threadidx%x,n)
    isign=-1
    call sort_int(n,intg(threadidx%x,1),isign)
    mm(threadidx%x,blockidx%x,3)=intg(threadidx%x,1)
    mm(threadidx%x,blockidx%x,4)=intg(threadidx%x,n)
  end subroutine bench

Also, increasing the number of blocks and the block size will help improve the performance relative to the CPU.

Hope this helps,
Mat

Peter_Nightingale · March 3, 2011, 8:32pm

Mat,

Thanks for your reply. I will have to read the reference you sent, but I immediately have a simple question.

If I run with one thread, admittedly is not optimal use of the GPU, the argument of the memory conflict of threads no longer holds, unless somehow one thread manages to conflict with itself or is searching for its lost brethren.

The timings I get with one thread are as follows:
CPU mode: 5 seconds for an array of length 10,000,000
GPU mode: 300 seconds same length

Peter.

MatColgrove · March 3, 2011, 9:25pm

Hi Peter,

A single GPU ALU is actually fairly slow when compared to a x86 CPU. The speed-up from a GPU comes from using hundreds of them at a time. So it’s not unexpected to see this type of difference if using only a single thread.

Mat

Peter_Nightingale · March 4, 2011, 7:41pm

Mat, first of all, thanks for you help and patience!

I think that I did what you suggested; the new program is at the bottom of this post. I also corrected a shortcut I took before, so that now the program works correctly for more than one block. I added a block dimension to the arrays that are being sorted; maybe there is better solution, but I would not know what it is. The extra dimension of course reduces the maximum size of the arrays with which one can deal, and that is a serious limitation.

I’ve played around with numbers of threads and blocks but I cannot get the GPU version to run more than 5 times as fast as the emulated CPU version. Given that our system has four CPUs that can be used in parallel with far fewer special tricks, this makes GPU programming an unjustifiable waste of time, if indeed this is the best one can do with cudafortran.

The reason for my saying this is that I’m trying to figure out whether I should advise my students to learn and use GPU techniques with all their serious current limitations, not to mention bleeding edge effects. Right now my impression is that they would be much more productive without GPUs.

It’s not a conclusion I like, because it would mean that I spent a lot of time money for nothing, but I’m optimistic, and there still is a high probability that I’m doing something stupid.

Here is the latest incarnation of the dog that won’t run:

module types
  integer, parameter :: INTX=8
end module types
module sort
use types
  use cudafor
  use types
  implicit none
contains
attributes(device) subroutine sort_int(m,n0,n,int,isign)
  ! sort_int: purpose: sort array int(m,:) in ascending or descending order
  use types
  implicit none
  integer :: m,n0,n,isign,stride
  integer(INTX) :: int(n0,n)
  logical :: precedes
  integer(INTX) :: i1,i2,int1
  integer :: l,ir,i,j
  precedes(i1,i2)=i1 .le. i2
  if(n .le. 1) return
  l=n/2+1
  ir=n
  if(isign .ge. 0) then ! ascending sort
10 continue
     if(l.gt.1)then
        l=l-1
        int1=int(m,l)
     else
        int1=int(m,ir)
        int(m,ir)=int(m,1)
        ir=ir-1
        if(ir.eq.1) then
           int(m,1)=int1
           goto 30
        endif
     endif
     i=l
     j=l+l
20 if(j.le.ir) then
        if(j.lt.ir) then
           if(precedes(int(m,j),int(m,j+1))) j=j+1
        endif
        if(precedes(int1,int(m,j))) then
           int(m,i)=int(m,j)
           i=j
           j=j+j
        else
           j=ir+1
        endif
        go to 20
     endif
     int(m,i)=int1
     go to 10
  else ! descending sort
11 continue
     if(l.gt.1)then
        l=l-1
        int1=int(m,l)
     else
        int1=int(m,ir)
        int(m,ir)=int(m,1)
        ir=ir-1
        if(ir.eq.1) then
           int(m,1)=int1
           goto 30
        endif
     endif
     i=l
     j=l+l
21 if(j.le.ir) then
        if(j.lt.ir) then
           if(.not.precedes(int(m,j),int(m,j+1))) j=j+1
        endif
        if(.not.precedes(int1,int(m,j))) then
           int(m,i)=int(m,j)
           i=j
           j=j+j
        else
           j=ir+1
        endif
        go to 21
     endif
     int(m,i)=int1
     go to 11
  endif
30 continue
end subroutine sort_int
  attributes(global) subroutine bench(n,mm,nthread,nblock,intg)
    integer, value :: n,nthread,nblock
    integer(INTX) :: mm(4,nthread,nblock)
    integer(INTX) :: intg(nthread,n,nblock)
    integer :: i,isign,j
    j=threadidx%x
    intg(j,1,blockidx%x)=j
    do i=2,n
       intg(j,i,blockidx%x)=intg(j,i-1,blockidx%x)*3
    end do
    isign=1
    call sort_int(threadidx%x,nthread,n,intg(1,1,blockidx%x),isign)
    mm(1,threadidx%x,blockidx%x)=intg(1,threadidx%x,blockidx%x)
    mm(2,threadidx%x,blockidx%x)=intg(n,threadidx%x,blockidx%x)
    isign=-1
    call sort_int(threadidx%x,nthread,n,intg(1,1,blockidx%x),isign)
    mm(3,threadidx%x,blockidx%x)=intg(1,threadidx%x,blockidx%x)
    mm(4,threadidx%x,blockidx%x)=intg(n,threadidx%x,blockidx%x)
  end subroutine bench
end module sort
program main
  use types
  use sort
  implicit none
  integer :: i,n,t1,t2,ticks,nthread,nblock,istat
  real :: time
  character*20 :: input
  integer(INTX), allocatable :: mm(:,:,:)
  integer(INTX), device, allocatable :: d_mm(:,:,:)
  integer(INTX), device, allocatable :: d_intg(:,:,:)
  call getarg(1,input)
  read(input,*) n
  nthread=1
  nblock=1
  call getarg(2,input)
  read(input,*) nthread
  call getarg(3,input)
  read(input,*) nblock
  write(*,*) 'n,nthread,nblock',n,nthread,nblock
  allocate(d_intg(nthread,n,nblock),stat=istat)
  if(istat /= 0) then
     write(*,*) 'allocation of d_intg failed'
     stop 'allocation of d_intg failed'
  end if
  allocate(d_mm(4,nthread,nblock),stat=istat)
  if(istat /= 0) then
     write(*,*) 'allocation of d_mm failed'
     stop 'allocation of d_mm failed'
  end if
  allocate(mm(4,nthread,nblock),stat=istat)
  if(istat /= 0) then
     write(*,*) 'allocation of mm failed'
     stop 'allocation of mm failed'
  end if
  call system_clock(t1)
  call bench<<<nblock,nthread>>>(n,d_mm,nthread,nblock,d_intg)
  mm=d_mm
  call system_clock(t2,ticks)
  time=real(t2-t1)/ticks
  write(*,*) minval(mm),maxval(mm)
  write(*,'((a),es10.3)') 'elapsed time',time
end program main

MatColgrove · March 4, 2011, 10:32pm

Hi Peter,

Given that our system has four CPUs that can be used in parallel with far fewer special tricks, this makes GPU programming an unjustifiable waste of time, if indeed this is the best one can do with cudafortran.

This is the 64k question. Does the performance on a GPU justify the cost to reprogram my application? There is no easy answer.

Some things to consider:

GPU programing can be difficult at first but does get easier.
There are other options such the PGI Accelerator Model which make GPU programing easier.
HPC hardware in general is moving towards massively parallel architectures so investments in reprogramming that may not be beneficial in the short term, may be very beneficial as hardware evolves.

For this program I fixed what looked like an error (the assignment of intg to mm has the old indexing) and modified mm so that threadidx%x is used in the first dimension. Minor issues, but even with this change, my times aren’t any better.

Using the inputs of n=1000000 nthreads=32 nblocks=15 (480,000,000 values), my CPU time is 3:11 and the GPU time is 1:12.

What happens if we use approximately same number of values but make it massively parallel? Setting n=60, nthreads=128, nblocks=64000 (491,520,000 values) gives a CPU time of 1:08 and a GPU time of 0:04. A much better ratio.

The some keys to high performance on a GPU are:

Minimize the amount of data and number of transfer between host and accelerator.
Maximize the bandwidth of data transfers between host and accelerator.
Use lots of MIMD Parallelization to fill the multiprocessors.
Use lots of SIMD Parallelization to fill the cores on each multiprocessors.
Even more MIMD Parallelization to fill multi-threading parallelism to hide the long device memory latency.
Minimize the frequency threads access device memory.
Optimize the stride-1 vector dimension.
Utilize CUDA “shared” memory.

Mat

module types
  integer, parameter :: INTX=8
end module types
module sort
use types
  use cudafor
  use types
  implicit none
contains
attributes(device) subroutine sort_int(m,n0,n,int,isign)
  ! sort_int: purpose: sort array int(m,:) in ascending or descending order
  use types
  implicit none
  integer :: m,n0,n,isign,stride
  integer(INTX) :: int(n0,n)
  logical :: precedes
  integer(INTX) :: i1,i2,int1
  integer :: l,ir,i,j
  precedes(i1,i2)=i1 .le. i2
  if(n .le. 1) return
  l=n/2+1
  ir=n
  if(isign .ge. 0) then ! ascending sort
10 continue
     if(l.gt.1)then
        l=l-1
        int1=int(m,l)
     else
        int1=int(m,ir)
        int(m,ir)=int(m,1)
        ir=ir-1
        if(ir.eq.1) then
           int(m,1)=int1
           goto 30
        endif
     endif
     i=l
     j=l+l
20 if(j.le.ir) then
        if(j.lt.ir) then
           if(precedes(int(m,j),int(m,j+1))) j=j+1
        endif
        if(precedes(int1,int(m,j))) then
           int(m,i)=int(m,j)
           i=j
           j=j+j
        else
           j=ir+1
        endif
        go to 20
     endif
     int(m,i)=int1
     go to 10
  else ! descending sort
11 continue
     if(l.gt.1)then
        l=l-1
        int1=int(m,l)
     else
        int1=int(m,ir)
        int(m,ir)=int(m,1)
        ir=ir-1
        if(ir.eq.1) then
           int(m,1)=int1
           goto 30
        endif
     endif
     i=l
     j=l+l
21 if(j.le.ir) then
        if(j.lt.ir) then
           if(.not.precedes(int(m,j),int(m,j+1))) j=j+1
        endif
        if(.not.precedes(int1,int(m,j))) then
           int(m,i)=int(m,j)
           i=j
           j=j+j
        else
           j=ir+1
        endif
        go to 21
     endif
     int(m,i)=int1
     go to 11
  endif
30 continue
end subroutine sort_int
  attributes(global) subroutine bench(n,mm,nthread,nblock,intg)
    integer, value :: n,nthread,nblock
    integer(INTX) :: mm(nthread,4,nblock)
    integer(INTX) :: intg(nthread,n,nblock)
    integer :: i,isign,j
    j=threadidx%x
    intg(j,1,blockidx%x)=j
    do i=2,n
       intg(j,i,blockidx%x)=intg(j,i-1,blockidx%x)*3
    end do
    isign=1
    call sort_int(threadidx%x,nthread,n,intg(1,1,blockidx%x),isign)
    mm(threadidx%x,1,blockidx%x)=intg(threadidx%x,1,blockidx%x)
    mm(threadidx%x,2,blockidx%x)=intg(threadidx%x,n,blockidx%x)
    isign=-1
    call sort_int(threadidx%x,nthread,n,intg(1,1,blockidx%x),isign)
    mm(threadidx%x,3,blockidx%x)=intg(threadidx%x,1,blockidx%x)
    mm(threadidx%x,4,blockidx%x)=intg(threadidx%x,n,blockidx%x)
  end subroutine bench
end module sort
program main
  use types
  use sort
  implicit none
  integer :: i,n,t1,t2,ticks,nthread,nblock,istat
  real :: time
  character*20 :: input
  integer(INTX), allocatable :: mm(:,:,:)
  integer(INTX), device, allocatable :: d_mm(:,:,:)
  integer(INTX), device, allocatable :: d_intg(:,:,:)
  call getarg(1,input)
  read(input,*) n
  nthread=1
  nblock=1
  call getarg(2,input)
  read(input,*) nthread
  call getarg(3,input)
  read(input,*) nblock
  write(*,*) 'n,nthread,nblock',n,nthread,nblock
  allocate(d_intg(nthread,n,nblock),stat=istat)
  if(istat /= 0) then
     write(*,*) 'allocation of d_intg failed'
     stop 'allocation of d_intg failed'
  end if
  allocate(d_mm(nthread,4,nblock),stat=istat)
  if(istat /= 0) then
     write(*,*) 'allocation of d_mm failed'
     stop 'allocation of d_mm failed'
  end if
  allocate(mm(nthread,4,nblock),stat=istat)
  if(istat /= 0) then
     write(*,*) 'allocation of mm failed'
     stop 'allocation of mm failed'
  end if
  call system_clock(t1)
  call bench<<<nblock,nthread>>>(n,d_mm,nthread,nblock,d_intg)
  mm=d_mm
  call system_clock(t2,ticks)
  time=real(t2-t1)/ticks
  write(*,*) minval(mm),maxval(mm)
  write(*,'((a),es10.3)') 'elapsed time',time
end program main

Peter_Nightingale · March 7, 2011, 2:31pm

Mat,

Thanks, once again!

You wirte:

What happens if we use approximately same number of values but make it massively parallel? Setting n=60, nthreads=128, nblocks=64000 (491,520,000 values) gives a CPU time of 1:08 and a GPU time of 0:04. A much better ratio.

Unfortunately, that choice of parameters (n small and nblocks very large) does not model the kind of computations we are trying to do, namely sorting large arrays of size n and collecting a small number of characteristics of these arrays.

Actually, it’s for precisely this reason that I did not bother rearranging the indices of the mm(4,nthread,nblock) array, but did follow your advice for intg(nthread,n,nblock); 4 is supposed to be much smaller than n, i.e., the memory conflicts caused by mm do not matter.

Nevertheless, there may be a lesson in your n=60 test, if only I could figure out what it is. I am not really surprised that a GPU is not good at sorting big arrays in parallel with old algorithms. It does not seem to be anything like what GPUs were designed to do: too many conditionals, branches, etc. Clearly, a different sorting algorithm is required if it’s to run efficiently on a GPU.

All in all, even though this is just a simplistic test, the results are extremely disappointing and I see no way out of the woods.

MatColgrove · March 7, 2011, 5:58pm

Hi Peter,

Since N needs to be very large and therefore where you to parallelize, I wondering if the if the algorithm can be changed with this in mind? In particular, could the one kernel be broken up into multiple kernels so that you can perform the sort in parallel? While I haven’t implemented a parallel sorting algorithm myself, there are several examples in CUDA C that you might be able to apply to CUDA Fortran.

Mat

Peter_Nightingale · March 8, 2011, 12:19am

Mat,

You wrote

Since N needs to be very large and therefore where you to parallelize, I wondering if the if the algorithm can be changed with this in mind?

That thought had occurred to me too, but I cannot even get the most trivial program to produce a decent speed up, which makes it hard to predict what would and what would not work.

At the bottom of this message is my latest contribution from the Land of the Lost Puppies. Once again a whole slow of anemic results: a factor 5 speed up at most, which amounts to nothing if you are using only one of four CPUs.

Cflops= 2.178E+03 n= 1.600E+06 nthread= 32 nblock= 256 tot= 1.311E+10
Gflops= 1.081E+04 n= 1.600E+06 nthread= 32 nblock= 256 tot= 1.311E+10
Cflops= 2.325E+03 n= 8.000E+05 nthread= 64 nblock= 256 tot= 1.311E+10
Gflops= 1.203E+04 n= 8.000E+05 nthread= 64 nblock= 256 tot= 1.311E+10
Cflops= 2.310E+03 n= 4.000E+05 nthread= 128 nblock= 256 tot= 1.311E+10
Gflops= 1.214E+04 n= 4.000E+05 nthread= 128 nblock= 256 tot= 1.311E+10
Cflops= 2.309E+03 n= 2.000E+05 nthread= 256 nblock= 256 tot= 1.311E+10
Gflops= 1.223E+04 n= 2.000E+05 nthread= 256 nblock= 256 tot= 1.311E+10
Cflops= 2.293E+03 n= 1.000E+05 nthread= 512 nblock= 256 tot= 1.311E+10
Gflops= 1.225E+04 n= 1.000E+05 nthread= 512 nblock= 256 tot= 1.311E+10
Cflops= 2.277E+03 n= 5.000E+04 nthread= 1024 nblock= 256 tot= 1.311E+10
Gflops= 1.215E+04 n= 5.000E+04 nthread= 1024 nblock= 256 tot= 1.311E+10
Cflops= 2.248E+03 n= 2.499E+04 nthread= 1024 nblock= 512 tot= 1.310E+10
Gflops= 1.204E+04 n= 2.499E+04 nthread= 1024 nblock= 512 tot= 1.310E+10
Cflops= 2.193E+03 n= 1.250E+04 nthread= 1024 nblock= 512 tot= 6.552E+09
Gflops= 6.350E+03 n= 1.250E+04 nthread= 1024 nblock= 512 tot= 6.552E+09

Gflops is the number of floating point operations per second on the GPU; Cflops the same on the CPU. tot is the total number of floating point operations. Of course, I’m assuming that the one loop that does the computing is sufficiently unrolled that the loop overhead can be ignored.

Of course, the following is not a serious program, but however that may be, I can’t think of anything that would come closer to satisfying all the criteria for GPu success that you listed earlier in this exchange.

!pgfortran -ta=nvidia -O3 speed-transpose.CUF -o speedGPU
!pgfortran -Mcuda=emu -O3 speed-transpose.CUF -o speedCPU

module d_sub
  use cudafor
  integer, parameter :: INTEGER8=8,INTEGER4=4 
contains
  attributes(global) subroutine crank(seed,n,nthread,nblock)
   implicit none
    integer(INTEGER8) :: seed(nblock,nthread)
    integer(INTEGER8) :: seedl
    integer, value :: nthread,nblock
    integer(8), value :: n
    integer(INTEGER4) :: x
    integer(8) :: i,j
    seedl=seed(blockidx%x,threadidx%x)
    do i=1,n
       !do i=1,n/16
       !  do j=1,16
       x=cong(seedl)
       !  end do
    end do
    seed(blockidx%x,threadidx%x)=seedl
  end subroutine crank
  attributes(device) function cong(seed)
    implicit none
    integer(INTEGER4) :: cong
    integer(INTEGER8) :: seed
    seed=123456789_INTEGER8*seed+987654321_INTEGER8
    ! cong=ishft(seed,-32)
    cong=seed
  end function cong

  subroutine callcrank(n,nthread,nblock)
    implicit none
    integer(INTEGER8), allocatable, device :: d_seed(:,:)
    integer(INTEGER8),allocatable :: seed(:,:)
    integer :: nthread,nblock,istat
    integer(8) :: n
    allocate(seed(nblock,nthread),d_seed(nblock,nthread),stat=istat)
    if(istat /= 0) stop 'allocation failed'
    d_seed=1
    call crank<<<nblock,nthread>>>(d_seed,n,nthread,nblock)
    istat=cudathreadsynchronize()
    seed=d_seed
    write(*,*) 'istat,seed',istat,seed(1,1)
  end subroutine callcrank
end module d_sub

program main
 use d_sub
 implicit none
 integer :: t1,t2,ticks,nthread,nblock,scale
 integer(8) :: n
 real :: time,tot
 character*20 :: input
 call getarg(1,input)
 read(input,*) n
 n=(n/16)*16
 call getarg(2,input)
 read(input,*) nthread
 call getarg(3,input)
 read(input,*) nblock
 call system_clock(t1)
 call callcrank(n,nthread,nblock)
 call system_clock(t2,ticks)
 time=real(t2-t1)/ticks
 tot=real(n)*nthread*nblock
 write(*,1000) 'flops=',tot/(10**6*time),&
      ' n=',real(n),&
      ' nthread=',nthread,&
      ' nblock=',nblock,&
      ' tot=',tot
1000 format((a),es10.3,(a),es10.3,(a),i5,(a),i5,(a),es10.3)
end program main

MatColgrove · March 8, 2011, 7:48pm

Hi Peter,

While not every program will work well on a GPU, I wouldn’t give up just yet.

What I would work on “crank” and try to parallelize the inner do loop. One way would be to add an extra dimension to seed to hold the N values and then perform a parallel reduction if you need to collapse this dimension. The problem here is that you may run out of memory and/or GPU threads if N is too large.

Another strategy would be to simplify the problem by removing the ‘nblock’ and ‘nthread’ altogether and have seed be a single dimension of ‘n’ elements. Have each thread work on a single ‘cong’ operation (though you’ll want to actual kernel perform more operations). For very large values of N, you may need use both the X and Y grid dimensions as well as the the X, Y, and Z block dimensions. The max number of GPU threads is a grid of 64K by 64K blocks with a block having a max of 512 of 1024 threads depending upon your device.

If I have time later, I’ll try and write-up an example. Though, I’m a bit swamped right now.

Mat

Peter_Nightingale · March 8, 2011, 8:31pm

While not every program will work well on a GPU, I wouldn’t give up just yet.

Unfortunately, I still have not been able to find anything at all that works. All these little programs that I have been cranking out are just simpleminded models of the real non-functioning program I wrote. However that may be, I really appreciate the attention you are giving to them!

I may have misunderstood something in the solution you propose, but I do not think that this one loop can be parallelized without changing the mathematics. The loop is of the form
do i=1,n
x=a*x+b
end do
which precludes vectorization and parallelization for the same reason, the x recursion.

Mathematically, in this particular, simple example the problem can be solved by computing the two-, three- or arbitrary number of steps back recursion, but in general that is not something one can count on.

Changing the number of indices or rearranging them will not work either: the loop that does all the work is performed on the unindexed local x, which I would expect lives in a register; you can’t beat that, it would seem. The stuff with the indices happens O(1) times out of an O(n) computation and therefore is irrelevant.

If indeed the trouble is that this loop cannot be vectorized, my experience going back to the 80s does not bode well; I may simply be re-discovering that vectorization is a loosing strategy for most of the physics we tend to be interested in.

MatColgrove · March 9, 2011, 4:13pm

Hi Peter,

You’re right. I just missed the backwards dependency in the second example since it’s in the cong function.

I’ll will pass your code around to see if anyone else here has some ideas.

Mat