Multiple small matrix multiplications

Hi, I want to multiply many small matrices, but I can’t seem to get a good speedup.
Even when I compile without -acc, the code is faster than the -acc build.
I don’t know whether the problem is with the data management or not…
Could anyone help me, please?

program matrix_multiply
   use omp_lib
   use openacc
   implicit none
   integer :: i, j, k, m, n, p
   integer :: t1, t2, dt, count_rate, count_max
   integer :: td(4)=(/16,64,256,1024/)
   real, allocatable, dimension(:,:) :: a, b, c
   real :: tmp, secs

   call system_clock(count_max=count_max, count_rate=count_rate)

      n = 20
      allocate( a(n,n), b(n,n), c(n,n) )

!$acc data copyin(a,b) copy(c)
  do m = 1,4

    call system_clock(t1)
    do p = 1,td(m)           ! td={16,64,256,1024}

      ! Initialize matrices
      do j=1,n
         do i=1,n
            a(i,j) = real(i + j)
            b(i,j) = real(i - j)
         enddo
      enddo

      !$acc kernels
      ! Compute matrix multiplication.
      do j=1,n
         do i=1,n
            tmp = 0.0 
            do k=1,n
               tmp = tmp + a(i,k) * b(k,j)
            enddo
            c(i,j) = tmp
         enddo
      enddo
      !$acc end kernels

    enddo

    call system_clock(t2)
    dt = t2-t1
    secs = real(dt)/real(count_rate)
    write(*,*)"For domain_num=",td(m)," wall clock time is ", secs

  enddo
!$acc end data

      deallocate(a, b, c)

end program matrix_multiply

Thanks

Bean

Hi Bean,

The performance issue is that n is very small. Because the GPU has so little work, the overhead of launching kernels on the device dominates the runtime. Try increasing n to something much larger, like 1,000 or even 10,000.

Secondly, the code as written will give you wrong answers. You are updating a and b on the host but never copying the new values to the device. To fix this, either move the loop that initializes a and b to the device, or add an update directive.
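If you go with the update-directive option, a minimal sketch of the fragment (my illustration, not part of the run below) would keep the host initialization and push the new values to the device before the compute region:

      ! Initialize matrices on the host (unchanged)
      do j=1,n
         do i=1,n
            a(i,j) = real(i + j)
            b(i,j) = real(i - j)
         enddo
      enddo
      ! Push the fresh host values into the device copies created by the
      ! enclosing "!$acc data" region, then run the matmul kernels as before.
      !$acc update device(a,b)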

Here’s a full example using the first approach, initializing a and b on the device:

% cat mm.f90
    program matrix_multiply
    use omp_lib
    use openacc
    implicit none
    integer :: i, j, k, m, n, p
    integer :: t1, t2, dt, count_rate, count_max
    integer :: td(4)=(/16,64,256,1024/)
    real, allocatable, dimension(:,:) :: a, b, c
    real :: tmp, secs

    call system_clock(count_max=count_max, count_rate=count_rate)

       n = 1000
       allocate( a(n,n), b(n,n), c(n,n) )

!$acc data create(a,b) copyout(c)
   do m = 1,4

     call system_clock(t1)
     do p = 1,td(m)           ! td={16,64,256,1024}

       ! Initialize matrices
       !$acc kernels
       do j=1,n
          do i=1,n
             a(i,j) = real(i + j)
             b(i,j) = real(i - j)
          enddo
       enddo

       ! Compute matrix multiplication.
       do j=1,n
          do i=1,n
             tmp = 0.0
             do k=1,n
                tmp = tmp + a(i,k) * b(k,j)
             enddo
             c(i,j) = tmp
          enddo
       enddo
       !$acc end kernels

     enddo

     call system_clock(t2)
     dt = t2-t1
     secs = real(dt)/real(count_rate)
     write(*,*)"For domain_num=",td(m)," wall clock time is ", secs

   enddo
 !$acc end data
    print *, C(1,1), C(n,n)
       deallocate(a, b, c)

 end program matrix_multiply

% pgf90 -fast mm.f90 -acc -Minfo=accel -ta=tesla:cc35,autocollapse
matrix_multiply:
     16, Generating create(a(:,:),b(:,:))
         Generating copyout(c(:,:))
     24, Loop is parallelizable
     25, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         24, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         25,   ! blockidx%x threadidx%x collapsed
     32, Loop is parallelizable
     33, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         32, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         33,   ! blockidx%x threadidx%x collapsed
     35, Loop is parallelizable
% setenv PGI_ACC_TIME 1
% ./a.out
 For domain_num=           16  wall clock time is    0.3292500
 For domain_num=           64  wall clock time is     1.328838
 For domain_num=          256  wall clock time is     5.282001
 For domain_num=         1024  wall clock time is     21.20549
   3.3383226E+08  -6.6616608E+08

Accelerator Kernel Timing data
/local/home/colgrove/mm.f90
  matrix_multiply  NVIDIA  devicenum=0
    time(us): 28,065,864
    16: data region reached 1 time
        51: data copyout transfers: 1
             device time(us): total=30 max=30 min=30 avg=30
    23: compute region reached 1360 times
        25: kernel launched 1360 times
            grid: [7813]  block: [128]
             device time(us): total=76,201 max=63 min=56 avg=56
            elapsed time(us): total=111,383 max=328 min=59 avg=81
        33: kernel launched 1360 times
            grid: [7813]  block: [128]
             device time(us): total=27,989,633 max=22,529 min=13,545 avg=20,580
            elapsed time(us): total=27,765,122 max=22,556 min=58 avg=20,415

Hope this helps,
Mat

Hi Mat,

Thanks for your help! It helps me a lot and makes things much clearer.

In my own program, I have to do lots of multiplications of small matrices. Are there any other ways to accelerate this? I’m also thinking of packing these small matrices together into one bigger matrix first and then doing the multiplication, but I’m not sure whether that would actually be better.

Thanks again,

Bean

Hi Bean,

Apologies, I should have read your post more carefully and realized that you want to accelerate many small matrix multiplications.

In the original code, there’s not much that can be done. There’s simply not enough work in a single 20x20 matmul to warrant moving it to the GPU. Of course, if this is just a small portion of a larger application, then it may still be worthwhile, but if this is all your program does, it’s best to stay on the CPU.

However, if I can assume that each matmul is independent of the others, then we can launch many matmuls asynchronously so they run concurrently and fully saturate the GPU. I’ve updated the code to use the OpenACC “async” clause.

Note that there is a warm-up cost to creating the async queues which, given the very small overall runtime, dominates the GPU time. Hence, the GPU version still isn’t as fast as the CPU, but it is much better than the initial run. Again, it’s best to stay on the CPU if this is the entire program, but if more work can be moved to the GPU and the warm-up cost amortized, then it’s worth continuing to explore the GPU.
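If you want to keep that warm-up cost out of the measurement, one option (a hedged sketch of my own, not part of the example below) is to touch each async queue once, inside the data region but before the timed loops, so the queues are created up front:

     ! Warm-up sketch: issue one trivial async kernel per queue so
     ! queue/stream creation is not charged to the first timed run.
     ! The queue count (1024) matches the largest td value; "a" is
     ! already present on the device via the enclosing data region.
     do p = 1,1024
       !$acc kernels async(p)
       a(1,1,p) = 0.0
       !$acc end kernels
     enddo
     !$acc wait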

% cat test.f90
    program matrix_multiply
     use omp_lib
     use openacc
     implicit none
     integer :: i, j, k, m, n, p
     integer :: t1, t2, dt, count_rate, count_max
     integer :: td(4)=(/16,64,256,1024/)
     real, allocatable, dimension(:,:,:) :: a, b, c
     real :: tmp, secs

     call system_clock(count_max=count_max, count_rate=count_rate)

        n = 20
        allocate( a(n,n,1024), b(n,n,1024), c(n,n,1024) )

 !$acc data create(a,b) copyout(c)
    do m = 1,4

      call system_clock(t1)
      do p = 1,td(m)           ! td={16,64,256,1024}

        ! Initialize matrices
        !$acc kernels async(p)
        !$acc loop collapse(2) gang vector(128)
        do j=1,n
           do i=1,n
              a(i,j,p) = real(i + j)
              b(i,j,p) = real(i - j)
           enddo
        enddo

        ! Compute matrix multiplication.
        !$acc loop collapse(2) gang vector(128)
        do j=1,n
           do i=1,n
              tmp = 0.0
              do k=1,n
                 tmp = tmp + a(i,k,p) * b(k,j,p)
              enddo
              c(i,j,p) = tmp
           enddo
        enddo
        !$acc end kernels

      enddo
!$acc wait
      call system_clock(t2)
      dt = t2-t1
      secs = real(dt)/real(count_rate)
      write(*,*)"For domain_num=",td(m)," wall clock time is ", secs

    enddo
  !$acc end data
     print *, C(1,1,1), C(n,n,1024)
        deallocate(a, b, c)

  end program matrix_multiply
% pgfortran test.f90 -o test_cpu.out -fast
% pgfortran -acc -Minfo=accel test.f90 -ta=tesla:cc35 -o test_gpu.out -fast
matrix_multiply:
     16, Generating create(a(:,:,:),b(:,:,:))
         Generating copyout(c(:,:,:))
     23, Generating Tesla code
     25, Loop is parallelizable
     26, Loop is parallelizable
         Accelerator kernel generated
         25, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         26,   ! blockidx%x threadidx%x collapsed
     34, Loop is parallelizable
     35, Loop is parallelizable
         Accelerator kernel generated
         34, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         35,   ! blockidx%x threadidx%x collapsed
     37, Loop is parallelizable
% ./test_cpu.out
 For domain_num=           16  wall clock time is    5.5000000E-04
 For domain_num=           64  wall clock time is    9.8899996E-04
 For domain_num=          256  wall clock time is    1.4390000E-03
 For domain_num=         1024  wall clock time is    5.3679999E-03
    2850.000       -5130.000
% ./test_gpu.out
 For domain_num=           16  wall clock time is    5.4899999E-03
 For domain_num=           64  wall clock time is    3.2889999E-03
 For domain_num=          256  wall clock time is    1.3246000E-02
 For domain_num=         1024  wall clock time is    5.2797999E-02
    2850.000       -5130.000

Accelerator Kernel Timing data
./test.f90
  matrix_multiply  NVIDIA  devicenum=0
    time(us): 12,336
    16: data region reached 1 time
        53: data copyout transfers: 1
             device time(us): total=30 max=30 min=30 avg=30
    23: compute region reached 1360 times
        26: kernel launched 1360 times
            grid: [4]  block: [128]
             device time(us): total=5,418 max=6 min=3 avg=3
            elapsed time(us): total=33,490 max=341 min=22 avg=24
        35: kernel launched 1360 times
            grid: [4]  block: [128]
             device time(us): total=6,888 max=7 min=5 avg=5
            elapsed time(us): total=34,907 max=47 min=24 avg=25

FYI, at least on my system, the point at which the GPU becomes faster than the CPU is when n > 64.

Hope this helps,
Mat

Hi Mat,

It’s really helpful!

Thanks a lot!!!


Bean

Hi,
Because I have to do the matrix multiplication several times, I tried to turn this part into a subroutine. It works when I compile for the CPU.

However, when I compile with “-acc”, my results look strange.
Only c(1,1,1) shows the right answer; the others give different results each run. Moreover, c(n,n,n) sometimes gets the right answer…

so…
Do you know where the problem is?

program test_routine

  implicit none
  integer :: i,j,k,m,n,p
  integer :: t1, t2, dt, count_rate, count_max
  integer :: td(4)=(/16,64,256,1024/)
  real, allocatable, dimension(:,:,:) :: a, b, c
  real :: tmp, secs
  !$acc routine(mm) gang

  call system_clock(count_max=count_max, count_rate=count_rate)

  n = 20

  allocate( a(n,n,1024), b(n,n,1024), c(n,n,1024) )
  
  !$acc data create(a,b) copyout(c(1:n,1:n,1:1024))
  do m = 1,4

    call system_clock(t1)

    do p = 1,td(m) ! td = {16,64,256,1024}

      ! Initialize matrices
      !$acc kernels async(p)
      !$acc loop collapse(2) gang vector(128)
      do j=1,n
         do i=1,n
            a(i,j,p) = real(i + j)
            b(i,j,p) = real(i - j)
         enddo
      enddo
      !$acc end loop 
      ! Compute matrix multiplication

      call mm(a,b,c,n,p,1024)
      !$acc end kernels

    enddo

    !$acc update host(c)

    call system_clock(t2)
    dt = t2-t1
    secs = real(dt)/real(count_rate)
    write(*,*)"For domain_num=",td(m)," wall clock time is ",secs
    write(*,*) c(1,1,1),c(2,2,1),c(2,2,2),c(n,n,n)
  enddo

  deallocate(a, b, c)

end program test_routine

subroutine mm(aa,bb,cc,n,p,td)
  !$acc routine gang
  implicit none
  integer :: i,j,k,n,p,td
  real :: aa(n,n,td),bb(n,n,td),cc(n,n,td)
  real :: tmp

!$acc loop collapse(2) gang vector(128)
! Compute matrix multiplication.
      do j=1,n
         do i=1,n
            tmp = 0.0  ! enables ACC parallelism for k-loop
            do k=1,n
               tmp = tmp + aa(i,k,p) * bb(k,j,p)
            enddo
            cc(i,j,p) = tmp
         enddo
      enddo
!$omp end parallel do

end subroutine mm

Hi Bean,

There are a couple of things going on. Your program has a few small errors, a missing “wait” directive and a missing “end data” directive, but these are easily fixed. The wrong answers appear to be caused by a race condition on the “p” variable.

Since “p” is being passed by reference to “mm”, the compiler must make it global and assume that it gets updated in the callee. Before each compute region is launched, “p” gets updated on the device. However, within an asynchronous context, the value of “p” will change depending upon whichever asynchronous queue updated it last. I added TPR#21745 to see what, if anything, the compiler can do here.

The solutions are either to not use “routine gang” and instead have “mm” use its own compute region, or to pass “p” by value to “mm”. However, to use the “value” attribute, “mm” must have an F90 interface, so you’ll either need to put “mm” in a module or write an explicit interface.
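For reference, here is a minimal sketch of what such an explicit interface could look like (my illustration; the two examples below use a separate compute region and a module instead):

    ! Explicit interface in the caller so "n", "p", and "td" can be
    ! passed by value to the device routine. The declarations must
    ! match the subroutine, which would also declare them with "value".
    interface
       subroutine mm(aa,bb,cc,n,p,td)
         !$acc routine gang
         integer, value :: n,p,td
         real :: aa(n,n,td),bb(n,n,td),cc(n,n,td)
       end subroutine mm
    end interface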

Here are two examples. In the second, I put “mm” in a module and also use “parallel” regions instead of “kernels”. “parallel” is better when calling an OpenACC “routine” since it lets you set the vector length; without “vector_length”, the vector length would be 1, since the compiler has no visibility into how the “routine” is parallelized.

% cat Bean.F90
program test_routine

   implicit none
   integer :: i,j,k,m,n,p
   integer :: t1, t2, dt, count_rate, count_max
   integer :: td(4)=(/16,64,256,1024/)
   real, allocatable, dimension(:,:,:) :: a, b, c
   real :: tmp, secs

   call system_clock(count_max=count_max, count_rate=count_rate)

   n = 20

   allocate( a(n,n,1024), b(n,n,1024), c(n,n,1024) )

   !$acc data create(a,b) copyout(c(1:n,1:n,1:1024))
   do m = 1,4

     call system_clock(t1)

     do p = 1,td(m) ! td = {16,64,256,1024}

       ! Initialize matrices
       !$acc kernels async(p)
       !$acc loop collapse(2) gang vector(128)
       do j=1,n
          do i=1,n
             a(i,j,p) = real(i + j)
             b(i,j,p) = real(i - j)
          enddo
       enddo
       !$acc end loop
       !$acc end kernels

       ! Compute matrix multiplication
       call mm(a,b,c,n,p,1024)

     enddo
     !$acc wait
     !$acc update host(c)

     call system_clock(t2)
     dt = t2-t1
     secs = real(dt)/real(count_rate)
     write(*,*)"For domain_num=",td(m)," wall clock time is ",secs
     write(*,*) c(1,1,1),c(2,2,1),c(2,2,2),c(n,n,n)
   enddo
   !$acc end data
   deallocate(a, b, c)

 end program test_routine

 subroutine mm(aa,bb,cc,n,p,td)
   implicit none
   integer :: i,j,k,n,p,td
   real :: aa(n,n,td),bb(n,n,td),cc(n,n,td)
   real :: tmp

!$acc kernels loop collapse(2) gang vector(128) async(p) &
!$acc  present(aa,bb,cc)
 ! Compute matrix multiplication.
       do j=1,n
          do i=1,n
             tmp = 0.0  ! enables ACC parallelism for k-loop
             do k=1,n
                tmp = tmp + aa(i,k,p) * bb(k,j,p)
             enddo
             cc(i,j,p) = tmp
          enddo
       enddo

 end subroutine mm
% pgf90 Bean.F90 -Minfo=accel -acc
test_routine:
     16, Generating create(a(:,:,:),b(:,:,:))
         Generating copyout(c(1:n,1:n,1:1024))
     26, Loop is parallelizable
     27, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         26, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         27,   ! blockidx%x threadidx%x collapsed
     40, Generating update host(c(:,:,:))
mm:
     59, Generating present(aa(:,:,:),bb(:,:,:),cc(:,:,:))
     62, Loop is parallelizable
     63, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         62, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         63,   ! blockidx%x threadidx%x collapsed
     65, Loop is parallelizable
% a.out
 For domain_num=           16  wall clock time is    2.2085000E-02
    2850.000        2790.000        2790.000        0.000000
 For domain_num=           64  wall clock time is    1.0190000E-03
    2850.000        2790.000        2790.000       -5130.000
 For domain_num=          256  wall clock time is    2.6860000E-03
    2850.000        2790.000        2790.000       -5130.000
 For domain_num=         1024  wall clock time is    1.0812000E-02
    2850.000        2790.000        2790.000       -5130.000



% cat Bean2.F90
module mod_mm

contains
 subroutine mm(aa,bb,cc,n,p,td)
   !$acc routine gang
   implicit none
   integer,value :: n,p,td
   integer :: i,j,k
   real :: aa(n,n,td),bb(n,n,td),cc(n,n,td)
   real :: tmp
 !$acc loop collapse(2) gang vector
 ! Compute matrix multiplication.
      do j=1,n
          do i=1,n
             tmp = 0.0  ! enables ACC parallelism for k-loop
  !$acc loop seq
             do k=1,n
                tmp = tmp + aa(i,k,p) * bb(k,j,p)
             enddo
             cc(i,j,p) = tmp
          enddo
       enddo

 end subroutine mm
end module mod_mm

program test_routine
   use mod_mm
   implicit none
   integer :: i,j,k,m,n,p
   integer :: t1, t2, dt, count_rate, count_max
   integer :: td(4)=(/16,64,256,1024/)
   real, allocatable, dimension(:,:,:) :: a, b, c
   real :: tmp, secs

   call system_clock(count_max=count_max, count_rate=count_rate)

   n = 20

   allocate( a(n,n,1024), b(n,n,1024), c(n,n,1024) )
   c=0.0
   !$acc data create(a,b) copyin(c(1:n,1:n,1:1024))
   do m = 1,4

     call system_clock(t1)

     do p = 1,td(m) ! td = {16,64,256,1024}

       ! Initialize matrices
       !$acc parallel loop collapse(2) gang vector async(p)
       do j=1,n
          do i=1,n
             a(i,j,p) = real(i + j)
             b(i,j,p) = real(i - j)
             c(i,j,p) = 0.0
          enddo
       enddo
       !$acc end loop

       ! Compute matrix multiplication
       !$acc parallel num_gangs(n) vector_length(128) async(p)
       call mm(a,b,c,n,p,1024)
       !$acc end parallel
     enddo
     !$acc wait
     !$acc update host(c)

     call system_clock(t2)
     dt = t2-t1
     secs = real(dt)/real(count_rate)
     write(*,*)"For domain_num=",td(m)," wall clock time is ",secs
     write(*,*) c(1,1,1),c(2,2,2),c(10,10,10),c(n,n,td(m))
   enddo
   !$acc end data
   deallocate(a, b, c)

 end program test_routine

% pgf90 Bean2.F90 -Minfo=accel -acc
mm:
      4, Generating acc routine gang
         Generating Tesla code
         13, !$acc loop gang, vector collapse(2) ! blockidx%x threadidx%x
         14,   ! blockidx%x threadidx%x collapsed
     13, Loop is parallelizable
     14, Loop is parallelizable
     17, Loop is parallelizable
test_routine:
     42, Generating create(a(:,:,:),b(:,:,:))
         Generating copyin(c(1:n,1:n,1:1024))
     50, Accelerator kernel generated
         Generating Tesla code
         51, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
         52,   ! blockidx%x threadidx%x collapsed
     61, Accelerator kernel generated
         Generating Tesla code
     66, Generating update host(c(:,:,:))
% a.out
 For domain_num=           16  wall clock time is    4.4800001E-03
    2850.000        2790.000        870.0000       -5130.000
 For domain_num=           64  wall clock time is    1.0010001E-03
    2850.000        2790.000        870.0000       -5130.000
 For domain_num=          256  wall clock time is    3.1010001E-03
    2850.000        2790.000        870.0000       -5130.000
 For domain_num=         1024  wall clock time is    1.3404000E-02
    2850.000        2790.000        870.0000       -5130.000