matrix multiplication with some modification

Dear all:

I modified the matrix multiplication example from the “CUDA Fortran Programming Guide and Reference”, as shown below, so that it works for arbitrary dimensions.
Something seems to be wrong when I test it with two matrices, Adev(568,568) and Bdev(568,2902).
There are always errors larger than 1.E-3… :(
The grid and block dimensions were:

dimGrid = dim3( (568-1)/16+1, (2902-1)/16+1, 1 )
dimBlock = dim3( 16, 16, 1 )

How should I modify my code?
Thank you in advance.

Feng

    attributes(global) subroutine gpu_cal_coef( Adev, Bdev, Cdev, NB, M, L)
    implicit none
       integer, value :: NB, M, L
       real*8, device :: Adev(NB,M), Bdev(M,L), Cdev(NB,L)
       integer, device :: i, j, kb, k, tx, ty
       real*8, shared :: Asub(16,16), Bsub(16,16)
       real*8, device :: Cij

! Start execution, first get my thread indices
       tx = threadidx%x
       ty = threadidx%y

! This thread computes C(i,j) = sum(A(i,:) * B(:,j))
       i = (blockidx%x-1)*16 + tx
       j = (blockidx%y-1)*16 + ty

       Cij = 0.d0

       do kb = 1, M, 16
          if (i<=NB .and. kb+ty-1<=M)then        !<--modification
            Asub(tx,ty) = Adev(i,kb+ty-1)
          else
            Asub(tx,ty) = 0.d0                            !<--modification
          end if
		  
          if (kb+tx-1<=M .and. j<=L)then          !<--modification
            Bsub(tx,ty) = Bdev(kb+tx-1,j)
          else
            Bsub(tx,ty) = 0.d0                            !<--modification
          end if

          call syncthreads()

          do k = 1,16
             Cij = Cij + Asub(tx,k)*Bsub(k,ty)
          enddo
          call syncthreads()

       enddo
       Cdev(i,j) = Cij

   end subroutine gpu_cal_coef

Hi Feng,

567/16 = 35 blocks. 35 blocks times 16 threads per block is only 560 elements.

Since this is integer division, if the number of elements is not evenly divisible by the number of threads in a block, you need to round up.

dimGrid = dim3( (568+15)/16, (2902+15)/16, 1 )

You then need to make sure you have guards which skip the excess threads (which it looks like you have).

- Mat

Hi, Mat

Thank you for the reminder.
The number of blocks is (567/16)+1 = 36, so there are 576 threads in that direction, more than the 568 rows.
I have no idea what went wrong. :(

Feng
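
The block count in this reply can be checked on the host. Below is a minimal sketch in standard Fortran, assuming a positive element count; the module and function names (grid_size, blocks_round_up, blocks_minus_one) are made up for illustration and do not come from the Programming Guide. It suggests that the two round-up forms used in this thread, (n+15)/16 and (n-1)/16+1, give the same number of blocks.

```fortran
module grid_size
  implicit none
contains

  ! Smallest number of blocks of nthreads threads that covers n elements.
  ! Assumes n >= 1 and nthreads >= 1 (hypothetical helper, not a CUDA API).
  pure integer function blocks_round_up(n, nthreads)
    integer, intent(in) :: n, nthreads
    blocks_round_up = (n + nthreads - 1) / nthreads
  end function blocks_round_up

  ! The form used in the first post; integer division truncates,
  ! so for n >= 1 this gives the same result as blocks_round_up.
  pure integer function blocks_minus_one(n, nthreads)
    integer, intent(in) :: n, nthreads
    blocks_minus_one = (n - 1) / nthreads + 1
  end function blocks_minus_one

end module grid_size
```

For the sizes in this thread, both functions return 36 blocks for 568 rows and 182 blocks for 2902 columns, so the grid covers 576 x 2912 threads and some threads fall outside the matrices.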

Hi Mat,

I found that the final assignment Cdev(i,j)=Cij should be guarded too. That is:

if( i<=NB .and. j<=L )then
  Cdev(i,j)=Cij
end if

Now all the errors are less than 1.E-6.

Feng
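
A possible explanation of why the unguarded store produced wrong values rather than a crash, assuming the kernel runs without bounds checking and Cdev(NB,L) is laid out contiguously in column-major order: a thread with i > NB still computes an address, and that address belongs to another element of Cdev. The sketch below is plain host-side Fortran; the names store_alias, linear_offset and aliased_element are hypothetical and only illustrate the index arithmetic.

```fortran
module store_alias
  implicit none
contains

  ! 0-based offset of element (i,j) in an array declared as C(nb,*),
  ! computed the way an unchecked column-major access would compute it.
  pure integer function linear_offset(i, j, nb)
    integer, intent(in) :: i, j, nb
    linear_offset = (i - 1) + (j - 1) * nb
  end function linear_offset

  ! In-bounds element (row,col) that shares its address with the
  ! possibly out-of-range index pair (i,j). Assumes i >= 1 and j >= 1.
  pure subroutine aliased_element(i, j, nb, row, col)
    integer, intent(in)  :: i, j, nb
    integer, intent(out) :: row, col
    integer :: off
    off = linear_offset(i, j, nb)
    row = mod(off, nb) + 1
    col = off / nb + 1
  end subroutine aliased_element

end module store_alias
```

With NB = 568, the excess threads i = 569..576 in column j map to rows 1..8 of column j+1. Their Cij is 0 because their rows of Asub were zero-filled, so without the guard they may overwrite correct results in the next column, depending on the order in which blocks finish. Threads with j > L would write past the end of the array altogether.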