 # Matrix multiply: A divide and conquer method

Hi guys,

I inherited a project and the previous programmer use this code for a matrix multiplication. The code is tricky for me to understand because I’m new with cuda fortran. The code performs better than the libraries that I tried, so I hope to keep using it. I also need to understand it so that the code can be optimized for my Tesla k20.

My problem is that I don’t understand how the matrix multiplication is being broken up and distributed across threads and blocks in this code. I’ve traced out a few loops, but I’m not sure how threads are progressing relative to matrix elements. It seems that two elements of the resulting matrix C are being calculated by a single thread? Can someone confirm this or point me in the right direction?

For reference: Size of A(N,M), B(M,L), C(N,L), N = powers of 2 from 1024-8192 and M = powers of 2 from 2048-16384, and L = 8. blkSz = 16

The kernel is called by

mmul_kernel2<<<dim3(N/blkSz,1,1),dim3(blkSz,8,1)>>>(…)

``````attributes(global) SUBROUTINE mmul_kernel2( A, B, C, N, M, L )
USE PRECISION
USE SCP_VARIABLES_CUDA
IMPLICIT NONE

REAL(RP), DEVICE :: A(N,M), B(M,L), C(N,L)
INTEGER, VALUE :: N, M, L
INTEGER :: i, j, kb, k, kk
INTEGER :: tx, ty
REAL(RP), SHARED :: Asub1(blkSz,blkSz), Bsub(blkSz,8), Asub2(blkSz,blkSz)
REAL(RP) :: Cij, Cij2

i = (blockidx%x-1) * blkSz * 2  + tx
j = (blockidx%y-1) * 8 + ty

Cij  = 0._rp
Cij2 = 0._rp

IF ( (i<=N).AND.(j<=L) ) THEN

DO kb = 1, M, blkSz

DO kk = 1, blkSz/8
Asub1(tx,ty+(kk-1)*8) = A(i      ,kb+ty+(kk-1)*8-1)
Asub2(tx,ty+(kk-1)*8) = A(i+blkSz,kb+ty+(kk-1)*8-1)
END DO
Bsub(tx,ty)   = B(kb+tx-1,j)

DO k = 1,blkSz
Cij  = Cij  + Asub1(tx,k) * Bsub(k,ty)
Cij2 = Cij2 + Asub2(tx,k) * Bsub(k,ty)
ENDDO

ENDDO

C(i      ,j) = Cij
C(i+blkSz,j) = Cij2
END IF
RETURN

END SUBROUTINE mmul_kernel2
``````

Wow! CUDA Fortran! It exists!!

My understanding is that two elements of C are computed by each thread, not 8 threads. Thread block size is 16-by-8 and each thread block computes 32-by-8 block in C. First it loads 32x16 block from A, then 16-by-8 block from B, multiplies them and loops back to fetch the next blocks. When all blocks are processed, it writes the resulting 32-by-8 block in C to memory.

What I don’t understand is the grid size. I think it should be two times smaller, i.e. N/blkSz/2, not N/blkSz, because you compute 2 outputs per thread.

Thanks for your help. I was able to confirm what you said by changing around the code. Also, you were right about the grid size being too big. Thanks again!