Matrix multiply: A divide and conquer method

Hi guys,

I inherited a project and the previous programmer use this code for a matrix multiplication. The code is tricky for me to understand because I’m new with cuda fortran. The code performs better than the libraries that I tried, so I hope to keep using it. I also need to understand it so that the code can be optimized for my Tesla k20.

My problem is that I don’t understand how the matrix multiplication is being broken up and distributed across threads and blocks in this code. I’ve traced out a few loops, but I’m not sure how threads are progressing relative to matrix elements. It seems that two elements of the resulting matrix C are being calculated by a single thread? Can someone confirm this or point me in the right direction?

For reference: Size of A(N,M), B(M,L), C(N,L), N = powers of 2 from 1024-8192 and M = powers of 2 from 2048-16384, and L = 8. blkSz = 16

The kernel is called by


attributes(global) SUBROUTINE mmul_kernel2( A, B, C, N, M, L )
    REAL(RP), DEVICE :: A(N,M), B(M,L), C(N,L) 
    INTEGER, VALUE :: N, M, L 
    INTEGER :: i, j, kb, k, kk
    INTEGER :: tx, ty 
    REAL(RP), SHARED :: Asub1(blkSz,blkSz), Bsub(blkSz,8), Asub2(blkSz,blkSz)
    REAL(RP) :: Cij, Cij2
    tx = threadidx%x 
    ty = threadidx%y 
    i = (blockidx%x-1) * blkSz * 2  + tx 
    j = (blockidx%y-1) * 8 + ty 
    Cij  = 0._rp
    Cij2 = 0._rp
    IF ( (i<=N).AND.(j<=L) ) THEN
       DO kb = 1, M, blkSz 
          DO kk = 1, blkSz/8
             Asub1(tx,ty+(kk-1)*8) = A(i      ,kb+ty+(kk-1)*8-1)
             Asub2(tx,ty+(kk-1)*8) = A(i+blkSz,kb+ty+(kk-1)*8-1)
          END DO
          Bsub(tx,ty)   = B(kb+tx-1,j) 
          CALL syncthreads() 
          DO k = 1,blkSz
             Cij  = Cij  + Asub1(tx,k) * Bsub(k,ty) 
             Cij2 = Cij2 + Asub2(tx,k) * Bsub(k,ty)
          CALL syncthreads() 
       C(i      ,j) = Cij
       C(i+blkSz,j) = Cij2
    END IF
  END SUBROUTINE mmul_kernel2

Wow! CUDA Fortran! It exists!!

My understanding is that two elements of C are computed by each thread, not 8 threads. Thread block size is 16-by-8 and each thread block computes 32-by-8 block in C. First it loads 32x16 block from A, then 16-by-8 block from B, multiplies them and loops back to fetch the next blocks. When all blocks are processed, it writes the resulting 32-by-8 block in C to memory.

What I don’t understand is the grid size. I think it should be two times smaller, i.e. N/blkSz/2, not N/blkSz, because you compute 2 outputs per thread.

Thanks for your help. I was able to confirm what you said by changing around the code. Also, you were right about the grid size being too big. Thanks again!