I inherited a project in which the previous programmer used this code for matrix multiplication. The code is tricky for me to understand because I'm new to CUDA Fortran, but it performs better than the libraries I have tried, so I would like to keep using it. I also need to understand it so that I can optimize it for my Tesla K20.
My problem is that I don't understand how the matrix multiplication is broken up and distributed across threads and blocks in this code. I've traced through a few loops, but I'm not sure how threads progress relative to matrix elements. It seems that two elements of the resulting matrix C are calculated by a single thread? Can someone confirm this or point me in the right direction?
For reference: A is (N,M), B is (M,L), C is (N,L), with N a power of 2 from 1024 to 8192, M a power of 2 from 2048 to 16384, L = 8, and blkSz = 16.
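To see why two elements per thread is plausible, here is a quick back-of-the-envelope check in Python. The launch configuration is not shown in the post, so `dim_block = (blkSz, 8)` and the grid below are assumptions read off the kernel's index arithmetic (`i = (blockidx%x-1)*blkSz*2 + tx`, `j = (blockidx%y-1)*8 + ty`):

```python
# Inferred launch geometry -- these dimensions are my assumption,
# reconstructed from the kernel's indexing, not from the actual call.
blkSz = 16
N, M, L = 1024, 2048, 8                    # smallest sizes given in the post

dim_block = (blkSz, 8)                     # threads per block (assumed)
dim_grid = (N // (2 * blkSz), L // 8)      # each block covers 2*blkSz rows, 8 cols

threads = dim_grid[0] * dim_grid[1] * dim_block[0] * dim_block[1]
print(dim_grid, threads, (N * L) // threads)
# -> (32, 1) 4096 2   i.e. two elements of C per thread
```

The count of C elements divided by the thread count comes out to exactly 2, consistent with each thread writing `C(i,j)` and `C(i+blkSz,j)`.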
The kernel is:
attributes(global) SUBROUTINE mmul_kernel2( A, B, C, N, M, L )
   USE PRECISION
   USE SCP_VARIABLES_CUDA
   IMPLICIT NONE
   REAL(RP), DEVICE :: A(N,M), B(M,L), C(N,L)
   INTEGER, VALUE :: N, M, L
   INTEGER :: i, j, kb, k, kk
   INTEGER :: tx, ty
   REAL(RP), SHARED :: Asub1(blkSz,blkSz), Bsub(blkSz,8), Asub2(blkSz,blkSz)
   REAL(RP) :: Cij, Cij2

   tx = threadidx%x
   ty = threadidx%y
   ! Each block covers 2*blkSz rows of C (two row tiles) and 8 columns,
   ! so every thread is responsible for the two rows i and i+blkSz.
   i = (blockidx%x-1) * blkSz * 2 + tx
   j = (blockidx%y-1) * 8 + ty
   Cij  = 0._rp
   Cij2 = 0._rp
   IF ( (i<=N).AND.(j<=L) ) THEN
      ! March along the inner (K) dimension in tiles of width blkSz.
      DO kb = 1, M, blkSz
         ! Cooperatively stage two blkSz x blkSz tiles of A into shared
         ! memory: Asub1 for row i, Asub2 for row i+blkSz. With only 8
         ! threads in y, each thread loads blkSz/8 columns per tile.
         DO kk = 1, blkSz/8
            Asub1(tx,ty+(kk-1)*8) = A(i      ,kb+ty+(kk-1)*8-1)
            Asub2(tx,ty+(kk-1)*8) = A(i+blkSz,kb+ty+(kk-1)*8-1)
         END DO
         ! Stage a blkSz x 8 tile of B.
         Bsub(tx,ty) = B(kb+tx-1,j)
         CALL syncthreads()
         ! Accumulate the partial dot products for both output rows.
         DO k = 1, blkSz
            Cij  = Cij  + Asub1(tx,k) * Bsub(k,ty)
            Cij2 = Cij2 + Asub2(tx,k) * Bsub(k,ty)
         ENDDO
         CALL syncthreads()
      ENDDO
      ! Each thread writes two elements of C, blkSz rows apart.
      C(i      ,j) = Cij
      C(i+blkSz,j) = Cij2
   END IF
   RETURN
END SUBROUTINE mmul_kernel2
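To confirm the mapping, the kernel's index arithmetic can be re-enacted serially. The sketch below is an illustration in Python, not the author's code: it loops over the assumed grid and block, computes `i` and `j` exactly as the kernel does (1-based, like Fortran), and collapses the shared-memory tiling into direct dot products, which gives the same result since the tiles only stage data.

```python
# Serial re-enactment of the kernel's thread-to-element mapping.
# The grid/block shapes are assumptions inferred from the indexing.
blkSz = 16
N, M, L = 32, 16, 8          # small sizes obeying the same divisibility

A = [[(r + 1) * 0.5 + c for c in range(M)] for r in range(N)]
B = [[(r - c) * 0.25 for c in range(L)] for r in range(M)]
C = [[None] * L for _ in range(N)]

for bx in range(N // (2 * blkSz)):        # blockidx%x - 1
    for by in range(L // 8):              # blockidx%y - 1
        for tx in range(1, blkSz + 1):    # threadidx%x
            for ty in range(1, 8 + 1):    # threadidx%y
                i = bx * blkSz * 2 + tx   # first output row (1-based)
                j = by * 8 + ty           # output column (1-based)
                # The tiled loads plus the k-loop amount to two dot products:
                Cij  = sum(A[i - 1][k] * B[k][j - 1] for k in range(M))
                Cij2 = sum(A[i + blkSz - 1][k] * B[k][j - 1] for k in range(M))
                C[i - 1][j - 1] = Cij             # first element this thread writes
                C[i + blkSz - 1][j - 1] = Cij2    # second element, blkSz rows below

# Every element of C is written exactly once and matches a plain matmul.
ref = [[sum(A[r][k] * B[k][c] for k in range(M)) for c in range(L)]
       for r in range(N)]
assert C == ref
print("each thread writes 2 elements of C; result matches A@B")
```

So yes, under these assumed launch dimensions each thread computes two elements of C, in rows `i` and `i+blkSz` of the same column, which halves the number of shared-memory loads of B per output element.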