Hi guys,

I inherited a project, and the previous programmer used this code for a matrix multiplication. The code is tricky for me to understand because I'm new to CUDA Fortran. It performs better than the libraries I tried, so I'd like to keep using it, but I also need to understand it so I can optimize it for my Tesla K20.

My problem is that I don't understand how the matrix multiplication is broken up and distributed across threads and blocks in this code. I've traced a few loops, but I'm not sure how threads progress relative to matrix elements. It looks like each thread computes two elements of the result matrix C? Can someone confirm this or point me in the right direction?

For reference: the sizes are A(N,M), B(M,L), C(N,L), with N a power of 2 from 1024 to 8192, M a power of 2 from 2048 to 16384, L = 8, and blkSz = 16.

The kernel is launched with

mmul_kernel2<<<dim3(N/blkSz,1,1),dim3(blkSz,8,1)>>>(…)
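To make the geometry concrete, here is a quick Python check of the launch configuration for one example size from the ranges above (N = 1024 is my pick for illustration; the dims come straight from the chevron syntax):

```python
# Launch geometry from the post: mmul_kernel2<<<dim3(N/blkSz,1,1), dim3(blkSz,8,1)>>>
N, L, blkSz = 1024, 8, 16           # example sizes from the ranges given above

grid  = (N // blkSz, 1, 1)          # blocks in the grid
block = (blkSz, 8, 1)               # threads per block
threads_total = grid[0] * block[0] * block[1]

print(grid)           # -> (64, 1, 1)
print(block)          # -> (16, 8, 1)
print(threads_total)  # -> 8192, which equals N*L, the number of C elements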

```
attributes(global) SUBROUTINE mmul_kernel2( A, B, C, N, M, L )
  USE PRECISION
  USE SCP_VARIABLES_CUDA
  IMPLICIT NONE
  REAL(RP), DEVICE :: A(N,M), B(M,L), C(N,L)
  INTEGER, VALUE :: N, M, L
  INTEGER :: i, j, kb, k, kk
  INTEGER :: tx, ty
  ! shared tiles: two blkSz x blkSz tiles of A, one blkSz x 8 tile of B
  REAL(RP), SHARED :: Asub1(blkSz,blkSz), Bsub(blkSz,8), Asub2(blkSz,blkSz)
  REAL(RP) :: Cij, Cij2

  tx = threadidx%x
  ty = threadidx%y
  ! note the factor of 2: each block spans two blkSz-wide row blocks of A/C
  i = (blockidx%x-1) * blkSz * 2 + tx
  j = (blockidx%y-1) * 8 + ty
  Cij  = 0._rp
  Cij2 = 0._rp

  IF ( (i<=N) .AND. (j<=L) ) THEN
    DO kb = 1, M, blkSz                ! march across A/B in blkSz-wide steps
      ! cooperatively load the two A tiles (each ty loads blkSz/8 columns)
      DO kk = 1, blkSz/8
        Asub1(tx,ty+(kk-1)*8) = A(i,       kb+ty+(kk-1)*8-1)
        Asub2(tx,ty+(kk-1)*8) = A(i+blkSz, kb+ty+(kk-1)*8-1)
      END DO
      Bsub(tx,ty) = B(kb+tx-1,j)
      CALL syncthreads()
      ! accumulate partial dot products for both C elements at once
      DO k = 1, blkSz
        Cij  = Cij  + Asub1(tx,k) * Bsub(k,ty)
        Cij2 = Cij2 + Asub2(tx,k) * Bsub(k,ty)
      END DO
      CALL syncthreads()
    END DO
    ! each thread writes two elements of C, blkSz rows apart
    C(i,       j) = Cij
    C(i+blkSz, j) = Cij2
  END IF
  RETURN
END SUBROUTINE mmul_kernel2
```
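To check my reading of the thread-to-element mapping (only the index arithmetic, not the products themselves), I mocked it up in Python. blkSz = 16 and block dims as above; N is shrunk to 64 so the loops stay small. If this is right, every in-range thread stores exactly two elements of C, blkSz rows apart, and each element is written exactly once:

```python
# Emulate the index arithmetic of mmul_kernel2 (my reading of the code above).
blkSz, L = 16, 8
N = 64  # small power of 2 for illustration; the real runs use 1024..8192

written = {}  # (i, j) -> list of (block, thread) tuples that wrote it
for bx in range(1, N // blkSz + 1):      # blockidx%x = 1 .. N/blkSz
    for tx in range(1, blkSz + 1):       # threadidx%x = 1 .. blkSz
        for ty in range(1, L + 1):       # threadidx%y = 1 .. 8
            i = (bx - 1) * blkSz * 2 + tx
            j = ty                       # blockidx%y is always 1 for this grid
            if i <= N and j <= L:        # the kernel's guard
                # each surviving thread stores two C elements, blkSz rows apart
                written.setdefault((i, j), []).append((bx, tx, ty))
                written.setdefault((i + blkSz, j), []).append((bx, tx, ty))

print(len(written))                                 # -> 512, i.e. all N*L elements
print(all(len(w) == 1 for w in written.values()))   # -> True, no element written twice
```

Running this, all N\*L elements of C are covered exactly once, and only blocks with blockidx%x <= N/(2\*blkSz) do any work, which matches my suspicion that each thread handles two C elements. I'd still appreciate confirmation from someone who knows this pattern.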