Matrix Matrix Multiplication Kernel

Hi everyone

I have recently been going over the matrix matrix multiplication code from PGI but I have some trouble understanding how the kernel works.
The values for matrix dimensions as follows: N = 512, M = 512, L =1024

! start the module containing the matrix multiply kernel
module mmul_mod
    use cudafor

! mmul_kernel computes A*B into C where A is NxM, B is MxL, C is then NxL

    attributes(global) subroutine mmul_kernel( A, B, C, N, M, L )
       real,device :: A(N,M), B(M,L), C(N,L)
       integer, value :: N, M, L
       integer :: i, j, kb, k, tx, ty

! submatrices are declared to be in CUDA shared memory

       real, shared :: Asub(16,16), Bsub(16,16)

! the value of C(i,j) being computed, a temporary scalar

       real :: Cij

! Start execution, first get my thread indices

       tx = threadidx%x
       ty = threadidx%y

! This thread computes C(i,j) = sum(A(i,:) * B(:,j))

       i = (blockidx%x-1) * 16 + tx
       j = (blockidx%y-1) * 16 + ty

       Cij = 0.0

! Do the k loop in chunks of 16, the block size

       do kb = 1, M, 16

! Fill the submatrices; each of 16x16 threads in the thread block
! loads one element of Asub and Bsub

          Asub(tx,ty) = A(i,kb+ty-1)
          Bsub(tx,ty) = B(kb+tx-1,j)

! Wait until all elements are filled

          call syncthreads()

! Multiply the two submatrices; ! Each of the 16x16 threads accumulates the
! dot product for its element of C(i,j)

          do k = 1,16
             Cij = Cij + Asub(tx,k) * Bsub(k,ty)

! Synchronize to make sure all threads are done reading the submatrices before 
! overwriting them in the next iteration of the kb loop

          call syncthreads()


! Each of the 16x16 threads stores its element to the global C array

       C(i,j) = Cij

    end subroutine mmul_kernel

Although I have gone over the article that goes over the code I am not sure how the two 16x16 submatrices Asub and Bsub aid in the dot product of the row of A and the column of B. Also, I would like to understand more of how they came up with the equations for i and j from the thread index and the block index. I am sorry if this is too much of an easy question >_>




It’s only easy when you understand it (and I own the “easy question” record in this forum!). I think I can help you with the second bit (i, j and all); the first question (about tiling/submatrices) that’s sort of a big one. (If you have access to it, Kirk and Hwu’s “Programming Massively Parallel Processors” is a good read on the matrix multiply kernel.)

But, before you can tackle that, knowing how i and j are calculated is a need-to-know. I also puzzled over this for a time, and I wrote down my explanation of it (in 1-dimension) in this post. For 2-d exploring you can try experimenting with the code I posted in this post. I found that I had to write small programs that let me do this sort of “print out blockidx%x” or various quantities to figure it out. In fact, I still use them (a 1-d and 2-d version) to make sure I’m accessing what I’m thinking I am.

Those may not answer all (or any) of your questions, but maybe it’ll help.

Thank you!

I will start trying the codes right away to better understand the workings of the kernel’s threads and blocks. As for the book, was it very useful book? I am trying to specialize on Fermi architecture and Cuda Fortran so I might buy the book.


Well, the book is, of course, a CUDA C book in example, and focused on Tesla due to when it was written (there is a “future outlook” chapter which looks ahead to Fermi), but I think it’s a pretty good book. It’s sort of Hwu’s UIUC class in a condensed, more readable form. It does have some pretty interesting real-world case studies of the thought process behind CUDA-izing a problem that I like. (I do wish it used Fortran for the examples, but I’m a bit biased there.)