Problem in fast matrix multiplication

Hey guys, I am trying to implement the “fast” matrix multiplication described in this PowerPoint presentation:

The slides include a chunk of code that I am trying to implement, but at one point they use:

comp16(b, &ashare[k][0],c)

and I have no idea what it does. I tried googling for it, but with no luck. Can anyone shed any light on it? And if anyone has implemented this method for matrix multiplication, is it actually faster?

I am struggling big time, so I hope you can help me out a bit.

Thanks again.

This possibly originates from my matrix multiply code published here: Check calls to saxpy() there.
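
To make that concrete, here is a minimal sketch of what a saxpy-style helper like comp16 presumably does, written as plain C so it can be tried on the host. The signature and the loop body are assumptions inferred from the call comp16(b, &ashare[k][0], c) and from the name saxpy (c = c + b*a); they are not copied from the slides. On the GPU this would be a __device__ function, typically with the loop unrolled.

```c
#include <stddef.h>

/* Hypothetical stand-in for comp16 / saxpy (the signature is an assumption).
 * Multiplies 16 consecutive floats starting at `a` by the scalar `b`
 * and accumulates the products into the 16-element array `c`:
 *     c[i] += b * a[i]   for i = 0..15
 */
static void comp16(float b, const float *a, float *c)
{
    for (size_t i = 0; i < 16; ++i)
        c[i] += b * a[i];
}
```

Under that reading, each thread keeps 16 partial sums of the result in c, and every call adds the contribution of one value b times one row of the shared tile ashare[k].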

That code was optimized for G80 and was faster than CUBLAS at the time. I don’t think it is still faster than CUBLAS on Fermi and Kepler.